⚡️BytevalKit_LLM: One-Stop LLM Evaluation Tool

Overview

BytevalKit_LLM is a comprehensive evaluation framework designed to assess the performance of Large Language Models (LLMs). It enables full-process customization of evaluation tasks through declarative YAML configuration, supports multiple model deployment methods (API, Huggingface, vLLM), and provides a complete “inference-evaluation-scoring” automated workflow.
The framework adopts a “Configuration-as-Code” design philosophy, abstracting key components such as model architecture, inference logic, evaluation methods, and metric calculations into configurable items. This allows differences between evaluation tasks to be implemented by modifying YAML configuration files rather than repetitive code development.
Key Features
🚀 Multi-Mode Support
API Mode: OpenAI API-compatible LLM inference service
Local Inference: Supports Transformer-based model loading and inference
vLLM Acceleration: Integration with vLLM high-performance inference engine
Batch Processing: Support for batch inference and concurrent evaluation
🎯 Flexible Evaluation Methods
Rule-based Evaluation: Supports custom rules for judging model outputs
LLM-as-Judge: Support for using LLMs as evaluators
CoT Evaluation: Support for Chain-of-Thought reasoning process evaluation
Multi-dimensional Assessment: Support for dimensions like “user requirement satisfaction, clarity, completeness, factual correctness, strict accuracy”
📊 Data Format Support
⚡ Efficient Execution
Directory Structure

BytevalKit_LLM/
├── main.py # Main entry point
├── models/ # Model interfaces
│ ├── api_model.py # API model interface
│ ├── hf_model.py # Huggingface models
│ └── vllm_model.py # vLLM models
├── execute/ # Execution engine
│ ├── infer.py # Inference module
│ ├── judge.py # Evaluation module
│ └── task.py # Task management
├── demo/ # Example configs and data
├── dataset/ # Example datasets
└── *.yaml # Example configuration files
Benchmark
Note: To demonstrate that the framework is applicable to open-source evaluation sets, we validated it using open-source models on selected benchmarks, with all evaluation logic based on LLM judging.
The following are framework evaluation results only; models are listed in no particular order.
| Dataset | Metric | Qwen3-32B | Qwen3-14B | Qwen3-235B-A22B | DeepSeek-V3-671B | Qwen1.7B-Instruct | Qwen3-8B | Qwen2.5-1.5B | Qwen2.5-7B |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AIME24 | acc | 33.33 | 26.67 | 46.67 | 33.3 | 16.67 | 23.3 | 3.33 | 13.3 |
| AIME25 | acc | 20 | 23.33 | 27.5 | 25.83 | 7.5 | 8.33 | 0 | 15.42 |
| C-SimpleQA | acc | 40.12 | 37.21 | 54.39 | 58.79 | 13.67 | 31.85 | 12.56 | 23.43 |
| MATH-500 | acc | 75.4 | 75.8 | 87.8 | 71.6 | 71.2 | 83 | 55.8 | 77.8 |
| bbh | acc | 87.39 | 84.59 | 88.81 | 87.01 | 55 | 80.3 | 36.2 | 64.3 |
| ceval-gen | acc | 84.77 | 82.8 | 85.78 | 90 | 53.8 | 76.3 | 54.23 | 73.99 |
| cmmlu-gen | acc | 73.33 | 77.9 | 82.49 | 79.2 | 53.13 | 75.82 | 66.28 | 73.73 |
| hellaswag-gen | acc | 81.1 | 54.4 | 84.48 | 82.6 | 61 | 70.3 | 56.25 | 69.6 |
| GPQA-Diamond | acc | 54 | 54.55 | 62.63 | 48.48 | 26.77 | 40.4 | 30.3 | 34.4 |
| MMLU-Pro | acc | 72.86 | 67.14 | 78.58 | 78.57 | 41.43 | 74.3 | 32.14 | 62.85 |
Dataset Acknowledgments
We thank the following open-source datasets for their contributions. Formatted versions are available in the demo/dataset/ directory:
AIME2024
C-SimpleQA
MATH-500
bbh
ceval-gen
cmmlu-gen
hellaswag
GPQA-Diamond
MMLU-Pro
Custom Extensions
Adding API Models
Modify the models/api_model.py file to add your API interface implementation.
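As a sketch of what such an implementation might look like: the class name, constructor arguments, and generate method below are illustrative assumptions, not BytevalKit's actual interface. Only the OpenAI-compatible chat-completions payload shape follows the API mode described above, and the injectable transport is purely to make the sketch testable without a network.

```python
# Hypothetical sketch -- names and signatures are assumptions, not BytevalKit's real API.
import json
import urllib.request


class MyAPIModel:
    """Minimal OpenAI-compatible chat-completions client."""

    def __init__(self, base_url, api_key, model_name, transport=None):
        self.base_url = base_url.rstrip("/")
        self.api_key = api_key
        self.model_name = model_name
        # `transport` lets callers (and tests) substitute the HTTP call.
        self._transport = transport or self._http_post

    def _http_post(self, url, payload):
        req = urllib.request.Request(
            url,
            data=json.dumps(payload).encode("utf-8"),
            headers={
                "Content-Type": "application/json",
                "Authorization": f"Bearer {self.api_key}",
            },
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read().decode("utf-8"))

    def generate(self, prompt, **params):
        """Send one prompt and return the first completion's text."""
        payload = {
            "model": self.model_name,
            "messages": [{"role": "user", "content": prompt}],
            **params,
        }
        resp = self._transport(f"{self.base_url}/chat/completions", payload)
        return resp["choices"][0]["message"]["content"]
```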
Custom LLM Judge
Modify execute/judge.py and adjust the llm_judge function to implement your evaluation logic.
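A minimal sketch of what a custom judge could look like, assuming the judge model is exposed as a callable mapping a prompt string to a reply string. The function signature, prompt template, and score range here are illustrative assumptions, not the framework's actual llm_judge contract:

```python
# Illustrative only: the real llm_judge signature in execute/judge.py may differ.
import re

JUDGE_PROMPT = (
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Model answer: {prediction}\n"
    "Rate the model answer from 0 to 10. Reply as 'Score: <n>'."
)


def llm_judge(sample, judge_model):
    """Ask a judge model to score one sample; returns a float in [0, 10]."""
    prompt = JUDGE_PROMPT.format(
        question=sample["question"],
        reference=sample["reference"],
        prediction=sample["prediction"],
    )
    reply = judge_model(prompt)  # judge_model: Callable[[str], str]
    match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", reply)
    if not match:
        return 0.0  # an unparseable judgment counts as failure
    return min(max(float(match.group(1)), 0.0), 10.0)  # clamp to the valid range
```

Parsing the judge's free-form reply defensively (and clamping out-of-range scores) matters in practice, since judge models do not always follow the requested format.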
Concurrency Optimization
Adding concurrent request capabilities to the evaluation module can significantly improve evaluation speed.
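As a sketch, assuming judge calls are I/O-bound API requests, a thread pool is one straightforward way to add such concurrency (the function name and max_workers default are illustrative choices):

```python
# Sketch of concurrent judging with a thread pool; assumes judge calls are I/O-bound.
from concurrent.futures import ThreadPoolExecutor


def judge_concurrently(samples, judge_fn, max_workers=8):
    """Score samples in parallel, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(judge_fn, samples))
```

Note that pool.map preserves input order, so the returned scores still line up with the input samples.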
Contributing
This project is developed by the BytevalKit team. Development members:
{Peijie Bu, Yan Qiu, Shenwei Huang}, Yaling Mou, Xianxian Ma,
Ming Jiang, Haizhen Liao, Jingwei Sun, Binbin Xing
{*} Equal Contributions.
We also thank the Bytedance Douyin Content Team for their support:
Xusheng Wang, Fubang Zhao, Jianhui Pang, Mingsi Ye, Jie Tang, Kang Yang, Xiaopu Wang, Shuang Zeng
Fei Jiang, Ying Ju, Chuang Fan, Chuwei Luo, Qingsong Liu, Xu Chen
Yi Lin, Junfeng Yao, Chao Feng, Jiao Ran
We also thank the product design and Byteval platform teams for their support:
Ziyu Shi, Zhao Lin, Yang Li, Jing Yang, Zhen Wang, Guojun Ma
And the AI platform team:
Huiyu Yu, Lin Dong, Yong Zhang
We welcome contributions of all kinds! Please check our Contributing Guide for details.
Special thanks to OpenCompass for their open-source framework, which provided valuable design insights.
Citation
If you use BytevalKit_LLM in your research, please cite:
Installation
Requirements
Installation Steps
Quick Start
Basic Usage
Example Configurations
We provide multiple example configurations in the demo/ directory:

demo/Qwen2.5-1.5B-Instruct.yaml - Local model inference example
demo/cot_model_eval.yaml - CoT chain-of-thought evaluation example
demo/single_model_vllm_inference_eval.yaml - vLLM inference example
demo/multi_task_gpu_invocation.yaml - Multi-task parallel example

Configuration
YAML Configuration Structure
Configuration files contain three main sections:
1. DEFAULT - Task Configuration
2. DATASET - Dataset Configuration
3. MODEL - Model Configuration
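To make the three sections concrete, a minimal hypothetical configuration might look like the sketch below. Every field name here is an assumption for illustration only; consult the shipped demo/*.yaml files for the authoritative schema.

```yaml
# Hypothetical sketch -- field names are assumptions; see demo/*.yaml for the real schema.
DEFAULT:
  task_name: demo_eval
  output_dir: ./outputs

DATASET:
  - name: AIME24
    path: dataset/aime24.jsonl
    eval_method: llm_judge

MODEL:
  - name: Qwen2.5-1.5B-Instruct
    type: huggingface        # api | huggingface | vllm
    model_path: ./models/Qwen2.5-1.5B-Instruct
```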
System Architecture
Execution Flow
```mermaid
sequenceDiagram
    participant Config System
    participant Task Scheduler
    participant Model Inference
    participant Judge Service
    participant Metric Calculator
    Config System->>Task Scheduler: Load Configuration
    Task Scheduler->>Model Inference: Assign Tasks
    Model Inference->>Judge Service: Send Inference Results
    Judge Service-->>Metric Calculator: Return Scores
    Metric Calculator->>Output System: Generate Report
```

Architecture Diagram
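As a toy illustration of this execution flow (not BytevalKit's actual internals), the stages can be sketched as pluggable functions wired together in the order the diagram shows:

```python
# Toy sketch of the "inference -> judge -> metric" flow; the stage
# functions are illustrative stand-ins, not the framework's real modules.
def run_pipeline(samples, infer_fn, judge_fn):
    """samples: prompts; infer_fn/judge_fn: inference and judging stages."""
    predictions = [infer_fn(s) for s in samples]       # Model Inference
    scores = [judge_fn(p) for p in predictions]        # Judge Service
    return {"mean_score": sum(scores) / len(scores)}   # Metric Calculator
```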
License
BytevalKit-LLM is licensed under the Apache License 2.0.
Contact Us
If you have any questions, feel free to contact us at: BytevalKit@bytedance.com