⚡️BytevalKit_LLM: One-Stop LLM Evaluation Tool

Overview | Key Features | Installation | Quick Start | Configuration | System Architecture | Benchmark | Contributing | License

Overview

BytevalKit_LLM is a comprehensive evaluation framework designed to assess the performance of Large Language Models (LLMs). It enables full-process customization of evaluation tasks through declarative YAML configuration, supports multiple model deployment methods (API, Huggingface, vLLM), and provides a complete “inference-evaluation-scoring” automated workflow.

The framework adopts a “Configuration-as-Code” design philosophy, abstracting key components such as model architecture, inference logic, evaluation methods, and metric calculations into configurable items. This allows differences between evaluation tasks to be implemented by modifying YAML configuration files rather than repetitive code development.

Key Features

🚀 Multi-Mode Support

  • API Mode: OpenAI API-compatible LLM inference service
  • Local Inference: Loads and runs models locally through Hugging Face Transformers
  • vLLM Acceleration: Integration with vLLM high-performance inference engine
  • Batch Processing: Support for batch inference and concurrent evaluation

🎯 Flexible Evaluation Methods

  • Rule-based Evaluation: Supports custom rule-based evaluation for result judgment
  • LLM-as-Judge: Support for using LLMs as evaluators
  • CoT Evaluation: Support for Chain-of-Thought reasoning process evaluation
  • Multi-dimensional Assessment: Support for dimensions like “user requirement satisfaction, clarity, completeness, factual correctness, strict accuracy”

📊 Data Format Support

  • JSON / JSONL format
  • CSV format
  • Custom data preprocessing (via exec_code)
  • Direct import from Hugging Face datasets
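
Whatever the source format, records end up as a list of dictionaries before evaluation. A minimal sketch of that normalization for the file-based formats, assuming a hypothetical helper `load_records` that dispatches on the file extension (it is not part of BytevalKit_LLM, and the framework's own loader may differ):

```python
import csv
import json
from pathlib import Path


def load_records(path):
    """Load a dataset file into a list of dicts (hypothetical helper)."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".json":
        # Assumption: a JSON file holds a top-level list of records.
        with path.open(encoding="utf-8") as f:
            return json.load(f)
    if suffix == ".jsonl":
        # JSONL: one JSON object per non-empty line.
        with path.open(encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]
    if suffix == ".csv":
        # CSV: the header row supplies the field names; every value is a string.
        with path.open(encoding="utf-8", newline="") as f:
            return list(csv.DictReader(f))
    raise ValueError(f"Unsupported dataset format: {suffix}")
```

Note that CSV values arrive as strings, so numeric answers may need converting in a preprocessing step such as exec_code.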

⚡ Efficient Execution

  • Intelligent GPU resource allocation
  • Multi-task parallel execution
  • Support for checkpoint resumption
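
Checkpoint resumption can be pictured as appending each finished prediction to a JSONL file and skipping those ids on the next run. The sketch below illustrates only that idea; `run_with_resume`, the `id`/`input`/`prediction` field names, and the file layout are assumptions, not the framework's actual mechanism:

```python
import json
import os


def run_with_resume(items, infer_fn, out_path):
    """Run infer_fn over items, skipping ids already saved in out_path (JSONL).

    Returns the number of results found from a previous run.
    """
    done = set()
    if os.path.exists(out_path):
        with open(out_path, encoding="utf-8") as f:
            done = {json.loads(line)["id"] for line in f if line.strip()}
    with open(out_path, "a", encoding="utf-8") as f:
        for item in items:
            if item["id"] in done:
                continue  # already finished in a previous run
            record = {"id": item["id"], "prediction": infer_fn(item["input"])}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            f.flush()  # persist each result so an interruption loses little work
    return len(done)
```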

Installation

Requirements

  • Python 3.9+
  • CUDA 11.8+ (required for GPU inference)

Installation Steps

# Clone repository
git clone https://github.com/BytevalKit/BytevalKit_LLM.git
cd BytevalKit_LLM

# Install dependencies
pip install -r requirements.txt

Quick Start

Basic Usage

# Run evaluation with default configuration
python3 main.py --yaml_path demo/demo.yaml

# Run CMMLU evaluation
python3 main.py --yaml_path demo/cmmlu.yaml

# Run multi-task evaluation
python3 main.py --yaml_path demo/multi_task_gpu_invocation.yaml

Example Configurations

We provide multiple example configurations in the demo/ directory:

  • demo/Qwen2.5-1.5B-Instruct.yaml - Local model inference example
  • demo/cot_model_eval.yaml - Chain-of-thought (CoT) evaluation example
  • demo/single_model_vllm_inference_eval.yaml - vLLM inference example
  • demo/multi_task_gpu_invocation.yaml - Multi-task parallel example

Configuration

YAML Configuration Structure

Configuration files contain three main sections:

1. DEFAULT - Task Configuration

DEFAULT:
  work_dir: /path/to/output        # Output directory
  task_name: my_eval_task          # Task name
  need_judge: true                 # Whether evaluation is needed
  need_calculate: true             # Whether to calculate final scores
  need_cot_judge: true             # Whether to evaluate the chain-of-thought (CoT)
  use_vllm: true                   # Whether to use vLLM
  batch_size: 10                   # Batch size
  judge_workers: 10                # Number of judge workers
  vllm_cfg:
    max_seq_len: 16384
    max_out_len: 2048

2. DATASET - Dataset Configuration

DATASET:
  my_dataset:
    name: dataset_display_name
    path: /path/to/dataset.json
    question_key: input           # Question field name (default: input)
    answer_key: target           # Answer field name (default: target)
    judge_type: rule_comparison  # Judge type: rule_comparison or llm_judge
    
    # Custom data preprocessing
    exec_code: |+
      question = item['question'] + " Options: " + str(item['options'])
      answer = item['answer']
    
    # Custom evaluation rules (for rule_comparison)
    judge_code: |+
      import re
      # Default to incorrect when no answer letter can be extracted
      judge_result = False
      match = re.search(r'The answer is\s*([A-D])', item.prediction)
      if match:
          judge_result = match.group(1) == item.answer
    
    # LLM judge prompt (for llm_judge)
    judge_prompt: "Please evaluate the quality of the following response..."
    
    # Evaluation dimension filtering
    filter_key: ["clarity", "completeness"]  # Keep only specified dimensions
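
To make the contract of exec_code and judge_code concrete: each snippet reads a variable named item and leaves its results in named variables (question and answer, or judge_result). The sketch below shows one plausible way such snippets could be run with Python's built-in `exec`. The helper `run_snippet` and the sample record are hypothetical, and the framework's real execution path may differ:

```python
from types import SimpleNamespace

EXEC_CODE = """
question = item['question'] + " Options: " + str(item['options'])
answer = item['answer']
"""

JUDGE_CODE = r"""
import re
judge_result = False  # stays False when no answer letter is found
match = re.search(r'The answer is\s*([A-D])', item.prediction)
if match:
    judge_result = match.group(1) == item.answer
"""


def run_snippet(code, item, outputs):
    """Execute a config snippet with `item` in scope and return the named variables."""
    scope = {"item": item}
    exec(code, scope)  # only run snippets that come from trusted configuration files
    return {name: scope.get(name) for name in outputs}


# Hypothetical raw dataset record.
raw = {"question": "2 + 2 = ?", "options": ["A. 3", "B. 4"], "answer": "B"}
prepared = run_snippet(EXEC_CODE, raw, ["question", "answer"])

# judge_code uses attribute access, so the record is wrapped in a namespace here.
record = SimpleNamespace(prediction="The answer is B", answer=prepared["answer"])
verdict = run_snippet(JUDGE_CODE, record, ["judge_result"])
```

Because the snippets are executed as code, configuration files should be treated with the same care as source files.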

3. MODEL - Model Configuration

MODEL:
  # API model example
  gpt4:
    type: api
    name: gpt-4
    api_key: ${API_KEY}
    
  # Local model example
  qwen2_5:
    type: vllm  # Or unspecified, defaults to transformers
    name: Qwen2.5-7B-Instruct
    path: /path/to/model
    model_kwargs:
      torch_dtype: bfloat16
      trust_remote_code: true
    meta_template:
      SYSTEM_begin: "<|im_start|>system\n"
      SYSTEM_end: "<|im_end|>\n"
      HUMAN_begin: "<|im_start|>user\n"
      HUMAN_end: "<|im_end|>\n"
      BOT_begin: "<|im_start|>assistant\n"
      BOT_end: "<|im_end|>\n"
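
The meta_template keys describe how each conversation turn is wrapped before it reaches a local model. A minimal sketch of that wrapping, assuming turns are given as (role, text) pairs and that the assistant turn is left open for generation; `render_prompt` is a hypothetical helper, not the framework's function:

```python
META_TEMPLATE = {
    "SYSTEM_begin": "<|im_start|>system\n",
    "SYSTEM_end": "<|im_end|>\n",
    "HUMAN_begin": "<|im_start|>user\n",
    "HUMAN_end": "<|im_end|>\n",
    "BOT_begin": "<|im_start|>assistant\n",
    "BOT_end": "<|im_end|>\n",
}


def render_prompt(messages, template):
    """Wrap each (role, text) turn in its begin/end markers, then open the BOT turn."""
    parts = []
    for role, text in messages:
        parts.append(template[f"{role}_begin"] + text + template[f"{role}_end"])
    parts.append(template["BOT_begin"])  # the model continues from here
    return "".join(parts)


prompt = render_prompt(
    [("SYSTEM", "You are a helpful assistant."), ("HUMAN", "What is 2 + 2?")],
    META_TEMPLATE,
)
```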

System Architecture

Execution Flow

sequenceDiagram
    participant Config System
    participant Task Scheduler
    participant Model Inference
    participant Judge Service
    participant Metric Calculator
    
    Config System->>Task Scheduler: Load Configuration
    Task Scheduler->>Model Inference: Assign Tasks
    Model Inference->>Judge Service: Send Inference Results
    Judge Service-->>Metric Calculator: Return Scores
    Metric Calculator->>Output System: Generate Report
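
The same flow can be sketched as a single function. This is an illustration only: `run_pipeline`, the in-memory `rows` key, and the callback signatures are assumptions, while need_judge and need_calculate are the DEFAULT options described in the Configuration section:

```python
def run_pipeline(config, infer_fn, judge_fn):
    """Follow the flow: configuration -> inference -> judging -> metrics -> report."""
    default = config["DEFAULT"]
    report = {}
    for ds_name, dataset in config["DATASET"].items():
        rows = dataset["rows"]  # assumption: records are already loaded in memory
        predictions = [infer_fn(row["input"]) for row in rows]
        if not default.get("need_judge", True):
            # Inference only: no judging or scoring requested.
            report[ds_name] = {"predictions": predictions}
            continue
        verdicts = [judge_fn(pred, row["target"]) for pred, row in zip(predictions, rows)]
        entry = {"verdicts": verdicts}
        if default.get("need_calculate", True):
            # Accuracy as a percentage of samples judged correct.
            entry["acc"] = 100.0 * sum(verdicts) / len(verdicts) if verdicts else 0.0
        report[ds_name] = entry
    return report
```

Passing the inference and judging steps as callables keeps the sketch independent of any particular backend (API, Transformers, or vLLM).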

Architecture Diagram

[Figure: system architecture diagram]

Directory Structure

BytevalKit_LLM/
├── main.py              # Main entry point
├── models/              # Model interfaces
│   ├── api_model.py     # API model interface
│   ├── hf_model.py      # Huggingface models
│   └── vllm_model.py    # vLLM models
├── execute/             # Execution engine
│   ├── infer.py         # Inference module
│   ├── judge.py         # Evaluation module
│   └── task.py          # Task management
└── demo/                # Example configs and data
    ├── dataset/         # Example datasets
    └── *.yaml           # Example configuration files

Benchmark

Note: To show that the framework works with open-source benchmarks, we validated it by running open-source models on selected evaluation sets. All judging in these runs uses LLM-based evaluation.

The following are framework evaluation results only, with models listed in no particular order.

Dataset        Metric  Qwen3_32B  qwen3-14b  qwen3-235b-a22b  DeepSeek-V3-671B  qwen1.7B-instruct  Qwen3_8B  Qwen2.5_1.5B  Qwen2.5_7B
AIME24         acc     33.33      26.67      46.67            33.3              16.67              23.3      3.33          13.3
AIME25         acc     20         23.33      27.5             25.83             7.5                8.33      0             15.42
C-SimpleQA     acc     40.12      37.21      54.39            58.79             13.67              31.85     12.56         23.43
MATH-500       acc     75.4       75.8       87.8             71.6              71.2               83        55.8          77.8
bbh            acc     87.39      84.59      88.81            87.01             55                 80.3      36.2          64.3
ceval-gen      acc     84.77      82.8       85.78            90                53.8               76.3      54.23         73.99
cmmlu-gen      acc     73.33      77.9       82.49            79.2              53.13              75.82     66.28         73.73
hellaswag-gen  acc     81.1       54.4       84.48            82.6              61                 70.3      56.25         69.6
GPQA-Diamond   acc     54         54.55      62.63            48.48             26.77              40.4      30.3          34.4
MMLU-Pro       acc     72.86      67.14      78.58            78.57             41.43              74.3      32.14         62.85

Dataset Acknowledgments

We thank the following open-source datasets for their contributions. Formatted versions are available in the demo/dataset/ directory:

  • AIME2024
  • C-SimpleQA
  • MATH-500
  • bbh
  • ceval-gen
  • cmmlu-gen
  • hellaswag
  • GPQA-Diamond
  • MMLU-Pro

Custom Extensions

Adding API Models

Modify the models/api_model.py file to add your API interface implementation.

Custom LLM Judge

Modify execute/judge.py and adjust ‘llm_judge’ to implement your evaluation logic.

Concurrency Optimization

Adding concurrent request capabilities to the evaluation module can significantly improve evaluation speed.
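
Judge calls that go to a remote LLM are I/O-bound, so a thread pool is one straightforward way to issue them concurrently. A minimal sketch using the standard library; `judge_concurrently` is a hypothetical helper, and treating its `workers` argument as the counterpart of the judge_workers option is an assumption:

```python
from concurrent.futures import ThreadPoolExecutor


def judge_concurrently(samples, judge_fn, workers=10):
    """Apply judge_fn to every sample with a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the order of the inputs,
        # regardless of which call finishes first.
        return list(pool.map(judge_fn, samples))
```

If the judge endpoint enforces rate limits, the worker count would need to stay below that limit or the calls would need retry handling, which this sketch omits.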

Contributing

This project is developed by the BytevalKit team. Development members:

{Peijie Bu, Yan Qiu, Shenwei Huang}, Yaling Mou, Xianxian Ma, 
Ming Jiang, Haizhen Liao, Jingwei Sun, Binbin Xing

{*} Equal Contributions.

We also thank the Bytedance Douyin Content Team for their support:

Xusheng Wang, Fubang Zhao, Jianhui Pang, Mingsi Ye, Jie Tang, Kang Yang, Xiaopu Wang, Shuang Zeng
Fei Jiang, Ying Ju, Chuang Fan, Chuwei Luo, Qingsong Liu, Xu Chen
Yi Lin, Junfeng Yao, Chao Feng, Jiao Ran

We also thank the Product Design and Byteval platform teams for their support:

Ziyu Shi, Zhao Lin, Yang Li, Jing Yang, Zhen Wang, Guojun Ma

And the AI platform team:

Huiyu Yu, Lin Dong, Yong Zhang

We welcome contributions of all kinds! Please check our Contributing Guide for details.

Special thanks to OpenCompass for their open-source framework, which provided valuable design insights.

Citation

If you use BytevalKit_LLM in your research, please cite:

@misc{BytevalKit-LLM-2025,
  title={BytevalKit-LLM: Comprehensive LLM Evaluation Framework},
  author={BytevalKit},
  year={2025},
  howpublished={\url{https://github.com/BytevalKit/BytevalKit_LLM}}
}

License

BytevalKit-LLM is licensed under the Apache License 2.0.

Contact Us

If you have any questions, feel free to contact us at: BytevalKit@bytedance.com
