
AACR-Bench


English | 简体中文

📋 Introduction

AACR-Bench is the industry's first multilingual, repository-level, context-aware code review evaluation dataset, designed to assess the performance of large language models on automated code review tasks. The dataset comprises 200 real Pull Requests from 50 active open-source projects, covering 10 mainstream programming languages. Each instance includes not only the code changes but also the complete repository context, authentically reproducing the entire code review process. Human-LLM collaborative review combined with multi-round expert annotation ensures the data is high-quality and comprehensive.

(Figure: AACR-Bench Overview)

✨ Core Features

🌍 Multi-language Coverage

The dataset covers 10 mainstream programming languages:

  • System-level languages: C++, Rust, Go
  • Enterprise languages: Java, C#, TypeScript
  • Scripting languages: Python, JavaScript, Ruby, PHP

📁 Repository-level Context

  • Preserves complete project structure
  • Supports cross-file references and inter-module interaction analysis
  • Includes PR metadata (description, title, comments, etc.)

🤖 Human Expert + LLM Enhanced Annotation

Professional Annotation Team

  • 80+ senior software engineers with 2+ years of experience
  • Covers frontend, backend, architecture, and other domains
  • Three rounds of cross-validation

LLM Intelligent Enhancement

  • Systematic issue discovery
  • Comprehensive defect identification
  • Improvement suggestion generation

Quality Assurance Process: GitHub human comments → LLM enhancement → Expert multi-round cross-annotation → Consistency validation

🎯 Evaluation Capabilities and Applications

AACR-Bench provides systematic evaluation capabilities across four core dimensions, supporting diverse research and application scenarios:

Evaluation Dimension System

1️⃣ Multi-language Evaluation (10 Languages)

  • Cross-language performance comparison: identify model strengths and weaknesses across languages
  • Language-specific optimization: improve model capabilities for specific languages
  • Generalization assessment: test the model's language transfer effectiveness

2️⃣ Positioning Accuracy Evaluation (Line-level)

  • Precise positioning: assess single-line/multi-line issue location accuracy
  • Cross-file tracking: test cross-file reference identification capability
  • Context boundaries: verify issue scope judgment accuracy

3️⃣ Issue Classification Evaluation (4 Categories)

  • Classification accuracy: assess issue type identification capability
  • Severity assessment: test issue priority judgment
  • Specialized capabilities: analyze detection rates for specific issue types

4️⃣ Context Understanding Evaluation (3 Levels)

  • Diff-level understanding: basic code change analysis
  • File-level understanding: complete file logic comprehension
  • Repo-level understanding: project-wide dependency analysis

Typical Application Scenarios

Model Development

  • Performance benchmarking: Evaluate new models’ code review capabilities using unified standards
  • Weakness analysis: Discover model shortcomings through fine-grained metrics (e.g., low recall in certain languages)
  • Iterative optimization validation: Quantify improvement effects, guide continuous model optimization

Academic Research

  • Comparative studies: Fair comparison of different model architectures/training methods
  • Ablation experiments: Analyze the impact of different context levels on review quality
  • New method validation: Provide standardized evaluation environment for innovative algorithms

Engineering Practice

  • Model selection: Help teams choose review models suitable for project characteristics
  • Pre-deployment validation: Ensure model reliability in production environments
  • Continuous monitoring: Track model performance changes in actual use

🚀 Quick Start

Clone Repository and Download Dataset

git clone https://github.com/alibaba/aacr-bench.git
cd aacr-bench

Install Dependencies

pip install -r requirements.txt

Configure Claude CLI

Edit configs/config.json and set the Claude CLI installation path:

{
  "cli_path": "your_path_to_claude.cmd",
  "data_path": "your_path_to_positive_samples.json"
}
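As a minimal sketch, the config file above can be loaded and sanity-checked before running the pipeline; the key names ("cli_path", "data_path") come from the example, while the helper function itself is illustrative, not part of the repository's API:

```python
import json

def load_config(path="configs/config.json"):
    """Read the pipeline config and verify the required keys exist."""
    with open(path, encoding="utf-8") as f:
        cfg = json.load(f)
    # Both keys are required by the workflow described in this README.
    for key in ("cli_path", "data_path"):
        if key not in cfg:
            raise KeyError(f"missing required config key: {key}")
    return cfg
```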

Configure Evaluation Environment Variables

Create a .env file in the evaluator_runner/utils/ directory:

LLM_MODEL_URL="your_llm_model_url"
LLM_MODEL="your_llm_model"
LLM_API_KEY="your_llm_api_key"

EMBEDDING_MODEL_URL="your_embedding_model_url"
EMBEDDING_MODEL="your_embedding_model"
EMBEDDING_API_KEY="your_embedding_api_key"
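A minimal sketch of loading such a .env file into the process environment, assuming the simple KEY="value" line format shown above (in practice the project may rely on a dedicated package such as python-dotenv; this tiny parser is only illustrative):

```python
import os

def load_env_file(path):
    """Parse KEY="value" lines from a .env file into os.environ."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            # Skip blank lines, comments, and malformed entries.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ[key.strip()] = value.strip().strip('"')
```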

Prepare Dataset

For the first run, you need to convert the raw data to task format. Uncomment and run in main.py:

if __name__ == "__main__":
    load_data_as_task()  # First run: generate task file

This will:

  • Read the raw dataset (specified by data_path)
  • Add a finish flag to each PR for progress tracking
  • Generate tmp_data.json task file
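The conversion step above can be sketched as follows; this is a hypothetical illustration of the described behavior, not the repository's actual load_data_as_task implementation (only the "finish" flag and the tmp_data.json file name come from this README):

```python
import json

def load_data_as_task(data_path, task_path="tmp_data.json"):
    """Convert the raw dataset into a task file with per-PR progress flags."""
    with open(data_path, encoding="utf-8") as f:
        prs = json.load(f)
    for pr in prs:
        # Only set the flag if absent, so a re-run preserves prior progress.
        pr.setdefault("finish", False)
    with open(task_path, "w", encoding="utf-8") as f:
        json.dump(prs, f, ensure_ascii=False, indent=2)
    return prs
```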

Run Code Review

cd claude-code-demo
python main.py

Run Evaluation

import asyncio
from evaluator_runner import (
    get_evaluator_ans_from_json,
    load_generated_comments_from_file,
    EvaluatorConfig
)

async def main():
    # Load AI-generated comments to evaluate
    generated_comments = load_generated_comments_from_file("path/to/comments.txt")
    
    # Reference comments (load from positive_samples.json)
    reference_comments = [...]
    
    # Run evaluation
    result = await get_evaluator_ans_from_json(
        github_pr_url="https://github.com/owner/repo/pull/123",
        generated_comments=generated_comments,
        good_comments=reference_comments
    )
    
    print(f"Location Match Rate: {result['positive_line_match_rate']}")
    print(f"Semantic Match Rate: {result['positive_match_rate']}")

asyncio.run(main())

Batch Evaluation

# Configure parameters in evaluator_runner/example_test.py, then run:
python evaluator_runner/example_test.py

📈 Data Overview

Dataset Scale

  • 200 Pull Requests
  • 10 Programming Languages
  • 50 Source Projects
  • 2,145 Review Comments

(Figure: Classification Statistics)

(Figure: Language Distribution)

Data Format

{
  "type": "array",
  "items": {
    "change_line_count": {"type": "integer", "description": "Number of changed lines"},
    "project_main_language": {"type": "string", "description": "Project main language"},
    "source_commit": {"type": "string", "description": "Source commit"},
    "target_commit": {"type": "string", "description": "Target commit"},
    "githubPrUrl": {"type": "string", "description": "GitHub PR URL"},
    "comments": {
      "type": "array",
      "items": {
        "is_ai_comment": {"type": "boolean", "description": "Whether it is an AI-generated comment"},
        "note": {"type": "string", "description": "Comment content"},
        "path": {"type": "string", "description": "File path"},
        "side": {"type": "string", "description": "Comment anchor position"},
        "source_model": {"type": "string", "description": "Source model"},
        "from_line": {"type": "integer", "description": "Start line number"},
        "to_line": {"type": "integer", "description": "End line number"},
        "category": {"type": "string", "description": "Issue category: Security/Defect/Maintainability/Performance"},
        "context": {"type": "string", "description": "Comment scope: diff/file/repo level"}
      }
    }
  }
}
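As an illustrative sketch of working with records in this format, the snippet below counts review comments per issue category; the field names follow the schema above, while the file path and helper name are placeholders:

```python
import json
from collections import Counter

def category_counts(path):
    """Count review comments per issue category across all PR records."""
    with open(path, encoding="utf-8") as f:
        prs = json.load(f)
    counts = Counter()
    for pr in prs:
        for comment in pr.get("comments", []):
            counts[comment.get("category", "Unknown")] += 1
    return counts
```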

📏 Evaluation Metrics

We employ a multidimensional metric system to comprehensively evaluate code review model performance. For complete metric definitions, calculation methods, and language-specific statistics, please refer to metrics.md.

Core Metrics

| Metric | Description | Formula |
| --- | --- | --- |
| Precision | Proportion of valid comments generated | Valid matches / Total generated |
| Recall | Ability to discover annotated issues | Valid matches / Dataset valid count |
| Line Precision | Ability to precisely locate code lines | Line matches / Total generated |
| Noise Rate | Proportion of invalid or incorrect comments | Unmatched / Total generated |
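The formulas above reduce to plain arithmetic over four counts; this sketch (the function name and argument names are our own) shows the computation:

```python
def review_metrics(valid_matches, total_generated, line_matches, dataset_valid):
    """Compute the core metrics from raw match counts, guarding against
    division by zero when a model emits no comments."""
    precision = valid_matches / total_generated if total_generated else 0.0
    recall = valid_matches / dataset_valid if dataset_valid else 0.0
    line_precision = line_matches / total_generated if total_generated else 0.0
    # Unmatched comments are everything generated that found no valid match.
    noise_rate = (total_generated - valid_matches) / total_generated if total_generated else 0.0
    return {"precision": precision, "recall": recall,
            "line_precision": line_precision, "noise_rate": noise_rate}
```

For example, a model that generates 10 comments of which 8 match valid reference comments (6 on the exact lines), against a dataset with 16 valid comments, scores precision 0.8, recall 0.5, line precision 0.6, and noise rate 0.2.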

🤝 Contributing

We welcome community contributions! If you want to contribute to AACR-Bench, please follow these steps:

  1. Fork this repository
  2. Create feature branch (git checkout -b feat/add-new-prs)
  3. Commit changes (git commit -m 'feat: add new PRs')
  4. Push to branch (git push origin feat/add-new-prs)
  5. Create Pull Request

For detailed contribution guidelines, please refer to CONTRIBUTING.md

👥 Authors and Maintainers

| Name | GitHub | Domain | Responsibilities |
| --- | --- | --- | --- |
| Zhengfeng Li | @lizhengfeng | Project Lead | Overall architecture design, technical direction |
| Boge Wang | @wangboge | Technical Consultant | Technical solution review, technical guidance |
| Lei Zhang | @zhanglei | Data Architecture | Evaluation framework, metric system, performance optimization |
| Yongda Yu | @yuyongda | Evaluation System | Data schema design, evaluation protocol, quality standards |
| Xinxin Guo | @guoxinxin | Annotation Platform | Annotation system development, workflow design, quality assurance |
| Minghui Yu | @yuminghui | AI Enhancement | LLM annotation pipeline, model selection, prompt optimization |
| Zhengqi Zhuang | @zhuangzhengqi | Engineering | CI/CD pipeline, automated testing, deployment scripts |

📄 License

This project is licensed under the Apache License 2.0. For details, please see the LICENSE file.

📚 Citation

If you use AACR-Bench in your research, please cite our paper:

@misc{zhang2026aacrbenchevaluatingautomaticcode,
      title={AACR-Bench: Evaluating Automatic Code Review with Holistic Repository-Level Context}, 
      author={Lei Zhang and Yongda Yu and Minghui Yu and Xinxin Guo and Zhengqi Zhuang and Guoping Rong and Dong Shao and Haifeng Shen and Hongyu Kuang and Zhengfeng Li and Boge Wang and Guoan Zhang and Bangyu Xiang and Xiaobin Xu},
      year={2026},
      eprint={2601.19494},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2601.19494}, 
}

🗺️ Roadmap

  • v1.0 (2026.01): Initial release - 200 PRs, 10 languages

🌟 Acknowledgments

  • Thanks to all contributors who participated in data annotation, especially core contributors who completed 15+ valid annotations. Full list in CONTRIBUTORS.md.
  • Thanks to open-source project maintainers for providing original PR data.