📋 Introduction
AACR-Bench is the industry’s first multilingual, repository-level, context-aware code review evaluation dataset, designed to assess the performance of large language models on automated code review tasks. The dataset comprises 200 real Pull Requests from 50 active open-source projects, covering 10 mainstream programming languages. Each instance not only includes the code changes but also preserves the complete repository context, authentically reproducing the entire code review process. Human-LLM collaborative review combined with multi-round expert annotation ensures the quality and comprehensiveness of the data.
✨ Core Features
🌍 Multi-language Coverage
Covers the 10 mainstream programming languages used in the source projects.
📁 Repository-level Context
🤖 Human Expert + LLM Enhanced Annotation
Professional Annotation Team
LLM Intelligent Enhancement
Quality Assurance Process: GitHub human comments → LLM enhancement → Expert multi-round cross-annotation → Consistency validation
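The final consistency-validation step can be illustrated with a simple pairwise-agreement check between annotators. This is a minimal sketch, not the project's actual validation code; the `pairwise_agreement` function and the use of Jaccard overlap are assumptions for illustration.

```python
from itertools import combinations

def pairwise_agreement(annotations):
    """annotations: one set of labeled issue IDs per annotator, for the same PR.
    Returns the mean Jaccard overlap across all annotator pairs (1.0 = full agreement)."""
    pairs = list(combinations(annotations, 2))
    if not pairs:
        return 1.0  # a single annotator trivially agrees with itself
    scores = []
    for a, b in pairs:
        union = a | b
        scores.append(len(a & b) / len(union) if union else 1.0)
    return sum(scores) / len(scores)

# Three annotators labeling issue IDs on one PR:
score = pairwise_agreement([{1, 2, 3}, {1, 2}, {2, 3}])
```

PRs whose agreement score falls below a chosen threshold would then be sent back for another annotation round.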
🎯 Evaluation Capabilities and Applications
AACR-Bench provides systematic evaluation capabilities across four core dimensions, supporting diverse research and application scenarios:
Evaluation Dimension System
• Cross-language performance comparison: Identify model strengths and weaknesses across languages
• Language-specific optimization: Improve model capabilities for specific languages
• Generalization assessment: Test the model’s language transfer effectiveness
• Cross-file tracking: Test cross-file reference identification capability
• Context boundaries: Verify issue scope judgment accuracy
• Severity assessment: Test issue priority judgment
• Specialized capabilities: Analyze detection rates for specific issue types
• File-level understanding: Complete file logic comprehension
• Repo-level understanding: Project-wide dependency analysis
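Line-level positioning ("precisely locate code lines") can be sketched as a simple matcher. This is illustrative only; `is_line_match` and its tolerance parameter are assumptions, not the benchmark's actual matching logic.

```python
def is_line_match(generated_line, annotated_lines, tolerance=0):
    """A generated comment counts as a line match when its target line falls
    within `tolerance` lines of any annotated issue line (0 = exact match)."""
    return any(abs(generated_line - a) <= tolerance for a in annotated_lines)

# Exact-line matching against issues annotated on lines 42 and 90:
is_line_match(42, [42, 90])               # exact hit
is_line_match(43, [42, 90])               # off by one -> no match at tolerance 0
is_line_match(43, [42, 90], tolerance=1)  # allowed under a one-line tolerance
```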
Typical Application Scenarios
Model Development
Academic Research
Engineering Practice
🚀 Quick Start
Clone Repository and Download Dataset
Install Dependencies
Configure Claude CLI
Edit configs/config.json and set the Claude CLI installation path.
Configure Evaluation Environment Variables
Create a .env file in the evaluator_runner/utils/ directory.
Prepare Dataset
For the first run, you need to convert the raw data to task format. Uncomment and run the relevant section in main.py. This will:
• Read the raw dataset from the configured path (data_path)
• Add a finish flag to each PR for progress tracking
• Generate the tmp_data.json task file
Run Code Review
Run Evaluation
Batch Evaluation
📈 Data Overview
Dataset Scale
200 Pull Requests · 10 Programming Languages · 50 Source Projects · 2,145 Review Comments
Classification Statistics
Language Distribution
Data Format
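The concrete schema is defined by the dataset files themselves; as a rough hypothetical sketch only, an instance could be shaped like the following (every field name here is an assumption for illustration, not the benchmark's actual schema):

```python
# Hypothetical field names for illustration only; consult the dataset files
# for the actual schema.
instance = {
    "pr_id": "example/repo#123",        # source pull request (hypothetical id)
    "language": "python",               # one of the 10 covered languages
    "diff": "--- a/app.py\n+++ b/app.py",  # the code change under review
    "repo_context": ["app.py", "utils/helpers.py"],  # files preserved as context
    "comments": [                       # annotated review comments
        {"file": "app.py", "line": 17, "body": "Handle the empty-list case."},
    ],
}
```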
📏 Evaluation Metrics
We employ a multidimensional metric system to comprehensively evaluate code review model performance. For complete metric definitions, calculation methods, and language-specific statistics, please refer to metrics.md.
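The core metrics reduce to simple ratios over match counts. The sketch below shows the arithmetic; the function and variable names are illustrative, not the evaluator's API.

```python
def review_metrics(valid_matches, line_matches, total_generated, dataset_valid):
    """Compute AACR-Bench-style core metrics from raw match counts."""
    precision = valid_matches / total_generated if total_generated else 0.0
    recall = valid_matches / dataset_valid if dataset_valid else 0.0
    line_precision = line_matches / total_generated if total_generated else 0.0
    noise_rate = (total_generated - valid_matches) / total_generated if total_generated else 0.0
    return {"precision": precision, "recall": recall,
            "line_precision": line_precision, "noise_rate": noise_rate}

# e.g. 12 valid matches (8 on the exact line) out of 20 generated comments,
# against 30 annotated issues in the dataset:
m = review_metrics(valid_matches=12, line_matches=8, total_generated=20, dataset_valid=30)
# -> precision 0.6, recall 0.4, line precision 0.4, noise rate 0.4
```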
Core Metrics

| Metric | Description | Formula |
| --- | --- | --- |
| Precision | Proportion of valid comments generated | Valid matches / Total generated |
| Recall | Ability to discover annotated issues | Valid matches / Dataset valid count |
| Line Precision | Ability to precisely locate code lines | Line matches / Total generated |
| Noise Rate | Proportion of invalid or incorrect comments | Unmatched / Total generated |

🤝 Contributing
We welcome community contributions! If you want to contribute to AACR-Bench, please follow these steps:
1. Fork the repository
2. Create a feature branch (git checkout -b feat/add-new-prs)
3. Commit your changes (git commit -m 'feat: add new PRs')
4. Push the branch (git push origin feat/add-new-prs)
5. Open a Pull Request

For detailed contribution guidelines, please refer to CONTRIBUTING.md.
👥 Authors and Maintainers
📄 License
This project is licensed under the Apache License 2.0. For details, please see the LICENSE file.
📚 Citation
If you use AACR-Bench in your research, please cite our paper:

@misc{zhang2026aacrbenchevaluatingautomaticcode,
  title={AACR-Bench: Evaluating Automatic Code Review with Holistic Repository-Level Context},
  author={Lei Zhang and Yongda Yu and Minghui Yu and Xinxin Guo and Zhengqi Zhuang and Guoping Rong and Dong Shao and Haifeng Shen and Hongyu Kuang and Zhengfeng Li and Boge Wang and Guoan Zhang and Bangyu Xiang and Xiaobin Xu},
  year={2026},
  eprint={2601.19494},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2601.19494},
}
🗺️ Roadmap
v1.0 (2026.01): Initial release - 200 PRs, 10 languages
🌟 Acknowledgments
Thanks to all contributors who participated in data annotation, especially the core contributors who completed 15+ valid annotations. The full list is in CONTRIBUTORS.md.
Thanks to the open-source project maintainers for providing the original PR data.