KernelGenBench is a component of FlagOS — a unified, open-source AI system software stack that
fosters an open technology ecosystem by seamlessly integrating various models, systems, and chips.
Following the principle of “develop once, migrate across various chips”,
FlagOS aims to unlock the full computational potential of hardware, break down barriers
between different chip software stacks, and effectively reduce migration costs.
KernelGenBench is a benchmark framework for evaluating LLM and agent-based Triton kernel generation across multiple hardware platforms.
Features
210 operators across three sources: ATen (110), vLLM (50), cuBLAS (50)
Setup
Install dependencies for your platform:
# For Agent Track, also install Claude Code CLI:
npm install -g @anthropic-ai/claude-code
Note: On NVIDIA platforms, vllm==0.13.0 will automatically install compatible versions of torch and triton. On non-NVIDIA platforms, torch and triton are pre-installed in the vendor container image — do NOT install vllm.
Configure API credentials:
# Anthropic Claude
export ANTHROPIC_API_KEY=your_key
# OpenAI / OpenAI-compatible
export OPENAI_API_KEY=your_key
export OPENAI_BASE_URL=http://your-endpoint/v1 # optional, for custom endpoints
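As a rough illustration of how these variables might be consumed, the sketch below resolves credentials for a given provider. The environment variable names come from the lines above; the function `resolve_credentials` and its return shape are hypothetical and are not part of KernelGenBench.

```python
import os
from typing import Mapping, Optional

# Variable names are taken from the setup notes; the mapping itself is illustrative.
_KEY_VARS = {"anthropic": "ANTHROPIC_API_KEY", "openai": "OPENAI_API_KEY"}


def resolve_credentials(server_type: str,
                        env: Optional[Mapping[str, str]] = None) -> dict:
    """Return the API key (and optional base URL) for the chosen provider.

    Hypothetical helper: KernelGenBench may read these variables differently.
    """
    env = os.environ if env is None else env
    if server_type not in _KEY_VARS:
        raise ValueError(f"unknown server type: {server_type!r}")
    var = _KEY_VARS[server_type]
    api_key = env.get(var)
    if not api_key:
        raise RuntimeError(f"{var} is not set")
    creds = {"api_key": api_key}
    # OPENAI_BASE_URL is optional and only applies to OpenAI-compatible endpoints.
    if server_type == "openai" and env.get("OPENAI_BASE_URL"):
        creds["base_url"] = env["OPENAI_BASE_URL"]
    return creds
```

Calling `resolve_credentials("openai")` before a long run is one way to fail fast on a missing key instead of discovering it after the first request.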
Datasets
| Dataset | Operators | Description |
| --- | --- | --- |
| KernelGenBench | 210 | Full set (ATen + vLLM + cuBLAS, NVIDIA-only) |
| KernelGenBench-aten | 110 | ATen operators only |
| KernelGenBench-vllm | 50 | vLLM operators only (NVIDIA-only) |
| KernelGenBench-cublas | 50 | cuBLAS operators only (NVIDIA-only) |
On non-NVIDIA chips, the default dataset is automatically set to KernelGenBench-aten (vLLM and cuBLAS operators require NVIDIA GPUs).
Device type is auto-detected. All platforms use the same commands — the framework handles device differences automatically.
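The dataset defaulting rule can be sketched as a small pure function. The dataset names are the ones listed above; the device label strings, the choice of the full set as the NVIDIA default, and the helper `pick_dataset` are assumptions made for illustration, not the framework's actual API.

```python
from typing import Optional

# Dataset names come from the Datasets section; everything else is illustrative.
ALL_DATASETS = {
    "KernelGenBench",
    "KernelGenBench-aten",
    "KernelGenBench-vllm",
    "KernelGenBench-cublas",
}
NVIDIA_ONLY = ALL_DATASETS - {"KernelGenBench-aten"}


def pick_dataset(device_type: str, requested: Optional[str] = None) -> str:
    """Choose a dataset for a device label such as "nvidia" (hypothetical label)."""
    is_nvidia = device_type.lower() == "nvidia"
    if requested is None:
        # Assumption: NVIDIA defaults to the full set; non-NVIDIA falls back to ATen.
        return "KernelGenBench" if is_nvidia else "KernelGenBench-aten"
    if requested not in ALL_DATASETS:
        raise ValueError(f"unknown dataset: {requested!r}")
    if requested in NVIDIA_ONLY and not is_nvidia:
        raise ValueError(f"{requested} requires NVIDIA GPUs")
    return requested
```

Rejecting an NVIDIA-only dataset on other hardware up front is a design choice of this sketch; the real tool may instead skip unsupported operators.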
Results
Multi-Source (NVIDIA A100, 210 operators)
Evaluation across 210 operators from three sources (ATen, vLLM, cuBLAS), showing accuracy and speedup by operator source across all generation paradigms.
Multi-Chip (110 ATen operators, 6 platforms)
Cross-platform evaluation on 110 ATen operators across six hardware platforms, showing whether correctness and speedup transfer across heterogeneous hardware backends. Platforms A–E are anonymized vendor hardware.
Generating Triton kernels on non-NVIDIA hardware incurs significant additional cost — up to 2× more tokens and time due to immature compilers and incomplete backend support.
LLM Track
Evaluate an LLM on Triton kernel generation using the Pass@K metric:
Parameters
| Parameter | Description | Default |
| --- | --- | --- |
| --op-name | Single operator to test (e.g., aten::add, vllm13::rms_norm) | All operators |
| --single-test | Randomly pick 1 operator for quick testing | Off |
| --dataset | Dataset to use (KernelGenBench, KernelGenBench-aten, -vllm, -cublas) | Auto-detect |
| --server-type | LLM provider (openai, anthropic) | openai |
| --model-name | Model name | gpt-4o |
| --max-rounds | Number of Pass@K rounds | 10 |
| --device-count | Number of GPUs for verification | 8 |
| --timeout | Timeout per operator (seconds) | 300 |
| --temperature | Sampling temperature | 0.8 |
| --reflection | Use previous round’s errors as feedback | Off |
| --resume-from | Resume from existing checkpoint directory | - |
| --debug | Debug mode (only 8 operators) | Off |
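The README does not spell out how Pass@K is computed, so the following is only a sketch of two common readings. `pass_at_k` treats each round as one attempt and counts an operator as solved if any of its first k rounds passes; `pass_at_k_unbiased` is the widely used combinatorial estimator for n samples with c correct. Both function names and the input layout are assumptions.

```python
from math import comb
from typing import Dict, List


def pass_at_k(results: Dict[str, List[bool]], k: int) -> float:
    """Fraction of operators with at least one passing round among the first k."""
    if k < 1:
        raise ValueError("k must be >= 1")
    if not results:
        return 0.0
    solved = sum(1 for rounds in results.values() if any(rounds[:k]))
    return solved / len(results)


def pass_at_k_unbiased(n: int, c: int, k: int) -> float:
    """Estimator 1 - C(n-c, k) / C(n, k) for one operator with c of n samples correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical per-round verification outcomes keyed by operator name.
    rounds = {
        "aten::add": [False, True, False],
        "vllm13::rms_norm": [False, False, False],
    }
    print(pass_at_k(rounds, 1), pass_at_k(rounds, 2))
```

With --reflection enabled, later rounds depend on earlier errors, so the rounds are no longer independent samples and the first reading is probably the safer interpretation.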
Agent Track
Evaluate coding agents that iteratively generate, verify, and fix kernels.
Setup
Option A: Single environment (recommended)
Install Claude Code CLI into the same environment that has torch/vllm:
# In your KernelGenBench environment
npm install -g @anthropic-ai/claude-code
cp agent_bench/config.example.yaml agent_bench/config.yaml
# Edit config.yaml: set paths.python to your current Python path
Option B: Separate environments (if you already have Claude Code installed elsewhere)
If you have Claude Code in a different environment, set paths.python in config.yaml to point to the Python with torch/vllm, and export the Claude env’s PATH:
cp agent_bench/config.example.yaml agent_bench/config.yaml
# Edit config.yaml:
# paths.python: /path/to/envs/kernelgenbench/bin/python
# When running, export PATH to include your Claude Code env:
export PATH="/path/to/envs/claude_tool/bin:$PATH"
cd agent_bench && bash test_ops.sh add --device-count 1
The key config fields in config.yaml:
paths.python — Python interpreter with torch + vllm + kernelgenbench installed (used for verification)
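Before launching an agent run it can help to confirm that the configured interpreter and agent CLI actually resolve. The sketch below assumes the YAML file has already been parsed into a nested dict with `paths.python` and `agent.bin` keys (the latter defaulting to `claude`, as the README's config field list states); the function `check_agent_config` is hypothetical.

```python
import os
import shutil
import sys


def check_agent_config(config: dict) -> dict:
    """Resolve paths.python and agent.bin from an already-parsed config mapping.

    Assumes a nested layout {"paths": {"python": ...}, "agent": {"bin": ...}}.
    """
    python_path = config.get("paths", {}).get("python")
    if not python_path or not os.path.isfile(python_path):
        raise FileNotFoundError(
            f"paths.python does not point to a file: {python_path!r}"
        )
    agent_bin = config.get("agent", {}).get("bin", "claude")
    # shutil.which searches PATH for bare names and also accepts explicit paths;
    # it returns None when nothing executable is found.
    resolved_agent = shutil.which(agent_bin)
    return {"python": python_path, "agent": resolved_agent}


if __name__ == "__main__":
    # Uses the current interpreter as a stand-in for the verification Python.
    print(check_agent_config({"paths": {"python": sys.executable}}))
```

A result with `"agent": None` would indicate that the agent CLI is not on PATH, which is the situation Option B addresses by exporting the Claude Code environment's PATH.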
Contributing
We welcome contributions! You can:
Add new operators — expand the benchmark with new test cases
Add new chip backends — extend support to additional hardware
Add new agents — integrate coding tools like Codex, Trae, Cursor
Add new agentic methods — contribute specialized optimization pipelines
Supported Devices
KernelGenBench supports 6 hardware platforms: NVIDIA, Ascend, MUSA, Hygon, Iluvatar, MetaX.
Besides paths.python, config.yaml has one more key field:
agent.bin — path to agent CLI executable (default: claude, searches PATH)
Methods
| Method | Command |
| --- | --- |
| naive_cc | bash test_ops.sh add -m naive_cc |
| normal_cc | bash test_ops.sh add -m normal_cc |
| naive_opencode | bash test_ops.sh add -m naive_opencode |
| normal_opencode | bash test_ops.sh add -m normal_opencode |

Three further methods run through dedicated scripts:
bash test_autokernel.sh add
bash test_ako4all.sh add
bash test_cuda_optimized_skill.sh add
Running
Parameters (test_ops.sh)
[operators]
-d, --dataset (default: KernelGenBench)
-m, --method: naive_cc, normal_cc, naive_opencode, or normal_opencode (default: normal_cc)
--device-count
--timeout
--skip-gen
--skip-verify
-v, --verbose
Results
Results are saved to agent_bench/runs/<run_name>/:
progress.json — real-time progress tracking
kernels/ — generated kernel files
results.json — verification results
Analyzing Results
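As a starting point for inspecting a finished run, the sketch below only checks the run directory layout described above (progress.json, kernels/, results.json). The schema of results.json is not documented here, so the code reports nothing more than its top-level entry count; `summarize_run` is a hypothetical helper, not a KernelGenBench command.

```python
import json
from pathlib import Path


def summarize_run(run_dir) -> dict:
    """Report which expected artifacts exist in an agent_bench/runs/<run_name>/ folder."""
    run = Path(run_dir)
    kernels_dir = run / "kernels"
    results_path = run / "results.json"

    summary = {
        "has_progress": (run / "progress.json").is_file(),
        "has_results": results_path.is_file(),
        "kernel_files": [],
        "result_entries": None,
    }
    if kernels_dir.is_dir():
        summary["kernel_files"] = sorted(
            p.name for p in kernels_dir.iterdir() if p.is_file()
        )
    if results_path.is_file():
        # No assumption about the schema: only count top-level items.
        data = json.loads(results_path.read_text(encoding="utf-8"))
        if isinstance(data, (dict, list)):
            summary["result_entries"] = len(data)
    return summary
```

Comparing the number of kernel files with the number of result entries is one quick way to spot operators that were generated but never verified, assuming results.json holds one top-level entry per operator.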
Project Structure
Evaluating Custom Operators
To benchmark your own operators, add test cases to src/kernelgenbench/accuracy/ and register them in the dataset. CONTRIBUTING.md has step-by-step instructions and detailed guides.
Related Projects
Citation
If you find KernelGenBench useful in your research or evaluation, please cite:
License
This project is licensed under the MIT License.