🔥 Paper 🔥
The integration of Large Language Models (LLMs) and agentic systems marks a pivotal shift in high-performance computing, transforming kernel engineering from a labor-intensive, expert-dependent process into a scalable, automated workflow. To provide a systematic perspective on this rapidly evolving field, we summarize the related literature below, organized according to the taxonomy proposed in the survey. We categorize these works into four main streams:
📌 Table of Contents (ToC)
LLM4Kernel
Applying LLMs to kernel synthesis raises universal challenges in correctness and performance-sensitive code structuring across diverse programming abstractions. To address these challenges, this section reviews the two principal post-training methodologies that dominate current research: supervised fine-tuning (SFT) and reinforcement learning (RL).
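A common ingredient across the RL entries below is a verifiable reward that combines functional correctness with measured speedup over a reference implementation. The sketch below is a minimal, framework-free illustration; the `bench` helper, the tolerances, and the reward shaping are assumptions for exposition, not the recipe of any particular paper.

```python
import time

def bench(fn, args, iters=50):
    """Median wall-clock time over several runs. Real harnesses add
    warm-up and device-side timing (e.g. CUDA events)."""
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - t0)
    return sorted(times)[iters // 2]

def kernel_reward(candidate, reference, args, tol=1e-5, speedup_weight=0.5):
    """Hypothetical reward shaping: zero for code that crashes or is
    numerically wrong, otherwise a base reward plus a speedup bonus."""
    try:
        out = candidate(*args)
    except Exception:
        return 0.0  # compilation/runtime failure earns nothing
    ref = reference(*args)
    if len(out) != len(ref) or any(abs(a - b) > tol for a, b in zip(out, ref)):
        return 0.0  # incorrect output earns nothing
    speedup = bench(reference, args) / max(bench(candidate, args), 1e-12)
    return 1.0 + speedup_weight * max(0.0, speedup - 1.0)
```

The hard zero on incorrect outputs is what prevents a policy from trading correctness for speed, the reward-hacking failure mode that titles like TritonRL's "Without Cheating" allude to.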
SFT
[10/2025] ConCuR: Conciseness Makes State-of-the-Art Kernel Generation [paper] | [code]
[06/2025] KernelLLM: Making Kernel Development More Accessible [link]
RL
[09/2025] Mastering Sparse CUDA Generation through Pretrained Models and Deep Reinforcement Learning [paper]
[07/2025] Kevin: Multi-Turn RL for Generating CUDA Kernels [paper]
[11/2025] QiMeng-Kernel: Macro-Thinking Micro-Coding Paradigm for LLM-Based High-Performance GPU Kernel Generation [paper]
[07/2025] AutoTriton: Automatic Triton Programming with Reinforcement Learning in LLMs [paper]
[10/2025] TritonRL: Training LLMs to Think and Code Triton Without Cheating [paper]
[07/2025] CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning [paper] | [code]
[12/2025] CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning [paper]
[01/2026] AscendKernelGen: A Systematic Study of LLM-Based Kernel Generation for Neural Processing Units [paper]
[02/2026] Dr. Kernel: Reinforcement Learning Done Right for Triton Kernel Generations [paper] | [code]
[02/2026] Improving HPC Code Generation Capability of LLMs via Online Reinforcement Learning with Real-Machine Benchmark Rewards [paper]
[02/2026] CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation [paper] | [code]
[03/2026] Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization [paper]
Agent4Kernel
While foundational LLMs are often limited to static, one-pass inference, agentic systems introduce an autonomous, closed-loop paradigm characterized by iterative planning, tool use, and feedback-driven refinement. This shift enables scalable, long-horizon optimization beyond the reach of manual or single-pass methods. To systematically evaluate this landscape, we categorize agent-driven advancements into four structural dimensions: learning mechanisms, external memory management, hardware profiling integration, and multi-agent orchestration.
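That closed loop can be sketched in a few lines. Everything here (the `llm` and `compile_and_profile` callables, the best-so-far bookkeeping) is an illustrative placeholder, not the API of any system listed below:

```python
def optimize_kernel(task, llm, compile_and_profile, max_iters=8):
    """Minimal generate-verify-refine loop: propose a kernel, collect
    compiler/profiler feedback, and fold it into the next prompt."""
    best_code, best_time = None, float("inf")
    feedback = "none yet"
    for _ in range(max_iters):
        code = llm(f"Task: {task}\nPrevious feedback: {feedback}")
        ok, runtime, log = compile_and_profile(code)
        if ok and runtime < best_time:
            best_code, best_time = code, runtime
        feedback = log  # errors or profiler counters steer the next attempt
    return best_code, best_time
```

The four dimensions below differ mainly in what fills each slot: learning mechanisms change how `llm` improves across iterations, external memory enriches the prompt with past experience, profiling integration widens the feedback beyond pass/fail, and multi-agent orchestration splits the loop across specialized roles.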
Learning Mechanisms
[02/2025] Automating GPU Kernel Generation with DeepSeek-R1 and Inference Time Scaling [blog]
[12/2025] PEAK: A Performance Engineering AI-Assistant for GPU Kernels Powered by Natural Language Transformations [paper]
[12/2025] GPU Kernel Optimization Beyond Full Builds: An LLM Framework with Minimal Executable Programs [paper]
[01/2026] DiffBench Meets DiffAgent: End-to-End LLM-Driven Diffusion Acceleration Code Generation [paper]
[12/2025] Agentic Operator Generation for ML ASICs [paper]
[10/2025] KernelGen [link]
[01/2026] MaxCode: A Max-Reward Reinforcement Learning Framework for Automated Code Optimization [paper]
[01/2026] AscendCraft: Automatic Ascend NPU Kernel Generation via DSL-Guided Transcompilation [paper]
[02/2026] K-Search: LLM Kernel Generation via Co-Evolving Intrinsic World Model [paper] | [code]
[11/2025] Autocomp: A Powerful and Portable Code Optimizer for Tensor Accelerators [paper]
[09/2025] Towards Robust Agentic CUDA Kernel Benchmarking, Verification, and Optimization [paper] | [code]
[10/2025] The FM Agent [paper]
[10/2025] EvoEngineer: Mastering Automated CUDA Kernel Code Evolution with Large Language Models [paper]
[06/2025] GPU Kernel Scientist: An LLM-Driven Framework for Iterative Kernel Optimization [paper]
[06/2025] IntelliPerf: Profiling-guided LLM framework for iterative GPU kernel optimization on AMD GPUs [blog] | [code]
[12/2025] cuPilot: A Strategy-Coordinated Multi-agent Framework for CUDA Kernel Evolution [paper]
[03/2026] AutoKernel: Autonomous GPU Kernel Optimization via Iterative Agent-Driven Search [paper] | [code]
[03/2026] KernelFoundry: Hardware-Aware Evolutionary GPU Kernel Optimization [paper]
[03/2026] AVO: Agentic Variation Operators for Autonomous Evolutionary Search [paper]
[03/2026] AKO: Agentic Kernel Optimization (a harness for existing coding agents) [project] | [code]
[04/2026] CuTeGen: An LLM-Based Agentic Framework for Generation and Optimization of High-Performance GPU Kernels using CuTe [paper]
External Memory / Experience / Skill Management
[02/2025] The AI CUDA Engineer: Agentic CUDA Kernel Discovery, Optimization and Composition [link]
[12/2025] KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta [paper]
[10/2025] From Large to Small: Transferring CUDA Optimization Expertise via Reasoning Graph [paper]
[02/2026] KernelBlaster: Continual Cross-Task CUDA Optimization via Memory-Augmented In-Context Reinforcement [paper]
[03/2026] Towards Cold-Start Drafting and Continual Refining: A Value-Driven Memory Approach with Application to NPU Kernel Synthesis [paper] | [project]
[03/2026] KernelSkill: A Multi-Agent Framework for GPU Kernel Optimization [paper] | [code]
Hardware Profiling Integration
[03/2025] IntelliKit: LLM-ready profiling and analysis toolkit for AMD GPUs [code]
[05/2025] QiMeng-TensorOp: Automatically Generating High-Performance Tensor Operators with Hardware Primitives [paper]
[04/2025] QiMeng-GEMM: Automatically Generating High-Performance Matrix Multiplication Code by Exploiting Large Language Models [paper]
[07/2025] QiMeng-Attention: SOTA Attention Operator is generated by SOTA Attention Algorithm [paper]
[08/2025] SwizzlePerf: Hardware-Aware LLMs for GPU Kernel Performance Optimization [paper]
[06/2025] CUDA-LLM: LLMs Can Write Efficient CUDA Kernels [paper]
[12/2025] TritonForge: Profiling-Guided Framework for Automated Triton Kernel Optimization [paper]
[11/2025] PRAGMA: A Profiling-Reasoned Multi-Agent Framework for Automatic Kernel Optimization [paper]
[11/2025] KernelBand: Boosting LLM-based Kernel Optimization with a Hierarchical and Hardware-aware Multi-armed Bandit [paper]
[10/2025] Integrating Performance Tools in Model Reasoning for GPU Kernel Optimization [paper]
Multi-Agent Orchestration
[10/2025] STARK: Strategic Team of Agents for Refining Kernels [paper]
[06/2025] AKG: AI-Powered Automatic Kernel Generator [paper] | [code]
[10/2025] CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization [paper] | [code]
[11/2025] KForge: Program Synthesis for Diverse AI Hardware Accelerators [paper]
[11/2025] KernelFalcon: Autonomous GPU Kernel Generation via Deep Agents [blog]
[09/2025] Astra: A Multi-Agent System for GPU Kernel Performance Optimization [paper]
[07/2025] GEAK: Introducing Triton Kernel AI Agent & Evaluation Benchmarks [blog] | [paper] | [code]
[01/2026] A Two-Stage GPU Kernel Tuner Combining Semantic Refactoring and Search-Based Optimization [paper]
Datasets
High-quality data in this domain is defined not merely by volume, but by its ability to bridge the semantic gap between high-level algorithms and low-level hardware optimizations. In this section, we survey the data landscape and organize the available resources. The dates listed correspond to the initial release of each GitHub repository; note that these libraries remain under active development, with continuous updates and optimizations following their inception.
Structured Datasets
[02/2024] The Stack v2 (HPC Subset) [paper] | [dataset]
[06/2024] HPC-Instruct — A Dataset for HPC Code Optimization [paper] | [dataset]
[05/2025] KernelBook — Torch-Triton Aligned Corpus [dataset] | [repo]
[02/2025] KernelBench Samples — Optimization Tasks & Performance Traces [dataset]
Source Code Repositories
Operator and Kernel Libraries
[12/2017] CUTLASS — CUDA C++ Template Library [code]
[05/2022] FlashAttention — Fast and Memory-Efficient Exact Attention [paper] | [code]
[11/2023] FlagAttention — Memory Efficient Attention Operators Implemented in Triton [code]
[02/2024] AoTriton — AOT-compiled Triton Kernels for AMD ROCm [code]
[11/2021] xFormers — Hackable and Optimized Transformer Building Blocks [code]
[08/2024] Liger-Kernel — Efficient Training Kernels for LLMs [code]
[04/2024] FlagGems — Triton-based Operator Library [code]
[09/2022] Bitsandbytes — 8-bit Quantization Wrappers for LLMs [code]
[09/2024] Gemlite — Triton Kernels for Efficient Low-Bit Matrix Multiplication [code]
[11/2024] AITER — AMD operator and kernel library for high-performance AI workloads [code]
[01/2025] FlashInfer — Kernel Library for LLM Serving [code]
[05/2021] FBGEMM — Low-precision High-performance Matrix Multiplication [code]
[09/2022] Transformer Engine — FP8 Acceleration Library for Transformer Models [code]
Frameworks and System Integration Code
[10/2016] PyTorch (ATen) — Foundational Tensor Library for C++ and Python [code]
[06/2023] vLLM — Easy, Fast, and Cheap LLM Serving [paper] | [code]
[12/2023] SGLang — Structured Generation Language for LLMs [code]
[03/2023] llama.cpp — C/C++ Inference Port of LLaMA Models [code]
[03/2025] IntelliKit — LLM-ready profiling and analysis toolkit for AMD GPUs [code]
[08/2023] TensorRT-LLM — TensorRT for LLM Inference [code]
[10/2019] DeepSpeed — System for Large Scale Model Training [paper] | [code]
Domain-Specific Languages and Emerging Abstractions
[07/2019] Triton — Open-Source GPU Programming Language [paper] | [code]
[03/2024] ThunderKittens — Tile primitives for CUDA [paper] | [code]
[04/2024] TileLang — Intermediate Language for Tile-based Optimization [code]
[06/2024] tt-metal — Bare Metal Programming on Tenstorrent [code]
[12/2025] cuTile — NVIDIA DSL for Tile-centric Programming [docs]
Knowledge Bases
Documentation & Guides
[06/2007] CUDA C++ Programming Guide (Initial Release v1.0) [docs]
[06/2007] PTX ISA Reference (Initial Release v1.0) [docs]
[05/2020] NVIDIA Tuning Guides (Ampere Architecture Launch) [docs]
Community Indices & Tutorials
[01/2024] GPU-MODE Resource Stream [list]
[01/2024] Triton Index [list]
[06/2016] Awesome-CUDA [list]
[12/2023] Awesome-GPU-Engineering [list]
[05/2023] LeetCUDA — CUDA Programming Exercises [code]
[01/2023] Triton-Puzzles — Puzzles for Learning Triton [code]
[01/2011] Colfax Research — technical hub dedicated to High-Performance Computing (HPC) and AI [link]
[09/2018] Nsight Compute — GPU Kernel Profiling Guide [docs]
[07/2024] CUDA Course [docs]
[actively maintained] HGPU — high performance computing on graphics processing units [link]
Benchmarks
This section surveys the landscape of kernel generation benchmarking, providing a structured analysis of key evaluation frameworks.
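A recurring design choice in these benchmarks is how to aggregate correctness and performance into a single score. KernelBench, for example, reports fast_p: the fraction of tasks on which the generated kernel is both functionally correct and more than p times faster than the baseline. A minimal sketch, assuming per-task `(correct, speedup)` pairs have already been measured:

```python
def fast_p(results, p=1.0):
    """Fraction of tasks whose kernel is correct AND beats the baseline
    by more than a factor of p (fast_1 = 'correct and faster').
    `results` is a list of (correct: bool, speedup: float) pairs."""
    if not results:
        return 0.0
    hits = sum(1 for correct, speedup in results if correct and speedup > p)
    return hits / len(results)
```

Raising p makes the metric stricter, which matters because trivially correct but slow kernels would otherwise inflate a pure pass rate.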
[01/2024] Can Large Language Models Write Parallel Code? [paper] | [code]
[02/2025] KernelBench: Can LLMs Write Efficient GPU Kernels? [blog] | [paper] | [code]
[02/2025] TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators [paper] | [code]
[07/2025] MultiKernelBench: A Multi-Platform Benchmark for Kernel Generation [paper] | [code]
[07/2025] GEAK: Introducing Triton Kernel AI Agent & Evaluation Benchmarks [blog] | [paper] | [code]
[09/2025] Towards Robust Agentic CUDA Kernel Benchmarking, Verification, and Optimization [paper] | [code]
[10/2025] TritonGym: A Benchmark for Agentic LLM Workflows in Triton GPU Code Generation [paper]
[09/2025] BackendBench [code]
[10/2025] From Large to Small: Transferring CUDA Optimization Expertise via Reasoning Graph [paper]
[01/2026] FlashInfer-Bench: Building the Virtuous Cycle for AI-driven LLM Systems [paper] | [blog] | [Competition]
[02/2026] ISO-Bench: Can Coding Agents Optimize Real-World Inference Workloads? [paper] | [code] | [project]
[03/2026] ComputeEval [code]
[03/2026] KernelArena [code] | [project]
[03/2026] KernelCraft: Benchmarking for Agentic Close-to-Metal Kernel Generation on Emerging Hardware [paper]
[03/2026] SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits [paper] | [project]
[03/2026] CelloAI Benchmarks: Toward Repeatable Evaluation of AI Assistants? [paper]
[03/2026] Making LLMs Optimize Multi-Scenario CUDA Kernels Like Experts [paper]
[03/2026] KernelBench-v3 [code]
Contributing
Given the rapid pace of research in LLM-driven kernel generation, we may have inadvertently overlooked some key papers. Contributions to this repository are highly encouraged! Please feel free to submit a pull request or open an issue to share additions or feedback.
Citation
An early, longer preprint of this work was released on TechRxiv and reflects an initial, exploratory stage of the survey. The current arXiv manuscript is a substantially improved and condensed revision, incorporating many additional recent works and a more focused, carefully refined presentation. If you find this work useful, please cite it as:

@misc{yu2026automatedkernelgenerationera,
  title={Towards Automated Kernel Generation in the Era of LLMs},
  author={Yang Yu and Peiyu Zang and Chi Hsu Tsai and Haiming Wu and Yixin Shen and Jialing Zhang and Haoyu Wang and Zhiyou Xiao and Jingze Shi and Yuyu Luo and Wentao Zhang and Chunlei Men and Guang Liu and Yonghua Lin},
  year={2026},
  eprint={2601.15727},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2601.15727},
}