🔥 Paper 🔥
The integration of Large Language Models (LLMs) and agentic systems marks a pivotal shift in high-performance computing, transforming kernel engineering from a labor-intensive, expert-dependent process into a scalable, automated workflow. To provide a systematic perspective on this rapidly evolving field, we summarize the related literature below, organized according to the taxonomy proposed in the survey. We categorize these works into four main streams:
📌 Table of Contents (ToC)
LLM4Kernel
Applying LLMs to kernel synthesis raises universal challenges in correctness and performance-sensitive code structuring across diverse programming abstractions. To address these challenges, this section reviews the two principal post-training methodologies that dominate current research: supervised fine-tuning (SFT) and reinforcement learning (RL).
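A common ingredient across the RL entries below is a verifiable reward that combines functional correctness with measured speedup over a reference implementation. The sketch below is a minimal, framework-free illustration; the `bench` helper, the tolerances, and the reward shaping are assumptions for exposition, not the recipe of any particular paper.

```python
import time

def bench(fn, args, iters=50):
    """Median wall-clock time over several runs. Real harnesses add
    warm-up and device-side timing (e.g. CUDA events)."""
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - t0)
    return sorted(times)[iters // 2]

def kernel_reward(candidate, reference, args, tol=1e-5, speedup_weight=0.5):
    """Hypothetical reward shaping: zero for code that crashes or is
    numerically wrong, otherwise a base reward plus a speedup bonus."""
    try:
        out = candidate(*args)
    except Exception:
        return 0.0  # compilation/runtime failure earns nothing
    ref = reference(*args)
    if len(out) != len(ref) or any(abs(a - b) > tol for a, b in zip(out, ref)):
        return 0.0  # incorrect output earns nothing
    speedup = bench(reference, args) / max(bench(candidate, args), 1e-12)
    return 1.0 + speedup_weight * max(0.0, speedup - 1.0)
```

The hard zero on incorrect outputs is what prevents a policy from trading correctness for speed, the reward-hacking failure mode that titles like TritonRL's "Without Cheating" allude to.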
SFT
[10/2025] ConCuR: Conciseness Makes State-of-the-Art Kernel Generation [paper] | [code]
[06/2025] KernelLLM: Making Kernel Development More Accessible [link]
RL
[09/2025] Mastering Sparse CUDA Generation through Pretrained Models and Deep Reinforcement Learning [paper]
[07/2025] Kevin: Multi-Turn RL for Generating CUDA Kernels [paper]
[11/2025] QiMeng-Kernel: Macro-Thinking Micro-Coding Paradigm for LLM-Based High-Performance GPU Kernel Generation [paper]
[07/2025] AutoTriton: Automatic Triton Programming with Reinforcement Learning in LLMs [paper]
[10/2025] TritonRL: Training LLMs to Think and Code Triton Without Cheating [paper]
[07/2025] CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning [paper] | [code]
[12/2025] CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning [paper]
[01/2026] AscendKernelGen: A Systematic Study of LLM-Based Kernel Generation for Neural Processing Units [paper]
[02/2026] Dr. Kernel: Reinforcement Learning Done Right for Triton Kernel Generations [paper] | [code]
[02/2026] Improving HPC Code Generation Capability of LLMs via Online Reinforcement Learning with Real-Machine Benchmark Rewards [paper]
[02/2026] CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation [paper] | [code]
[03/2026] Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization [paper]
Agent4Kernel
While foundational LLMs are often limited to static, one-pass inference, agentic systems introduce an autonomous, closed-loop paradigm characterized by iterative planning, tool use, and feedback-driven refinement. This shift enables scalable, long-horizon optimization beyond the reach of manual or single-pass methods. To systematically evaluate this landscape, we categorize agent-driven advancements into four structural dimensions: learning mechanisms, external memory management, hardware profiling integration, and multi-agent orchestration.
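That closed loop can be sketched in a few lines. Everything here (the `llm` and `compile_and_profile` callables, the best-so-far bookkeeping) is an illustrative placeholder, not the API of any system listed below:

```python
def optimize_kernel(task, llm, compile_and_profile, max_iters=8):
    """Minimal generate-verify-refine loop: propose a kernel, collect
    compiler/profiler feedback, and fold it into the next prompt."""
    best_code, best_time = None, float("inf")
    feedback = "none yet"
    for _ in range(max_iters):
        code = llm(f"Task: {task}\nPrevious feedback: {feedback}")
        ok, runtime, log = compile_and_profile(code)
        if ok and runtime < best_time:
            best_code, best_time = code, runtime
        feedback = log  # errors or profiler counters steer the next attempt
    return best_code, best_time
```

The four dimensions below differ mainly in what fills each slot: learning mechanisms change how `llm` improves across iterations, external memory enriches the prompt with past experience, profiling integration widens the feedback beyond pass/fail, and multi-agent orchestration splits the loop across specialized roles.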
Learning Mechanisms
[02/2025] Automating GPU Kernel Generation with DeepSeek-R1 and Inference Time Scaling [blog]
[12/2025] PEAK: A Performance Engineering AI-Assistant for GPU Kernels Powered by Natural Language Transformations [paper]
[12/2025] GPU Kernel Optimization Beyond Full Builds: An LLM Framework with Minimal Executable Programs [paper]
[01/2026] DiffBench Meets DiffAgent: End-to-End LLM-Driven Diffusion Acceleration Code Generation [paper]
[12/2025] Agentic Operator Generation for ML ASICs [paper]
[10/2025] KernelGen [link]
[01/2026] MaxCode: A Max-Reward Reinforcement Learning Framework for Automated Code Optimization [paper]
[01/2026] AscendCraft: Automatic Ascend NPU Kernel Generation via DSL-Guided Transcompilation [paper]
[02/2026] K-Search: LLM Kernel Generation via Co-Evolving Intrinsic World Model [paper] | [code]
[11/2025] Autocomp: A Powerful and Portable Code Optimizer for Tensor Accelerators [paper]
[09/2025] Towards Robust Agentic CUDA Kernel Benchmarking, Verification, and Optimization [paper] | [code]
[10/2025] The FM Agent [paper]
[10/2025] EvoEngineer: Mastering Automated CUDA Kernel Code Evolution with Large Language Models [paper]
[06/2025] GPU Kernel Scientist: An LLM-Driven Framework for Iterative Kernel Optimization [paper]
[06/2025] IntelliPerf: Profiling-guided LLM framework for iterative GPU kernel optimization on AMD GPUs [blog] | [code]
[12/2025] cuPilot: A Strategy-Coordinated Multi-agent Framework for CUDA Kernel Evolution [paper]
[03/2026] AutoKernel: Autonomous GPU Kernel Optimization via Iterative Agent-Driven Search [paper] | [code]
[03/2026] KernelFoundry: Hardware-Aware Evolutionary GPU Kernel Optimization [paper]
[03/2026] AVO: Agentic Variation Operators for Autonomous Evolutionary Search [paper]
[03/2026] AKO: Agentic Kernel Optimization (a harness for existing coding agents) [project] | [code]
[04/2026] CuTeGen: An LLM-Based Agentic Framework for Generation and Optimization of High-Performance GPU Kernels using CuTe [paper]
External Memory / Experience / Skill Management
[02/2025] The AI CUDA Engineer: Agentic CUDA Kernel Discovery, Optimization and Composition [link]
[12/2025] KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta [paper]
[10/2025] From Large to Small: Transferring CUDA Optimization Expertise via Reasoning Graph [paper]
[02/2026] KernelBlaster: Continual Cross-Task CUDA Optimization via Memory-Augmented In-Context Reinforcement [paper]
[03/2026] Towards Cold-Start Drafting and Continual Refining: A Value-Driven Memory Approach with Application to NPU Kernel Synthesis [paper] | [project]
[03/2026] KernelSkill: A Multi-Agent Framework for GPU Kernel Optimization [paper] | [code]
Hardware Profiling Integration
[03/2025] IntelliKit: LLM-ready profiling and analysis toolkit for AMD GPUs [code]
[05/2025] QiMeng-TensorOp: Automatically Generating High-Performance Tensor Operators with Hardware Primitives [paper]
[04/2025] QiMeng-GEMM: Automatically Generating High-Performance Matrix Multiplication Code by Exploiting Large Language Models [paper]
[07/2025] QiMeng-Attention: SOTA Attention Operator is generated by SOTA Attention Algorithm [paper]
[08/2025] SwizzlePerf: Hardware-Aware LLMs for GPU Kernel Performance Optimization [paper]
[06/2025] CUDA-LLM: LLMs Can Write Efficient CUDA Kernels [paper]
[12/2025] TritonForge: Profiling-Guided Framework for Automated Triton Kernel Optimization [paper]
[11/2025] PRAGMA: A Profiling-Reasoned Multi-Agent Framework for Automatic Kernel Optimization [paper]
[11/2025] KernelBand: Boosting LLM-based Kernel Optimization with a Hierarchical and Hardware-aware Multi-armed Bandit [paper]
[10/2025] Integrating Performance Tools in Model Reasoning for GPU Kernel Optimization [paper]
Multi-Agent Orchestration
[10/2025] STARK: Strategic Team of Agents for Refining Kernels [paper]
[06/2025] AKG: AI-Powered Automatic Kernel Generator [paper] | [code]
[10/2025] CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization [paper] | [code]
[11/2025] KForge: Program Synthesis for Diverse AI Hardware Accelerators [paper]
[11/2025] KernelFalcon: Autonomous GPU Kernel Generation via Deep Agents [blog]
[09/2025] Astra: A Multi-Agent System for GPU Kernel Performance Optimization [paper]
[07/2025] GEAK: Introducing Triton Kernel AI Agent & Evaluation Benchmarks [blog] | [paper] | [code]
[01/2026] A Two-Stage GPU Kernel Tuner Combining Semantic Refactoring and Search-Based Optimization [paper]
Datasets
High-quality data in this domain is defined not merely by volume, but by its ability to bridge the semantic gap between high-level algorithms and low-level hardware optimizations. In this section, we survey the data landscape and organize the available resources. The dates listed correspond to the initial release of each GitHub repository; note that these libraries remain under active development, with continuous updates and optimizations following their inception.
Structured Datasets
[02/2024] The Stack v2 (HPC Subset) [paper] | [dataset]
[06/2024] HPC-Instruct — A Dataset for HPC Code Optimization [paper] | [dataset]
[05/2025] KernelBook — Torch-Triton Aligned Corpus [dataset] | [repo]
[02/2025] KernelBench Samples — Optimization Tasks & Performance Traces [dataset]
Source Code Repositories
Operator and Kernel Libraries
[12/2017] CUTLASS — CUDA C++ Template Library [code]
[05/2022] FlashAttention — Fast and Memory-Efficient Exact Attention [paper] | [code]
[11/2023] FlagAttention — Memory Efficient Attention Operators Implemented in Triton [code]
[02/2024] AoTriton — AOT-compiled Triton Kernels for AMD ROCm [code]
[11/2021] xFormers — Hackable and Optimized Transformer Building Blocks [code]
[08/2024] Liger-Kernel — Efficient Training Kernels for LLMs [code]
[04/2024] FlagGems — Triton-based Operator Library [code]
[09/2022] Bitsandbytes — 8-bit Quantization Wrappers for LLMs [code]
[09/2024] Gemlite — Triton Kernels for Efficient Low-Bit Matrix Multiplication [code]
[11/2024] AITER — AMD operator and kernel library for high-performance AI workloads [code]
[01/2025] FlashInfer — Kernel Library for LLM Serving [code]
[05/2021] FBGEMM — Low-precision High-performance Matrix Multiplication [code]
[09/2022] Transformer Engine — FP8 Acceleration Library for Transformer Models [code]
Frameworks and System Integration Code
[10/2016] PyTorch (ATen) — Foundational Tensor Library for C++ and Python [code]
[06/2023] vLLM — Easy, Fast, and Cheap LLM Serving [paper] | [code]
[12/2023] SGLang — Structured Generation Language for LLMs [code]
[03/2023] llama.cpp — C/C++ Inference Port of LLaMA Models [code]
[03/2025] IntelliKit — LLM-ready profiling and analysis toolkit for AMD GPUs [code]
[08/2023] TensorRT-LLM — TensorRT for LLM Inference [code]
[10/2019] DeepSpeed — System for Large Scale Model Training [paper] | [code]
Domain-Specific Languages and Emerging Abstractions
[07/2019] Triton — Open-Source GPU Programming Language [paper] | [code]
[03/2024] ThunderKittens — Tile primitives for CUDA [paper] | [code]
[04/2024] TileLang — Intermediate Language for Tile-based Optimization [code]
[06/2024] tt-metal — Bare Metal Programming on Tenstorrent [code]
[12/2025] cuTile — NVIDIA DSL for Tile-centric Programming [docs]
Knowledge Bases
Documentation & Guides
[06/2007] CUDA C++ Programming Guide (Initial Release v1.0) [docs]
[06/2007] PTX ISA Reference (Initial Release v1.0) [docs]
[05/2020] NVIDIA Tuning Guides (Ampere Architecture Launch) [docs]
Community Indices & Tutorials
[01/2024] GPU-MODE Resource Stream [list]
[01/2024] Triton Index [list]
[06/2016] Awesome-CUDA [list]
[12/2023] Awesome-GPU-Engineering [list]
[05/2023] LeetCUDA — CUDA Programming Exercises [code]
[01/2023] Triton-Puzzles — Puzzles for Learning Triton [code]
[01/2011] Colfax Research — technical hub dedicated to High-Performance Computing (HPC) and AI [link]
[09/2018] Nsight Compute — GPU Kernel Profiling Guide [docs]
[07/2024] CUDA Course [docs]
[actively maintained] HGPU — high performance computing on graphics processing units [link]
Benchmarks
This section surveys the landscape of kernel generation benchmarking, providing a structured analysis of key evaluation frameworks.
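A recurring design choice in these benchmarks is how to aggregate correctness and performance into a single score. KernelBench, for example, reports fast_p: the fraction of tasks on which the generated kernel is both functionally correct and more than p times faster than the baseline. A minimal sketch, assuming per-task `(correct, speedup)` pairs have already been measured:

```python
def fast_p(results, p=1.0):
    """Fraction of tasks whose kernel is correct AND beats the baseline
    by more than a factor of p (fast_1 = 'correct and faster').
    `results` is a list of (correct: bool, speedup: float) pairs."""
    if not results:
        return 0.0
    hits = sum(1 for correct, speedup in results if correct and speedup > p)
    return hits / len(results)
```

Raising p makes the metric stricter, which matters because trivially correct but slow kernels would otherwise inflate a pure pass rate.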
[01/2024] Can Large Language Models Write Parallel Code? [paper] | [code]
[02/2025] KernelBench: Can LLMs Write Efficient GPU Kernels? [blog] | [paper] | [code]
[02/2025] TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators [paper] | [code]
[07/2025] MultiKernelBench: A Multi-Platform Benchmark for Kernel Generation [paper] | [code]
[07/2025] GEAK: Introducing Triton Kernel AI Agent & Evaluation Benchmarks [blog] | [paper] | [code]
[09/2025] Towards Robust Agentic CUDA Kernel Benchmarking, Verification, and Optimization [paper] | [code]
[10/2025] TritonGym: A Benchmark for Agentic LLM Workflows in Triton GPU Code Generation [paper]
[09/2025] BackendBench [code]
[10/2025] From Large to Small: Transferring CUDA Optimization Expertise via Reasoning Graph [paper]
[01/2026] FlashInfer-Bench: Building the Virtuous Cycle for AI-driven LLM Systems [paper] | [blog] | [Competition]
[02/2026] ISO-Bench: Can Coding Agents Optimize Real-World Inference Workloads? [paper] | [code] | [project]
[03/2026] ComputeEval [code]
[03/2026] KernelArena [code] | [project]
[03/2026] KernelCraft: Benchmarking for Agentic Close-to-Metal Kernel Generation on Emerging Hardware [paper]
[03/2026] SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits [paper] | [project]
[03/2026] CelloAI Benchmarks: Toward Repeatable Evaluation of AI Assistants? [paper]
[03/2026] Making LLMs Optimize Multi-Scenario CUDA Kernels Like Experts [paper]
[03/2026] KernelBench-v3 [code]
Contributing
Given the rapid pace of research in LLM-driven kernel generation, we may have inadvertently overlooked some key papers. Contributions to this repository are highly encouraged! Please feel free to submit a pull request or open an issue to share additions or feedback.
Citation
An early, longer preprint of this work was released on TechRxiv and reflects an initial, exploratory stage of the survey. The current arXiv manuscript is a substantially improved and condensed revision, incorporating many additional recent works and a more focused, carefully refined presentation. If you find this work useful, please cite it as:

@misc{yu2026automatedkernelgenerationera,
  title={Towards Automated Kernel Generation in the Era of LLMs},
  author={Yang Yu and Peiyu Zang and Chi Hsu Tsai and Haiming Wu and Yixin Shen and Jialing Zhang and Haoyu Wang and Zhiyou Xiao and Jingze Shi and Yuyu Luo and Wentao Zhang and Chunlei Men and Guang Liu and Yonghua Lin},
  year={2026},
  eprint={2601.15727},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2601.15727},
}