McFlashInfer is a library and kernel generator for Large Language Models (LLMs) that provides high-performance implementations of LLM GPU kernels, such as FlashAttention, SparseAttention, PageAttention, and Sampling, on the MACA platform. McFlashInfer focuses on LLM serving and inference and delivers state-of-the-art performance across diverse scenarios.
Create Conda Env
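As a rough sketch of this step: the environment name `mcflashinfer` and Python version `3.10` below are placeholders chosen for illustration, not values specified by this project, so substitute whatever your MACA toolchain supports.

```shell
# Hypothetical environment name and Python version; adjust to your setup.
ENV_NAME="mcflashinfer"
PYTHON_VERSION="3.10"

if command -v conda >/dev/null 2>&1; then
    # -y skips the interactive confirmation prompt.
    conda create -y -n "$ENV_NAME" "python=$PYTHON_VERSION"
else
    echo "conda not found on PATH; install Miniconda or Anaconda first." >&2
fi
```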
Activate Conda Env
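A minimal sketch of activation, assuming a bash shell and an environment named `mcflashinfer` (a placeholder name, not one defined by this project). In an interactive shell that has already run `conda init`, the `conda activate` line alone is usually enough.

```shell
# Hypothetical environment name; use the name you chose when creating it.
ENV_NAME="mcflashinfer"

if command -v conda >/dev/null 2>&1; then
    # Load conda's shell functions so `conda activate` works in scripts
    # and non-interactive bash sessions.
    eval "$(conda shell.bash hook)"
    conda activate "$ENV_NAME"
else
    echo "conda not found on PATH; install Miniconda or Anaconda first." >&2
fi
```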
Install Dependencies
Set Environment Variables
Build
Clean build artifacts if needed.
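One possible way to clean up, assuming the conventional Python packaging layout where artifacts land in `build/`, `dist/`, and `*.egg-info` under the repository root. These paths are an assumption; the project's build may write elsewhere.

```shell
# Assumption: run from the repository root. These are the usual
# setuptools output locations, not paths confirmed by this README.
rm -rf build dist ./*.egg-info
```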
Build AOT kernels and create FlashInfer distributions.
Do not use JIT mode; it is not yet stable.
Install Wheel
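A hedged sketch of the install step, assuming the build wrote its wheel to `dist/`, the default output directory for Python build front ends. The exact wheel filename depends on the version and platform, so the snippet globs for it rather than naming it.

```shell
# Assumption: the wheel was written to dist/ by the build step.
wheel=""
for f in dist/*.whl; do
    # If the glob matched nothing, $f is the literal pattern; skip it.
    if [ -e "$f" ]; then
        wheel="$f"   # keeps the last match in glob (alphabetical) order
    fi
done

if [ -n "$wheel" ]; then
    python -m pip install "$wheel"
else
    echo "No wheel found in dist/; run the build step first." >&2
fi
```

If `dist/` holds wheels from several builds, clean it first or pass the intended file to `pip install` explicitly.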