Prefetch-enhanced vLLM XFORMERS Attention Kernel

This repository contains the improved attention kernel for vLLM’s XFORMERS backend with prefetching optimizations, as presented in our paper accepted at AAAI 2026.
Overview
This work enhances the vLLM XFORMERS backend attention kernel by incorporating prefetching techniques to improve memory access patterns and reduce latency during attention computation. The key innovation is the addition of asynchronous memory prefetching instructions for key and value cache blocks within the XFORMERS attention implementation, which helps hide memory access latency and improves overall throughput. The experiments in our paper were conducted on an NVIDIA H20 GPU.
Key Improvements
Implemented cp.async.bulk.prefetch.L2.global instructions to asynchronously prefetch the next key and value cache blocks into the L2 cache
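As an illustration of the technique, the sketch below shows how such a prefetch can be issued from CUDA via inline PTX. This is a hypothetical fragment, not the repository's actual kernel code: the helper name, kernel shape, and loop structure are invented for exposition, and only the cp.async.bulk.prefetch.L2.global instruction itself is taken from the list above. The instruction requires a Hopper-class GPU (sm_90+), such as the H20 used in the paper.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// Hint the hardware to start fetching a region of global memory into L2.
// Per the PTX ISA, `bytes` must be a multiple of 16.
__device__ __forceinline__ void prefetch_l2(const void* gmem_ptr,
                                            uint32_t bytes) {
  asm volatile("cp.async.bulk.prefetch.L2.global [%0], %1;"
               :
               : "l"(gmem_ptr), "r"(bytes)
               : "memory");
}

// Hypothetical attention-style loop over KV-cache blocks: issue the prefetch
// for block i+1 before computing on block i, so the DRAM fetch of the next
// block overlaps with the math on the current one.
__global__ void attention_like_kernel(const float* __restrict__ k_cache,
                                      int num_blocks, int block_bytes) {
  for (int i = 0; i < num_blocks; ++i) {
    if (i + 1 < num_blocks) {
      const char* next = reinterpret_cast<const char*>(k_cache) +
                         static_cast<size_t>(i + 1) * block_bytes;
      prefetch_l2(next, block_bytes);
    }
    // ... compute QK^T and softmax over block i here ...
  }
}
```

The prefetch is a hint rather than a blocking load, so the compute on block i proceeds immediately while the L2 fill happens in the background.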
Installation
Replace the csrc/attention/attention_kernels.cuh file in your vLLM project with the attention_kernels.cuh file from this repository
Execute the following command in the vLLM root directory to recompile vLLM:
pip install --no-build-isolation -e .
Note: For detailed instructions on installing vLLM, please refer to the official documentation. This file was last tested with vLLM v0.10.2.
Usage
To use the prefetch-enhanced XFORMERS attention kernel, set the environment variable before starting vLLM:
export VLLM_ATTENTION_BACKEND=XFORMERS
Paper Citation
Details of this work will be available in our paper:

Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching
Acknowledgements

This implementation is based on the vLLM project and NVIDIA’s FasterTransformer implementation. We thank the vLLM team for their excellent work on optimizing large language model inference, especially their XFORMERS backend implementation.
License
This project is licensed under the Apache License 2.0.