# FlashMLA on MXMACA

We provide the implementation of FlashMLA from FlashAttention-2 (version 2.6.3), based on the MACA toolkit and C500 chips.

FlashAttention-2 currently supports:
## How to run on MXMACA Device

### Installation

Requirements:

To install flash-attn in a conda environment:
#### Set environment variables

```bash
export MACA_PATH=/your/maca/path
export CUDA_PATH=$MACA_PATH/tools/cu-bridge
export MACA_CLANG_PATH=$MACA_PATH/mxgpu_llvm/bin
export LD_LIBRARY_PATH=$MACA_PATH/lib:$MACA_PATH/mxgpu_llvm/lib:$MACA_PATH/ompi/lib:$LD_LIBRARY_PATH
```
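Before building, it can help to confirm these variables are actually visible to the process that will run the build. The snippet below is a minimal sanity-check sketch, not part of this repository; the variable names simply mirror the export lines above.

```python
# Minimal sanity check (not part of this repo): verify the MACA-related
# environment variables are set before running the build.
import os

required = ["MACA_PATH", "CUDA_PATH", "MACA_CLANG_PATH", "LD_LIBRARY_PATH"]
missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
print({name: os.environ[name] for name in required})
```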
#### Install

```bash
python setup.py install
```
### Benchmark

```bash
python tests/test_flash_mla.py
```
### Usage

```python
from flash_mla import get_mla_metadata, flash_mla_with_kvcache

tile_scheduler_metadata, num_splits = get_mla_metadata(cache_seqlens, s_q * h_q // h_kv, h_kv)

for i in range(num_layers):
    ...
    o_i, lse_i = flash_mla_with_kvcache(
        q_i, kvcache_i, block_table, cache_seqlens, dv,
        tile_scheduler_metadata, num_splits, causal=True,
    )
    ...
```
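For a more concrete picture of what the arguments look like, the sketch below builds a self-contained set of inputs and runs a single call. The tensor shapes, dtype, block size (64), `h_kv = 1`, and head dims (`d = 576`, `dv = 512`) are assumptions chosen to match a typical MLA decoding setup; `tests/test_flash_mla.py` is the authoritative reference for the exact layouts, and the `"cuda"` device string is assumed to route to the MXMACA device through cu-bridge.

```python
# Illustrative sketch only; shapes, dims, block size, and device string are
# assumptions -- check tests/test_flash_mla.py for the exact layouts.
import torch
from flash_mla import get_mla_metadata, flash_mla_with_kvcache

batch, s_q, h_q, h_kv = 4, 1, 128, 1       # decoding: one new query token per request
d, dv, block_size = 576, 512, 64           # assumed MLA head dims and KV block size
device, dtype = "cuda", torch.bfloat16     # "cuda" assumed to map to the MXMACA device

cache_seqlens = torch.full((batch,), 1024, dtype=torch.int32, device=device)
max_blocks = (int(cache_seqlens.max()) + block_size - 1) // block_size

# Query and paged (blocked) KV cache; block_table maps each request to its cache blocks.
q = torch.randn(batch, s_q, h_q, d, dtype=dtype, device=device)
kvcache = torch.randn(batch * max_blocks, block_size, h_kv, d, dtype=dtype, device=device)
block_table = torch.arange(batch * max_blocks, dtype=torch.int32,
                           device=device).view(batch, max_blocks)

# Scheduling metadata is computed once per decoding step and reused for every layer.
tile_scheduler_metadata, num_splits = get_mla_metadata(cache_seqlens, s_q * h_q // h_kv, h_kv)

o, lse = flash_mla_with_kvcache(
    q, kvcache, block_table, cache_seqlens, dv,
    tile_scheduler_metadata, num_splits, causal=True,
)
print(o.shape, lse.shape)  # expected: (batch, s_q, h_q, dv) and the per-head log-sum-exp
```

As in the loop above, `get_mla_metadata` is called once per decoding step and its outputs are shared across all layers.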
## Requirements

## Acknowledgement

FlashMLA is inspired by the FlashAttention 2 & 3 and CUTLASS projects.
## Citation

```bibtex
@misc{flashmla2025,
      title = {FlashMLA: Efficient MLA decoding kernel},
      author = {Jiashi Li},
      year = {2025},
      publisher = {GitHub},
      howpublished = {\url{https://github.com/deepseek-ai/FlashMLA}},
}
```