HPC-Ops is a production-grade, high-performance, and easy-to-use operator library for LLM inference, developed by the Tencent Hunyuan AI Infra team.
Key Features
SOTA Performance & Production-Proven: Deeply optimized kernels tailored for NVIDIA H20 GPUs, delivering SOTA performance with up to 2.22x speedup over existing baselines. HPC-Ops powers large-scale production inference at Tencent.
Easy to Integrate: A clean API designed for seamless integration into popular inference frameworks like vLLM and SGLang.
Rich Precision Support: Native support for multiple data types including BF16 and FP8 with different quantization schemes.
A Modern CUDA Tutorial: Hands-on examples of building SOTA kernels with CuTe and CUTLASS in just hundreds of lines.
Performance
Maximum observed speedup per operator
| Operator | Baseline | Prefill | Decode |
|---|---|---|---|
| Attention (bf16) | FlashInfer, FA2, FA3, TensorRT-LLM | 1.33x | 2.22x |
| Attention (fp8) | FlashInfer, FA3, TensorRT-LLM | 1.12x | 2.0x |
| FusedMoE (fp8) | TensorRT-LLM, vLLM | 1.49x | 1.14x |
| GroupGEMM (fp8) | DeepGEMM | 1.1x | 1.88x |
We report the maximum observed speedup to highlight the optimization potential; actual gains vary substantially across cases.
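To make the metric concrete, here is a minimal sketch of how a per-case speedup and its maximum could be computed. The case names and latencies are hypothetical placeholders, not measurements from HPC-Ops, and the geometric mean is shown only as one possible complementary summary.

```python
from statistics import geometric_mean

# Hypothetical per-case latencies in milliseconds (placeholders, not measurements).
baseline_ms = {"case_a": 1.00, "case_b": 2.00, "case_c": 4.00}
candidate_ms = {"case_a": 0.80, "case_b": 1.00, "case_c": 3.20}

def speedups(baseline, candidate):
    """Per-case speedup: baseline latency divided by candidate latency."""
    return {name: baseline[name] / candidate[name] for name in baseline}

per_case = speedups(baseline_ms, candidate_ms)

# The maximum shows the best case; the geometric mean summarizes the typical case.
max_speedup = max(per_case.values())
typical_speedup = geometric_mean(list(per_case.values()))

print(f"max: {max_speedup:.2f}x, geometric mean: {typical_speedup:.2f}x")
```

A maximum can sit well above the typical case, which is why the note above stresses that performance varies across cases.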
Supported Kernels
Attention
Decode, Prefill: Optimized kernels for all attention phases, including paged attention.
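For readers new to paged attention, the following is a minimal pure-Python reference of the idea, not the HPC-Ops kernel or its API. It assumes a single head and a single decode query. The names `block_table`, `page_size`, and `paged_decode_attention` are illustrative choices for this sketch.

```python
import math

def attention(query, keys, values):
    """Single-head softmax attention for one query vector over contiguous K/V."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]

def gather_paged(cache, block_table, seq_len, page_size):
    """Collect the first seq_len rows of a sequence from its pages, in logical order."""
    rows = []
    for pos in range(seq_len):
        page = block_table[pos // page_size]  # physical page holding this position
        rows.append(cache[page][pos % page_size])
    return rows

def paged_decode_attention(query, k_cache, v_cache, block_table, seq_len, page_size):
    """Decode-step attention where K/V live in non-contiguous fixed-size pages."""
    keys = gather_paged(k_cache, block_table, seq_len, page_size)
    values = gather_paged(v_cache, block_table, seq_len, page_size)
    return attention(query, keys, values)

# Three cached tokens stored in pages of two rows each.
page_size = 2
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
block_table = [2, 0]  # logical page 0 -> physical page 2, logical page 1 -> physical page 0
pad = [0.0, 0.0]
k_cache = [[keys[2], pad], [pad, pad], [keys[0], keys[1]]]
v_cache = [[values[2], pad], [pad, pad], [values[0], values[1]]]
query = [0.5, -0.5]

paged_out = paged_decode_attention(query, k_cache, v_cache, block_table, len(keys), page_size)
contiguous_out = attention(query, keys, values)
```

The paged result matches the contiguous one because the block table only changes where rows are stored, not which rows take part. An optimized kernel would read pages in place rather than gathering them first.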
Grouped GEMM
Quantized Grouped GEMM: FP8 weights with block-wise or per-tensor scaling
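The semantics of a grouped GEMM with scaled weights can be sketched in plain Python. This is a reference for the arithmetic only, not the HPC-Ops interface. The function names are hypothetical, ordinary floats stand in for FP8 codes, and the sketch assumes scales apply to weights only, with square tiles for the block-wise case.

```python
def matmul(a, b):
    """Plain matrix product of row-major nested lists: (m x k) @ (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def dequant_per_tensor(w_q, scale):
    """Per-tensor scaling: one scale for the whole weight matrix."""
    return [[v * scale for v in row] for row in w_q]

def dequant_block_wise(w_q, scales, block):
    """Block-wise scaling: scales[i // block][j // block] covers one block x block tile."""
    return [[v * scales[i // block][j // block] for j, v in enumerate(row)]
            for i, row in enumerate(w_q)]

def grouped_gemm(x, group_sizes, weights):
    """Multiply each contiguous group of rows of x by that group's own weight matrix."""
    out, start = [], 0
    for size, w in zip(group_sizes, weights):
        out.extend(matmul(x[start:start + size], w))
        start += size
    return out

# Three tokens: the first two belong to group 0, the last one to group 1.
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
w0 = dequant_per_tensor([[1.0, 0.0], [0.0, 1.0]], scale=0.5)
w1 = dequant_block_wise([[2.0, 2.0], [2.0, 2.0]], scales=[[1.0, 0.5], [0.5, 1.0]], block=1)

result = grouped_gemm(x, group_sizes=[2, 1], weights=[w0, w1])
```

Block-wise scaling stores more scale values than per-tensor scaling but tracks local value ranges more closely, which is the usual motivation for offering both.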
Fused MoE
Quantized Fused MoE: FP8 expert weights with block-wise or per-tensor scaling
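As background, the computation that a fused MoE kernel combines into one pass can be written as an unfused reference. The sketch below is illustrative only. It assumes top-k routing with softmax over the selected experts, which is one common convention, and it models each expert as a plain Python callable. None of the names come from HPC-Ops.

```python
import math

def top_k_routing(logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their scores."""
    chosen = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]
    peak = max(logits[e] for e in chosen)
    exps = [math.exp(logits[e] - peak) for e in chosen]
    total = sum(exps)
    return [(e, w / total) for e, w in zip(chosen, exps)]

def moe_forward(tokens, router_logits, experts, k):
    """Unfused reference: route each token, run its experts, sum the weighted outputs."""
    outputs = []
    for token, logits in zip(tokens, router_logits):
        mixed = [0.0] * len(token)
        for expert_id, gate in top_k_routing(logits, k):
            expert_out = experts[expert_id](token)
            mixed = [m + gate * y for m, y in zip(mixed, expert_out)]
        outputs.append(mixed)
    return outputs

# Three toy experts standing in for per-expert feed-forward networks.
experts = [
    lambda t: [2.0 * v for v in t],
    lambda t: [-v for v in t],
    lambda t: [v + 1.0 for v in t],
]
tokens = [[1.0, 2.0]]
router_logits = [[0.0, 5.0, 5.0]]  # experts 1 and 2 tie for the top two slots

moe_out = moe_forward(tokens, router_logits, experts, k=2)
```

In a production kernel the per-expert work is typically batched as a grouped GEMM over tokens sorted by expert, and the routing, expert computation, and weighted combine are fused to avoid extra memory round trips.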
Quick Start
Requirements
NVIDIA SM90 architecture GPU
Python 3.8 or higher
Compilers with C++17 support
CUDA Toolkit: CUDA 12.8 or higher
You can set up the environment by installing the modules listed in requirements-dev.txt.
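The requirements above can be checked before building with a small script. This is a sketch and not part of the repository. It assumes `nvcc` and `nvidia-smi` are on `PATH` and that the installed driver supports the `compute_cap` query field. SM90 corresponds to compute capability 9.0.

```python
import re
import subprocess
import sys

def run(cmd):
    """Run a command and return its stdout, or an empty string if it is unavailable."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True, check=False).stdout
    except FileNotFoundError:
        return ""

def parse_version(text, pattern):
    """Extract a (major, minor) pair using a regex with two groups, or None."""
    match = re.search(pattern, text)
    return (int(match.group(1)), int(match.group(2))) if match else None

def check_requirements(python_version, cuda_release, compute_cap):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if python_version < (3, 8):
        problems.append("Python 3.8 or higher is required")
    if cuda_release is None or cuda_release < (12, 8):
        problems.append("CUDA Toolkit 12.8 or higher is required")
    if compute_cap != (9, 0):
        problems.append("an SM90 (compute capability 9.0) GPU is required")
    return problems

def check_this_machine():
    """Query the local toolchain and the first GPU, then apply the checks."""
    cuda = parse_version(run(["nvcc", "--version"]), r"release (\d+)\.(\d+)")
    cap = parse_version(
        run(["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"]),
        r"(\d+)\.(\d+)",
    )
    return check_requirements(tuple(sys.version_info[:2]), cuda, cap)

# Example with sample tool output instead of a live query.
sample_cuda = parse_version("Cuda compilation tools, release 12.8, V12.8.61", r"release (\d+)\.(\d+)")
sample_problems = check_requirements((3, 10), sample_cuda, (9, 0))
```

Calling `check_this_machine()` on the build host returns the list of unmet requirements. The C++17 compiler check is left out because the right compiler depends on the local toolchain.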
Install from Source
git clone https://github.com/Tencent/hpc-ops.git
cd hpc-ops
# build packages
make wheel
python3 -m pip install dist/*.whl
Basic Usage
Example: GroupGEMM fp8 kernel usage
For the usage of other operators, please refer to the corresponding test files in the tests/ directory.
Roadmap
Sparse Attention Kernels: Optimized for long-context LLMs, these kernels boost throughput for memory-bound workloads.
Extended Quantization Support: Kernel optimizations for quantized attention and GEMM with flexible strategies, including 4-bit/8-bit mixed precision, that balance speed and accuracy.
Compute-Communication Boundary-Breaking Kernels: Overlap computation with inter-GPU communication to minimize overhead in multi-node/multi-GPU distributed inference.
We welcome targeted, high-impact contributions. Whether you are fixing edge-case kernel bugs or submitting optimizations for niche LLM inference scenarios, your PRs will help refine this toolkit for production use.
⭐ Star this repo to follow our progress.
We’re continuously improving performance to make your LLM inference faster and more efficient.
More improvements are on the way.