InferSim: A Lightweight LLM Inference Performance Simulator
InferSim is a lightweight simulator for LLM inference, written in pure Python with no third-party dependencies. It estimates TTFT, TPOT, and throughput as TGS (tokens/GPU/second) from the computational cost in FLOPs (floating-point operations), GPU compute power in FLOPS (floating-point operations per second), GPU memory bandwidth, and the MFU (Model FLOPs Utilization) obtained by benchmarking state-of-the-art LLM kernels. For multi-GPU and multi-node deployments, InferSim also estimates communication latency from data volume and bandwidth.
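The description above implies a roofline-style estimate: a kernel's latency is bounded below by its compute time (FLOPs over effective FLOPS) and by its memory time (bytes moved over bandwidth), and the larger bound dominates. The sketch below is our illustration of that idea with made-up numbers, not InferSim's actual model:

```python
# Hedged sketch of a roofline-style kernel-time estimate. The function and
# all numbers are illustrative assumptions, not InferSim's internal code.

def kernel_time_s(flops, bytes_moved, peak_flops, mfu, mem_bw):
    """Estimated kernel latency in seconds: the slower of the
    compute-bound and memory(IO)-bound estimates."""
    compute_s = flops / (peak_flops * mfu)  # compute-bound estimate
    memory_s = bytes_moved / mem_bw         # IO-bound estimate
    return max(compute_s, memory_s)

# Example: a GEMM doing 1e12 FLOPs while moving 2e9 bytes, on a GPU with
# 1e15 peak FLOPS at 50% MFU and 3e12 B/s memory bandwidth.
t = kernel_time_s(1e12, 2e9, 1e15, 0.5, 3e12)  # compute-bound: 2 ms
```

Whichever bound wins tells you whether the kernel is compute-bound or IO-bound, which is exactly the bottleneck analysis named in the use cases above.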
The main use cases of InferSim include:
Model-Sys co-design: predicting inference performance given the hyperparameters of a model.
Inference performance analysis: quantifying performance bottlenecks, such as compute-bound or IO-bound, and supporting optimization efforts.
Simulation Result
Actual data was measured with SGLang; see the simulation example in example/qwen3-8B/.
The accuracy of the simulation results depends heavily on the kernel benchmark data. Please help us improve the kernel_benchmark and append your data to bench_data.
Supported Features
Attention: MHA/GQA, MLA. Benchmarked on FlashInfer, FlashAttention-3, FlashMLA.
MoE: GroupedGEMM. Benchmarked on DeepGEMM.
Linear: GEMM. Benchmarked on DeepGEMM.
Parallelization: DP Attn, EP MoE.
Large EP: DeepEP dispatch and combine, with normal and low_latency mode.
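As a rough illustration of the FLOPs accounting behind the GEMM-based kernels above (Linear GEMM and MoE GroupedGEMM), the textbook cost of an M×K by K×N matrix multiply is 2·M·N·K floating-point operations. This is a standard formula we are assuming, not necessarily InferSim's exact internal accounting:

```python
# Hedged sketch: standard multiply-accumulate FLOP counts for dense and
# grouped GEMMs. Formulas are textbook assumptions, not InferSim source.

def gemm_flops(m: int, n: int, k: int) -> int:
    """FLOPs for an (m x k) @ (k x n) GEMM: one multiply and one add
    per output element per reduction step."""
    return 2 * m * n * k

def grouped_gemm_flops(group_shapes) -> int:
    """A grouped (MoE-style) GEMM is just the sum over per-expert GEMMs."""
    return sum(gemm_flops(m, n, k) for m, n, k in group_shapes)

# Example with illustrative sizes: 4096 tokens, hidden 4096, intermediate 14336.
flops = gemm_flops(4096, 14336, 4096)
```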
Help
$ python3 main.py --help
usage: main.py [-h] --config-path CONFIG_PATH [--device-type {H20,H800,H200,GB200}]
[--world-size WORLD_SIZE] [--num-nodes NUM_NODES]
[--max-prefill-tokens MAX_PREFILL_TOKENS] [--decode-bs DECODE_BS]
[--target-tgs TARGET_TGS] [--target-tpot TARGET_TPOT] [--target-isl TARGET_ISL]
[--target-osl TARGET_OSL] [--use-fp8-gemm] [--use-fp8-kv] [--enable-deepep]
[--enable-tbo] [--sm-ratio SM_RATIO] [--prefill-only] [--decode-only]
optional arguments:
-h, --help show this help message and exit
--config-path CONFIG_PATH
The path of the hf model config.json
--device-type {H20,H800,H200,GB200}
Device type
--world-size WORLD_SIZE
Num of GPUs
--num-nodes NUM_NODES
Num of nodes
--max-prefill-tokens MAX_PREFILL_TOKENS
Max prefill tokens per GPU
--decode-bs DECODE_BS
Decoding batch size per GPU. If not specified, bs = tgs * tpot.
--target-tgs TARGET_TGS
Target tokens/s per GPU
--target-tpot TARGET_TPOT
TPOT in ms
--target-isl TARGET_ISL
Input sequence length, in tokens
--target-osl TARGET_OSL
Output sequence length, in tokens
--use-fp8-gemm Use fp8 gemm
--use-fp8-kv Use fp8 kvcache
--enable-deepep Enable DeepEP
--enable-tbo Enable two batch overlap
--sm-ratio SM_RATIO In TBO DeepEP normal mode, the SM ratio used for computation
--prefill-only Only simulate prefill
--decode-only Only simulate decoding
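The --decode-bs default ("bs = tgs * tpot") ties the decode batch size to the throughput and latency targets: if each request produces one token every TPOT, a GPU must decode enough requests concurrently to hit the target TGS. Since TPOT is given in milliseconds, our reading of that formula (a sketch, not InferSim's source) is:

```python
# Minimal sketch of the --decode-bs default, assuming tgs is tokens/s per
# GPU and tpot is milliseconds per output token, as the help text states.
# This is our interpretation of "bs = tgs * tpot", not InferSim's code.

def default_decode_bs(target_tgs: float, target_tpot_ms: float) -> float:
    # Each request yields one token per tpot seconds, so sustaining
    # target_tgs tokens/s needs this many concurrent decode requests.
    return target_tgs * (target_tpot_ms / 1000.0)

bs = default_decode_bs(2000, 50)  # 2000 tok/s/GPU at 50 ms TPOT -> bs = 100
```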
For more details, please check the InferSim Technical Report.
Acknowledgement
This work is developed and maintained by the Alimama AI Infra Team & Future Living Lab, Alibaba Group.