Support empty mode of vllm in PPU (#190)
PR Category
Core
PR Type
New Features
Description
Add _C op schema registration and PPU support
Add _C_ops_registry.py to register torch.ops._C op schemas when the native vllm._C extension is unavailable, enabling torch.compile pattern matching on non-CUDA platforms.
Bundle schemas as a pre-generated Python module (_C_ops_schemas.py) instead of extracting from source at runtime. Add tools/generate_op_schemas.py for regeneration on vLLM upgrades.
Patch vllm.vllm_flash_attn import to stub gracefully when CUDA flash attention extensions are missing.
Add thead (PPU) backend to supported device map.
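For illustration only, the import-stubbing idea described above can be sketched with the standard library. This is not the code from this PR: the helper name `import_or_stub` and the `__stub__` marker attribute are hypothetical, and the real patch may behave differently.

```python
import importlib
import sys
import types


def import_or_stub(name: str) -> types.ModuleType:
    """Import `name`; if that fails, register and return an empty stub module.

    Hypothetical helper: a sketch of stubbing a missing extension so that
    later `import` statements do not raise.
    """
    try:
        return importlib.import_module(name)
    except ImportError:
        stub = types.ModuleType(name)
        stub.__stub__ = True  # hypothetical marker so callers can detect the stub
        sys.modules[name] = stub
        # For dotted names, also expose the stub as an attribute of the parent
        # package when the parent is already imported.
        parent, _, child = name.rpartition(".")
        if parent and parent in sys.modules:
            setattr(sys.modules[parent], child, stub)
        return stub
```

Calling `import_or_stub("vllm.vllm_flash_attn")` on a machine without the CUDA flash attention extension would, under these assumptions, leave an empty placeholder module in `sys.modules` instead of raising.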
Testing:
Verified torch.compile + CUDAGraph capture works for Qwen3-4B, Qwen3.6-27B, and Qwen3.6-35B-A3B.
vllm-plugin-FL
vllm-plugin-FL is a plugin for the vLLM inference/serving framework, built on FlagOS’s unified multi-chip backend — including the unified operator library FlagGems and the unified communication library FlagCX. It extends vLLM’s capabilities and performance across diverse hardware environments. Without changing vLLM’s original interfaces or usage patterns, the same command can run model inference/serving on different chips.
Supported Models and Chips
In theory, vllm-plugin-FL can support all models available in vLLM, as long as no unsupported operators are involved. The tables below summarize the current support status of end-to-end verified models and chips, including both fully supported and in-progress (“Merging”) entries.
Supported Models
Supported Chips
Quick Start
Setup
Install vllm-plugin-FL
2.1 Clone the repository:
2.2 Install:
Install FlagGems
3.1 Install Build Dependencies
3.2 Install FlagGems
Note: On the Sunrize platform, this step depends on FlagGems PR #2949.
(Optional) Install FlagCX
4.1 Clone the repository:
4.2 Build the library with the flags for your target platform:
4.3 Set environment
4.4 Install FlagCX
Note: [xxx] should be chosen according to the current platform, e.g., nvidia, ascend, etc.
If multiple plugins are installed in the current environment, you can select vllm-plugin-FL by setting VLLM_PLUGINS='fl'.
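The plugin selection can also be done from Python. The snippet below is a minimal sketch: it assumes the variable is set before vLLM loads its plugins, so setting it before importing `vllm` is the safest place.

```python
import os

# Restrict vLLM plugin loading to the "fl" plugin.
# Set this before importing vllm so the value is visible when plugins load.
os.environ["VLLM_PLUGINS"] = "fl"

print(os.environ["VLLM_PLUGINS"])
```

Exporting `VLLM_PLUGINS=fl` in the shell before launching vLLM has the same effect.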
Additional Steps for Ascend
Install FlagTree
Set required environment variable
Enable eager execution
Ascend requires eager execution. Add enforce_eager=True to the LLM constructor or pass --enforce-eager on the command line.
Run a Task
Offline Batched Inference
With vLLM and vllm-plugin-FL installed, you can generate text for a list of input prompts (i.e., offline batched inference). See the example script: offline_inference. Alternatively, use the Python script below directly.
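A minimal sketch of offline batched inference with the standard vLLM Python API follows. The model ID `Qwen/Qwen3-4B`, the prompts, and the sampling values are assumptions chosen for illustration, and the helper name `run_offline_inference` is hypothetical. `vllm` is imported inside the function, so the module itself loads even where vLLM is not installed.

```python
from typing import List

# Example prompts (assumed for illustration).
PROMPTS: List[str] = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]


def run_offline_inference(
    prompts: List[str],
    model: str = "Qwen/Qwen3-4B",  # assumed model ID; replace with your own
    enforce_eager: bool = False,   # set True on Ascend, which requires eager mode
) -> List[str]:
    """Generate one completion per prompt and return the generated texts."""
    # Imported lazily so this file can be loaded without vLLM installed.
    from vllm import LLM, SamplingParams

    sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)
    llm = LLM(model=model, enforce_eager=enforce_eager)
    outputs = llm.generate(prompts, sampling_params)
    return [out.outputs[0].text for out in outputs]
```

To run it, call `run_offline_inference(PROMPTS)` on a machine with vLLM and vllm-plugin-FL installed, and pass `enforce_eager=True` on Ascend.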
Advanced use
For dispatch environment variable usage, see environment variables usage.
Using the CUDA communication library
To use the original CUDA communication library, unset the following environment variables.
Using native CUDA operators
If you want to use the original CUDA operators, you can set the following environment variables.