Support empty mode of vllm in PPU (#190)

PR Category

Core

PR Type

New Features

Description

Add _C op schema registration and PPU support

- Add _C_ops_registry.py to register torch.ops._C op schemas when the native vllm._C extension is unavailable, enabling torch.compile pattern matching on non-CUDA platforms.
- Bundle schemas as a pre-generated Python module (_C_ops_schemas.py) instead of extracting from source at runtime. Add tools/generate_op_schemas.py for regeneration on vLLM upgrades.
- Patch vllm.vllm_flash_attn import to stub gracefully when CUDA flash attention extensions are missing.
- Add thead (PPU) backend to supported device map.
Testing:
Verified torch.compile + CUDAGraph capture works for Qwen3-4B, Qwen3.6-27B, and Qwen3.6-35B-A3B.
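
The "stub gracefully" idea from the description can be illustrated with a minimal, standard-library-only sketch. It is not the code from this PR: the helper `import_or_stub`, the `is_stub` marker, and the module name `hypothetical_flash_attn_ext` are made up for illustration, and the actual patch may detect and replace the missing extension differently.

```python
import importlib
import sys
import types


def import_or_stub(name, stub_attrs=None):
    """Import `name`; if that fails, register and return a stub module.

    Hypothetical helper for illustration only, not the plugin's actual code.
    """
    try:
        return importlib.import_module(name)
    except ImportError:
        stub = types.ModuleType(name)
        stub.__dict__.update(stub_attrs or {})
        stub.is_stub = True  # lets callers detect that the fallback was used
        sys.modules[name] = stub  # later `import name` statements now succeed
        return stub


def _unavailable(*args, **kwargs):
    # Fail at call time rather than at import time.
    raise RuntimeError("flash attention extension is not available on this platform")


# `hypothetical_flash_attn_ext` stands in for a CUDA-only extension module.
ext = import_or_stub(
    "hypothetical_flash_attn_ext",
    {"flash_attn_varlen_func": _unavailable},
)
```

With this pattern, code that only imports the module keeps working on platforms without the extension, and an error surfaces only if a stubbed function is actually called.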

vllm-plugin-FL

vllm-plugin-FL is a plugin for the vLLM inference/serving framework, built on FlagOS’s unified multi-chip backend — including the unified operator library FlagGems and the unified communication library FlagCX. It extends vLLM’s capabilities and performance across diverse hardware environments. Without changing vLLM’s original interfaces or usage patterns, the same command can run model inference/serving on different chips.

Supported Models and Chips

In theory, vllm-plugin-FL can support all models available in vLLM, as long as no unsupported operators are involved. The tables below summarize the current support status of end-to-end verified models and chips, including both fully supported and in-progress (“Merging”) entries.

Supported Models

Model	Status	Reference
Qwen3.5-397B-A17B	Supported	example
Qwen3-Next-80B-A3B	Supported	example
Qwen3-4B	Supported	example
MiniCPM-o 4.5	Supported	example
GLM-5	Supported	example
Qwen3.5-35B-A3B	Supported	example
BAAI/bge-m3	Supported	implementation
MiniMax-M2.7	Supported	implementation

Supported Chips

Chip Vendor	Status	Reference
NVIDIA	Supported	-
Ascend	Supported	-
MetaX	Supported	-
Pingtouge-Zhenwu	Supported	-
Iluvatar	Supported	-
Tsingmicro	Merging	PR #52
Moore Threads	Supported	-
Hygon	Supported	-
Sunrise	Supported	-

Quick Start

Setup

Install vLLM from the official v0.20.2 release (skip this step if the correct version is already installed)

Install vllm-plugin-FL

2.1 Clone the repository:

git clone https://github.com/flagos-ai/vllm-plugin-FL

2.2 Install:

cd vllm-plugin-FL
pip install --no-build-isolation .
# or editable install
pip install --no-build-isolation -e .

Install FlagGems

3.1 Install Build Dependencies

pip install -U scikit-build-core==0.11 pybind11 ninja cmake

3.2 Install FlagGems

git clone https://github.com/flagos-ai/FlagGems
cd FlagGems
git checkout v5.0.0
pip install --no-build-isolation .
# or editable install
pip install --no-build-isolation -e .

Note: on the Sunrise platform, this step depends on FlagGems PR #2949; on the Hygon platform, it depends on FlagGems [PR #3477](https://github.com/flagos-ai/FlagGems/pull/3477).

(Optional) Install FlagCX

4.1 Clone the repository:

git clone https://github.com/flagos-ai/FlagCX.git
cd FlagCX
git checkout v0.9.0
git submodule update --init --recursive

4.2 Build the library with the flag that matches your target platform, for example:

make USE_NVIDIA=1

4.3 Set the environment variable

export FLAGCX_PATH="$PWD"

4.4 Install FlagCX

cd plugin/torch/
FLAGCX_ADAPTOR=[xxx] pip install . --no-build-isolation
# or editable install
FLAGCX_ADAPTOR=[xxx] pip install -e . --no-build-isolation

Note: [xxx] should be selected according to the current platform, e.g., nvidia, ascend, etc.

If multiple plugins are installed in the current environment, you can select vllm-plugin-FL by setting VLLM_PLUGINS='fl'.
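
When vLLM is launched from a Python script rather than a shell, the same selection can be made by setting the variable before vLLM is imported. The following is a minimal sketch, assuming the variable is read in the same process when plugins are loaded; `select_plugins` is a hypothetical helper, not part of vLLM or vllm-plugin-FL.

```python
import os


def select_plugins(names, env=None):
    """Set VLLM_PLUGINS to a comma-separated list of plugin names.

    Hypothetical helper; `env` defaults to the current process environment.
    """
    env = os.environ if env is None else env
    env["VLLM_PLUGINS"] = ",".join(names)
    return env["VLLM_PLUGINS"]


# Restrict plugin loading to vllm-plugin-FL; call this before `import vllm`.
select_plugins(["fl"])
```

Passing a plain dict as `env` makes it possible to build the environment for a subprocess without touching the current one.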

Additional Steps for Ascend

Install FlagTree

RES="--index-url=https://resource.flagos.net/repository/flagos-pypi-hosted/simple --trusted-host=https://resource.flagos.net"
python3 -m pip install flagtree==0.4.0+ascend3.2 $RES

Set required environment variable
```
export TRITON_ALL_BLOCKS_PARALLEL=1
```
Enable eager execution

Ascend requires eager execution. Add enforce_eager=True to the LLM constructor or pass --enforce-eager on the command line.
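
One way to keep a single script usable on both Ascend and other chips is to add the flag conditionally. This is a sketch only: `build_llm_kwargs` and `make_llm` are hypothetical helpers, and deciding whether the script is running on Ascend (the `on_ascend` argument) is left to the caller.

```python
def build_llm_kwargs(model, on_ascend=False, **extra):
    """Assemble keyword arguments for vllm.LLM (hypothetical helper)."""
    kwargs = {"model": model, **extra}
    if on_ascend:
        # Ascend requires eager execution, as noted above.
        kwargs["enforce_eager"] = True
    return kwargs


def make_llm(model, on_ascend=False, **extra):
    # Imported lazily so build_llm_kwargs can be used without vLLM installed.
    from vllm import LLM

    return LLM(**build_llm_kwargs(model, on_ascend, **extra))


kwargs = build_llm_kwargs("Qwen/Qwen3-4B", on_ascend=True)
```

Calling `make_llm("Qwen/Qwen3-4B", on_ascend=True)` would then construct the engine with `enforce_eager=True`; on other platforms the flag is simply omitted.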

Run a Task

Offline Batched Inference

With vLLM and vllm-plugin-FL installed, you can generate text for a list of input prompts (offline batched inference). See the example script offline_inference, or use the Python script below directly.

from vllm import LLM, SamplingParams


if __name__ == '__main__':
    prompts = [
        "Hello, my name is",
    ]
    # Create a sampling params object.
    sampling_params = SamplingParams(max_tokens=10, temperature=0.0)
    # Create an LLM.
    llm = LLM(model="Qwen/Qwen3-4B", max_num_batched_tokens=16384, max_num_seqs=2048)
    # Generate texts from the prompts.
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Advanced use

For dispatch environment variable usage, see environment variables usage.

Using the CUDA communication library

To use the original CUDA communication library instead of FlagCX, unset the following environment variable.

unset FLAGCX_PATH

Using native CUDA operators

To use the original CUDA operators instead of FlagGems, set the following environment variable.

export USE_FLAGGEMS=0
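
The two switches in this section can also be applied from Python, for example when preparing the environment for a subprocess. This is a minimal sketch under the assumption that both variables are read when vLLM starts; `use_native_cuda` is a hypothetical helper, and the path in the demo is a placeholder.

```python
import os


def use_native_cuda(env=None, operators=True, communication=True):
    """Adjust `env` to fall back to native CUDA operators and communication.

    Hypothetical helper; `env` defaults to the current process environment.
    """
    env = os.environ if env is None else env
    if communication:
        env.pop("FLAGCX_PATH", None)  # same effect as `unset FLAGCX_PATH`
    if operators:
        env["USE_FLAGGEMS"] = "0"  # same effect as `export USE_FLAGGEMS=0`
    return env


# Demonstrated on a plain dict so the real environment is left untouched.
demo = use_native_cuda({"FLAGCX_PATH": "/path/to/FlagCX", "USE_FLAGGEMS": "1"})
```

Called without arguments, the helper changes `os.environ` directly, so it should run before vLLM is imported.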