[hardware, fsdp, rollout, recipe] feat: add MUSA platform support with FlagCX heterogeneous communication (#10)
Add Moore Threads MUSA platform support for heterogeneous CUDA+MUSA distributed training via FlagCX communication backend.
What does this PR do?
Enable heterogeneous distributed training across NVIDIA GPU and Moore Threads MUSA nodes using FlagCX as the unified communication backend. This includes platform abstraction, device isolation, and weight synchronization.
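As a rough illustration of the platform abstraction, the sketch below picks a communication backend name from a device type. The function name `pick_comm_backend` and the `flagcx_available` flag are hypothetical and are not taken from this PR; only the backend names FlagCX, NCCL, and HCCL come from the description. Preferring FlagCX whenever it is available is an assumption made for the example.

```python
def pick_comm_backend(device_type: str, flagcx_available: bool) -> str:
    """Return a communication backend name for a device type (illustrative only)."""
    # Native backends named in the PR description. MUSA has no entry here
    # because the PR routes MUSA traffic through FlagCX.
    native = {"cuda": "nccl", "npu": "hccl"}
    if flagcx_available:
        # Assumption: FlagCX is preferred whenever it has been detected.
        return "flagcx"
    if device_type in native:
        return native[device_type]
    raise ValueError(f"no communication backend known for {device_type!r}")
```

With these assumptions, a MUSA node with FlagCX installed resolves to `flagcx`, while a CUDA node without it falls back to `nccl`.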
Architecture
┌─────────────────────────┐              ┌─────────────────────────┐
│      NVIDIA Nodes       │              │       MUSA Nodes        │
│    (Actor / Critic)     │◄── FlagCX ──►│    (Rollout / vLLM)     │
│       FSDP + NCCL       │              │       torch_musa        │
└─────────────────────────┘              └─────────────────────────┘
Platform Layer
- verl/plugin/platform/platform_musa.py: New MUSA platform with FlagCX auto-detection
- verl/plugin/platform/platform_cuda.py: CUDA platform FlagCX auto-detection
- verl/plugin/platform/platform_base.py, platform_npu.py, platform_manager.py: Platform abstraction extensions
- verl/__init__.py: Platform initialization at import time
- verl/utils/device.py: Device utilities (is_musa_available, get_nccl_backend, etc.)
- verl/utils/torch_functional.py: Device-compatible tensor operations
FlagCX Communication
- verl/utils/distributed.py: vllm_stateless_init_process_group selects the FlagCX/NCCL/HCCL communicator based on platform; compound backend support in initialize_global_process_group
- verl/utils/flagcx_communicator.py: Re-exports PyFlagcxCommunicator from vllm-plugin-FL. Patches torch.musa.Stream to expose a cuda_stream attribute for FlagCX wrapper compatibility on MUSA devices.
- verl/workers/fsdp_workers.py: Compound backend format (cpu:gloo,device:flagcx) for init_process_group calls
- verl/workers/engine/fsdp/utils.py: Standard init_device_mesh for device mesh creation
Worker Device Isolation
- verl/single_controller/base/worker.py: MUSA workers keep all devices visible and use set_device(physical_index) instead of MUSA_VISIBLE_DEVICES. Falls back to the Ray runtime context for device assignment when CUDA_VISIBLE_DEVICES is not set.
One-Step Off-Policy Recipe
- recipe/one_step_off_policy/ray_trainer.py: Use torch.distributed-based weight sync for FlagCX (Ray collective only supports nccl/gloo)
- recipe/one_step_off_policy/fsdp_workers.py: Use _weight_sync_group for broadcast when available (FlagCX/NPU), fall back to Ray collective
- recipe/one_step_off_policy/distributed_util.py: Simplified distributed utilities for heterogeneous setup
Status
- Platform plugin and device detection
- FlagCX weight sync communicator (re-exported from vllm-plugin-FL)
- MUSA Stream compatibility patch for FlagCX wrapper
- Worker device isolation (Ray runtime context fallback)
- Ray resource mapping for non-CUDA platforms
- FSDP compound backend support
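The compound backend format mentioned above (cpu:gloo,device:flagcx) can be sketched as a small string builder. The helper name `compound_backend` is hypothetical and is not from this PR. The sketch also assumes that a backend named `flagcx` has already been registered with torch.distributed, which is why the `init_process_group` call is shown only as a comment.

```python
def compound_backend(device_type: str, device_backend: str, cpu_backend: str = "gloo") -> str:
    """Build a 'cpu:<backend>,<device>:<backend>' string for torch.distributed."""
    # CPU tensors go through the CPU backend (gloo by default), while
    # accelerator tensors go through the device backend.
    return f"cpu:{cpu_backend},{device_type}:{device_backend}"


# Hypothetical usage, assuming torch is installed and "flagcx" is registered:
# import torch.distributed as dist
# dist.init_process_group(backend=compound_backend("musa", "flagcx"))
```

Keeping gloo for CPU tensors means host-side collectives do not depend on the accelerator backend.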
Test
Tested on a heterogeneous cluster with NVIDIA GPU nodes and Moore Threads MUSA nodes (8 devices each) using the one_step_off_policy PPO recipe with FlagCX as the unified communication backend.
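The worker device isolation fallback described in this PR can be approximated as follows. This is a minimal sketch, not the PR's code: the function name `resolve_device_index` and its arguments are hypothetical, the device IDs are assumed to be plain integers, and the Ray-assigned IDs are passed in as a list so that the example runs without Ray.

```python
from typing import Mapping, Sequence


def resolve_device_index(env: Mapping[str, str], ray_accelerator_ids: Sequence[str]) -> int:
    """Pick the physical device index a worker should pass to set_device."""
    visible = env.get("CUDA_VISIBLE_DEVICES", "").strip()
    if visible:
        # Prefer the explicit environment assignment when it is present.
        ids = [part.strip() for part in visible.split(",") if part.strip()]
    else:
        # Otherwise fall back to the IDs assigned by the scheduler
        # (in the PR this comes from the Ray runtime context).
        ids = list(ray_accelerator_ids)
    if not ids:
        raise RuntimeError("no device assignment found")
    # Assumption: IDs are integer indices, and the first one is the worker's device.
    return int(ids[0])
```

Because all devices stay visible on MUSA workers, the returned value is a physical index rather than an index into a filtered device list.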
Checklist Before Submitting
- Read the Contribute Guide
- Apply pre-commit checks
- Add / Update the documentation
- Add unit or end-to-end test(s)
Co-authored-by: Claude Opus 4.6 noreply@anthropic.com
verl-FL
verl-FL is a fork of verl designed to support diverse AI accelerators. It is built on top of FlagOS, a unified open-source AI system software stack, and integrates key components including the training engines Megatron-LM-FL and Transformer-Engine-FL, as well as the inference engine vllm-plugin-FL.
verl: Volcano Engine Reinforcement Learning for LLMs
verl is a flexible, efficient and production-ready RL training library for large language models (LLMs).
verl is the open-source version of the paper HybridFlow: A Flexible and Efficient RLHF Framework.
verl is flexible and easy to use with:
Easy extension of diverse RL algorithms: The hybrid-controller programming model enables flexible representation and efficient execution of complex post-training dataflows. Build RL dataflows such as GRPO, PPO in a few lines of code.
Seamless integration of existing LLM infra with modular APIs: Decouples computation and data dependencies, enabling seamless integration with existing LLM frameworks such as FSDP, Megatron-LM, vLLM, and SGLang.
Flexible device mapping: Supports various placements of models onto different sets of GPUs for efficient resource utilization and scalability across different cluster sizes.
Ready integration with popular HuggingFace models
verl is fast with:
State-of-the-art throughput: SOTA LLM training and inference engine integrations and SOTA RL throughput.
Efficient actor model resharding with 3D-HybridEngine: Eliminates memory redundancy and significantly reduces communication overhead during transitions between training and generation phases.
News
Key Features
Upcoming Features and Changes
Getting Started
Documentation
Quickstart:
Running a PPO example step-by-step:
Reproducible algorithm baselines:
For code explanation and advanced usage (extension):
PPO Trainer and Workers
Advanced Usage and Extension
Blogs from the community
Performance Tuning Guide
Performance is essential for on-policy RL algorithms. We have written a detailed performance tuning guide to help you optimize it.
Upgrade to vLLM >= v0.8.2
verl now supports vLLM>=0.8.2 when using FSDP as the training backend. Please refer to this document for the installation guide and more information. Please avoid vLLM 0.7.x, which contains bugs that may lead to OOMs and unexpected errors.
Use Latest SGLang
SGLang is fully supported with verl, and SGLang RL Group is working extensively on building unique features, including multi-turn agentic RL, VLM RLHF, server-based RL, and partial rollout. Please refer to this document for the installation guide and more information.
Upgrade to FSDP2
verl is fully embracing FSDP2! FSDP2 is recommended by the torch distributed team, provides better throughput and memory usage, and is composable with other features (e.g. torch.compile). To enable FSDP2, simply use verl main and set the following options:
Furthermore, FSDP2 cpu offloading is compatible with gradient accumulation. You can turn it on to save memory with
actor_rollout_ref.actor.fsdp_config.offload_policy=True. For more details, see https://github.com/volcengine/verl/pull/1026
AMD Support (ROCm Kernel)
verl now supports FSDP as the training engine (Megatron support coming soon) and integrates with both vLLM and SGLang as inference engines. Please refer to this document for the installation guide and more information, and this document for vLLM performance tuning for ROCm.
Citation and acknowledgement
If you find the project helpful, please cite:
verl is inspired by the design of Nemo-Aligner, Deepspeed-chat and OpenRLHF. The project is adopted and contributed to by Bytedance, Anyscale, LMSys.org, Alibaba Qwen team, Shanghai AI Lab, Tsinghua University, UC Berkeley, UCLA, UIUC, University of Hong Kong, ke.com, All Hands AI, ModelBest, JD AI Lab, Microsoft Research, StepFun, Amazon, LinkedIn, Meituan, Camel-AI, OpenManus, Xiaomi, NVIDIA research, Baichuan, RedNote, SwissAI, Moonshot AI (Kimi), Baidu, Snowflake, Skywork.ai, JetBrains, IceSword Lab, and many more.
Awesome work using verl
Many more awesome works are listed in recipe.
Contribution Guide
See contributions guide
About ByteDance Seed Team
Founded in 2023, ByteDance Seed Team is dedicated to crafting the industry’s most advanced AI foundation models. The team aspires to become a world-class research team and make significant contributions to the advancement of science and society. You can get to know Bytedance Seed better through the following channels👇
We are HIRING! Send us an email if you are interested in internship/FTE opportunities in RL for agents.