[train]: fix bug of deepseek_v4 support pr. (#47)
Description
This PR adds a piece that was missed in the previous deepseek_v4 support PR. Without it, the MoE forward pass raises an error. It is currently used only in the MoE bias_act_func.
Type of change
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Infra/Build change (changes to CI/CD workflows or build scripts)
- [ ] Code refactoring
- [ ] Documentation change
- [x] Bug fix
- [ ] Breaking change
Checklist
- I have read and followed the contributing guidelines
- The functionality is complete
- I have commented my code, particularly in coverage report uploading steps
- I have made corresponding changes to the documentation
- My changes generate no new warnings
- I have added/updated tests that prove my feature works
- New and existing unit tests pass locally
Megatron-LM-FL is a fork of Megatron-LM that introduces a plugin-based architecture for supporting diverse AI chips, built on top of FlagOS, a unified open-source AI system software stack.
Megatron-LM and Megatron Core
GPU-optimized library for training transformer models at scale
About
This repository contains two components: Megatron-LM and Megatron Core.
Megatron-LM is a reference example that includes Megatron Core plus pre-configured training scripts. Best for research teams, learning distributed training, and quick experimentation.
Megatron Core is a composable library with GPU-optimized building blocks for custom training frameworks. It provides transformer building blocks, advanced parallelism strategies (TP, PP, DP, EP, CP), mixed precision support (FP16, BF16, FP8, FP4), and model architectures. Best for framework developers and ML engineers building custom training pipelines.
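As a rough illustration of how the parallelism degrees listed above fit together, the sketch below computes the data-parallel size left over once tensor (TP), pipeline (PP), and context (CP) parallel sizes are chosen. The helper name `data_parallel_size` is hypothetical and not part of Megatron Core; it only encodes the usual constraint that the total GPU count must be divisible by TP × PP × CP.

```python
def data_parallel_size(world_size: int, tp: int, pp: int, cp: int = 1) -> int:
    """Return the data-parallel degree implied by the other parallel sizes.

    Hypothetical helper for illustration only; not a Megatron Core API.
    """
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by TP*PP*CP={model_parallel}"
        )
    return world_size // model_parallel


if __name__ == "__main__":
    # 64 GPUs with TP=8 and PP=4 leave 2 data-parallel replicas.
    print(data_parallel_size(64, tp=8, pp=4))
```

Raising an error on a non-divisible configuration mirrors the common failure mode when launching a job with inconsistent parallel sizes.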
Megatron Bridge provides bidirectional Hugging Face ↔ Megatron checkpoint conversion with production-ready recipes.
Getting Started
Install from PyPI:
Or clone and install from source:
For NGC container setup and all installation options, see the Installation Guide.
Latest News
Previous News
Project Structure
Performance Benchmarking
For our latest performance benchmarking results, please refer to NVIDIA Megatron Bridge Performance Summary.
Our codebase efficiently trains models from 2B to 462B parameters across thousands of GPUs, achieving up to 47% Model FLOP Utilization (MFU) on H100 clusters.
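For readers unfamiliar with the metric, MFU is the ratio of the FLOPs a training run actually spends on the model to the theoretical peak of the hardware. The sketch below is a minimal estimate, assuming the common approximation of 6 FLOPs per parameter per token for a forward and backward pass (it ignores attention FLOPs and may differ from the accounting used for the numbers quoted here). The function name and the example inputs are hypothetical placeholders, not measured values.

```python
def model_flops_utilization(
    num_params: float,
    tokens_per_second: float,
    num_gpus: int,
    peak_flops_per_gpu: float,
) -> float:
    """Estimate MFU as achieved model FLOP/s divided by aggregate peak FLOP/s.

    Assumes ~6 FLOPs per parameter per token (forward + backward),
    which ignores attention and other non-GEMM costs.
    """
    achieved_flops_per_second = 6.0 * num_params * tokens_per_second
    peak_flops_per_second = num_gpus * peak_flops_per_gpu
    return achieved_flops_per_second / peak_flops_per_second


if __name__ == "__main__":
    # Placeholder inputs: substitute your model size, measured throughput,
    # and the peak FLOP/s from your hardware's datasheet.
    mfu = model_flops_utilization(
        num_params=1e9,
        tokens_per_second=1e5,
        num_gpus=1,
        peak_flops_per_gpu=1e15,
    )
    print(f"MFU: {mfu:.1%}")
```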
Benchmark Configuration:
- Communication overlap: DP (--overlap-grad-reduce, --overlap-param-gather), TP (--tp-comm-overlap), and PP (enabled by default)
Key Results:
Weak Scaling Results
Our weak scaled results show superlinear scaling (MFU increases from 41% for the smallest model considered to 47-48% for the largest models); this is because larger GEMMs have higher arithmetic intensity and are consequently more efficient to execute.
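The arithmetic-intensity argument can be made concrete with a small calculation. The sketch below uses a simplified model in which each matrix is read or written exactly once in a 2-byte format such as BF16; real kernels tile and reuse data differently, so treat the numbers as indicative only. The function name is hypothetical.

```python
def gemm_arithmetic_intensity(m: int, n: int, k: int, bytes_per_element: int = 2) -> float:
    """FLOPs per byte for C[m, n] = A[m, k] @ B[k, n].

    Simplified model: A and B are read once and C is written once.
    """
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


if __name__ == "__main__":
    # For square matrices the intensity grows linearly with the dimension.
    for size in (1024, 4096, 16384):
        intensity = gemm_arithmetic_intensity(size, size, size)
        print(f"{size}: {intensity:.1f} FLOPs/byte")
```

Under this model a square GEMM of dimension n has intensity n/3 FLOPs per byte, so quadrupling the dimension quadruples the work done per byte moved.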
Strong Scaling Results
We also strong scaled the standard GPT-3 model (our version has slightly more than 175 billion parameters due to larger vocabulary size) from 96 H100 GPUs to 4608 GPUs, using the same batch size of 1152 sequences throughout. Communication becomes more exposed at larger scale, leading to a reduction in MFU from 47% to 42%.
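To put the strong-scaling numbers in perspective, the sketch below converts the two MFU figures into a speedup and a scaling efficiency. It assumes that throughput is proportional to GPU count times MFU, which holds when the model and hardware are unchanged; the function name is hypothetical.

```python
def strong_scaling(base_gpus: int, base_mfu: float, scaled_gpus: int, scaled_mfu: float):
    """Return (ideal speedup, estimated actual speedup, scaling efficiency).

    Assumes throughput is proportional to num_gpus * MFU.
    """
    ideal = scaled_gpus / base_gpus
    actual = ideal * (scaled_mfu / base_mfu)
    return ideal, actual, actual / ideal


if __name__ == "__main__":
    ideal, actual, efficiency = strong_scaling(96, 0.47, 4608, 0.42)
    print(f"ideal {ideal:.0f}x, actual {actual:.1f}x, efficiency {efficiency:.1%}")
    # The fixed batch of 1152 sequences is spread ever thinner as GPUs are added.
    print(f"sequences per GPU: {1152 / 96:g} -> {1152 / 4608:g}")
```

With the figures quoted above, this gives roughly a 43x speedup against an ideal of 48x, or about 89% scaling efficiency, while the per-GPU share of the batch drops from 12 sequences to 0.25.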
Roadmaps
Resources
Getting Help
Contributing
We ❤️ contributions! Ways to contribute:
→ Contributing Guide
Citation
If you use Megatron in your research or project, we would appreciate it if you cited the following: