[CICD] Upload unittest coverage report to FlagCICD platform & Use BAAI runner (#28)
Description
Adds coverage report collection and upload functionality to the common unit test workflow (`unit_tests_common.yml`), and updates the MetaX runner label to match the new CI node.

Type of change
- New feature (non-breaking change which adds functionality)
- Infra/Build change (changes to CI/CD workflows or build scripts)
- Code refactoring
- Documentation change
- Bug fix
- Breaking change
Changes
- Inlined coverage logic into the `Run unit tests` step: wraps the `torchrun` invocation with `coverage run --parallel`, then runs `coverage combine` + `coverage json` after tests complete, outputting results to `coverage-report/`
- Added `Upload Coverage Report` step to save the JSON report as a CI artifact via `actions/upload-artifact@v4`
- Added `Upload Coverage Report to FlagCICD` step that calls `flagos-ai/FlagOps/actions/post-pytest-report@v2` to push coverage data to the FlagCICD platform
- Updated runner label in `.github/configs/metax.yml` to `mx-4g-cicd-test`

Checklist:
- I have read and followed the contributing guidelines
- The functionality is complete
- I have commented my code, particularly in the coverage-report upload steps
- I have made corresponding changes to the documentation
- My changes generate no new warnings
- I have added/updated tests that prove my feature works on the CUDA and MetaX platforms
- New and existing unit tests pass locally on the CUDA and MetaX platforms
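A hedged sketch of the inlined coverage logic described in the Changes list (the real workflow step may differ; the test entry point below is a hypothetical placeholder, and `--parallel-mode` is the full spelling of the `--parallel` flag mentioned above):

```shell
# Wrap the distributed launch so each worker process writes its own
# .coverage.* data file (torchrun is a Python console script, so
# coverage can run it directly)
coverage run --parallel-mode "$(command -v torchrun)" \
  --nproc-per-node 8 tests/unit/run_tests.py  # hypothetical entry point

# Merge the per-process data files, then emit the JSON report
# that the upload steps pick up
coverage combine
mkdir -p coverage-report
coverage json -o coverage-report/coverage.json
```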
Megatron-LM-FL is a fork of Megatron-LM that introduces a plugin-based architecture for supporting diverse AI chips, built on top of FlagOS, a unified open-source AI system software stack.
Megatron-LM & Megatron Core
GPU-optimized library for training transformer models at scale
⚡ Quick Start
→ Complete Installation Guide - Docker, pip variants (`dev`, `lts`, etc.), source installation, and system requirements
Latest News
Previous News
Table of Contents
Getting Started
Core Features
Training
Resources
Megatron Overview
Project Structure
Megatron-LM: Reference Implementation
Reference implementation that includes Megatron Core plus everything needed to train models.
Best for:
What you get:
Megatron Core: Composable Library
Composable library with GPU-optimized building blocks for custom training frameworks.
Best for:
What you get:
Ecosystem Libraries
Libraries used by Megatron Core:
Libraries using Megatron Core:
Compatible with: Hugging Face Accelerate, Colossal-AI, DeepSpeed
Installation
🐳 Docker (Recommended)
We strongly recommend using the previous release of the PyTorch NGC Container rather than the latest one, for optimal compatibility with the Megatron Core release and its testing. Our releases are always based on the previous month's NGC container, so this ensures compatibility and stability.
Note: The NGC PyTorch container constrains the Python environment globally via PIP_CONSTRAINT. In the following examples we will unset the variable.

This container comes with all dependencies pre-installed with compatible versions and optimized configurations for NVIDIA GPUs:
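A hedged launch sketch (the container tag is an assumption based on the NGC PyTorch 24.01 release mentioned below; substitute the release matching your Megatron Core version):

```shell
# Start an NGC PyTorch container with the repository mounted
docker run --gpus all -it --rm \
  -v "$PWD":/workspace/megatron \
  nvcr.io/nvidia/pytorch:24.01-py3  # hypothetical tag; pick your release

# Inside the container, lift the global pip constraint before installing
unset PIP_CONSTRAINT
```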
Pip Installation
Megatron Core offers support for two NGC PyTorch containers:
- `dev`: Moving head that supports the most recent upstream dependencies
- `lts`: Long-term support of NGC PyTorch 24.01

Both containers can be combined with `mlm`, which adds package dependencies for Megatron-LM on top of Megatron Core.

For a version of Megatron Core with only torch, run:
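A hedged sketch of the pip variants (extras names mirror the `dev`/`lts`/`mlm` variants described above; verify them against the package metadata before relying on them):

```shell
pip install megatron-core             # Megatron Core with only torch
pip install "megatron-core[dev]"      # moving head, recent upstream deps
pip install "megatron-core[lts]"      # long-term support variant
pip install "megatron-core[dev,mlm]"  # add Megatron-LM dependencies on top
```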
System Requirements
Hardware Requirements
Software Requirements
Performance Benchmarking
For our latest performance benchmarking results, please refer to NVIDIA NeMo Framework Performance Summary.
Our codebase efficiently trains models from 2B to 462B parameters across thousands of GPUs, achieving up to 47% Model FLOP Utilization (MFU) on H100 clusters.
Benchmark Configuration:
- Communication overlap for DP (`--overlap-grad-reduce`, `--overlap-param-gather`), TP (`--tp-comm-overlap`), and PP (enabled by default)

Key Results:
Weak Scaling Results
Our weak-scaling results show superlinear scaling (MFU increases from 41% for the smallest model considered to 47-48% for the largest models); this is because larger GEMMs have higher arithmetic intensity and are consequently more efficient to execute.
Strong Scaling Results
We also strong-scaled the standard GPT-3 model (our version has slightly more than 175 billion parameters due to a larger vocabulary) from 96 H100 GPUs to 4608 GPUs, using the same batch size of 1152 sequences throughout. Communication becomes more exposed at larger scale, leading to a reduction in MFU from 47% to 42%.
Training
Getting Started
Simple Training Example
Llama-3 Training Example
Data Preparation
JSONL Data Format
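A minimal sketch of the expected input: one JSON object per line, each carrying the document text (the field name `text` is the conventional default; adjust if your pipeline uses another key):

```shell
# Write a two-document JSONL corpus
cat > corpus.jsonl <<'EOF'
{"text": "First training document."}
{"text": "Second training document."}
EOF

# Every line must parse as standalone JSON; count the valid records
python3 -c 'import json; print(sum(1 for l in open("corpus.jsonl") if json.loads(l)))'  # prints 2
```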
Basic Preprocessing
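A hedged command sketch assembled from the key arguments listed in this section (`tools/preprocess_data.py` is the usual entry point in Megatron-LM; paths and worker count are placeholders):

```shell
python tools/preprocess_data.py \
  --input corpus.jsonl \
  --output-prefix my-corpus \
  --tokenizer-type HuggingFaceTokenizer \
  --tokenizer-model /path/to/tokenizer \
  --workers 8 \
  --append-eod
# Produces my-corpus_text_document.bin and .idx for training
```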
Key Arguments
- `--input`: Path to input JSON/JSONL file
- `--output-prefix`: Prefix for output binary files (.bin and .idx)
- `--tokenizer-type`: Tokenizer type (`HuggingFaceTokenizer`, `GPT2BPETokenizer`, etc.)
- `--tokenizer-model`: Path to tokenizer model file
- `--workers`: Number of parallel workers for processing
- `--append-eod`: Add end-of-document token

Parallelism Strategies
Data Parallelism (DP)
Standard Data Parallel
Fully Sharded Data Parallel (FSDP)
Tensor Parallelism (TP)
Split individual model layers across GPUs:
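A hedged launch sketch (the flag name comes from upstream Megatron-LM; `pretrain_gpt.py` and the surrounding arguments are placeholders):

```shell
# Split each transformer layer's weights across 4 GPUs
torchrun --nproc-per-node 4 pretrain_gpt.py \
  --tensor-model-parallel-size 4
  # ...plus the usual model, data, and optimizer arguments
```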
Pipeline Parallelism (PP)
Split model depth across GPUs:
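A hedged launch sketch (flag name from upstream Megatron-LM; the launch wrapper and remaining arguments are placeholders):

```shell
# Partition the layer stack into 4 pipeline stages
torchrun --nproc-per-node 4 pretrain_gpt.py \
  --pipeline-model-parallel-size 4
  # ...plus the usual model, data, and optimizer arguments
```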
Context Parallelism (CP)
Split long sequences across GPUs for handling long contexts:
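A hedged launch sketch (flag name from upstream Megatron-LM; surrounding arguments are placeholders):

```shell
# Shard each sequence across 2 GPUs to fit longer contexts
torchrun --nproc-per-node 2 pretrain_gpt.py \
  --context-parallel-size 2
  # ...plus the usual model, data, and optimizer arguments
```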
Expert Parallelism (EP)
For Mixture of Experts (MoE) models:
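A hedged launch sketch (both flags exist in upstream Megatron-LM; the expert counts are illustrative):

```shell
# Distribute 64 experts across 8 expert-parallel ranks (8 experts per rank)
torchrun --nproc-per-node 8 pretrain_gpt.py \
  --num-experts 64 \
  --expert-model-parallel-size 8
  # ...plus the usual MoE, model, and data arguments
```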
Combining Parallelism Strategies
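When strategies are combined, the world size must factor as TP × PP × CP × DP (expert parallelism partitions the experts within the data-parallel dimension). A hedged sanity check, with the flag names from the sections above shown as comments:

```shell
# Derive the data-parallel size implied by a 1024-GPU job
WORLD_SIZE=1024; TP=8; PP=8; CP=2
DP=$(( WORLD_SIZE / (TP * PP * CP) ))
echo "data-parallel size: $DP"  # prints: data-parallel size: 8

# The corresponding launch would pass (hypothetical surrounding args):
#   --tensor-model-parallel-size $TP \
#   --pipeline-model-parallel-size $PP \
#   --context-parallel-size $CP
```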
Parallelism Selection Guide
Based on NVIDIA NeMo production configurations:
MoE-Specific Requirements
Important: When combining Expert Parallelism (EP) with Tensor Parallelism (TP), Sequence Parallelism (SP) must be enabled.
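A hedged flag sketch of the required combination (flag names from upstream Megatron-LM; sizes are illustrative, chosen so EP divides the data-parallel dimension on 8 GPUs):

```shell
torchrun --nproc-per-node 8 pretrain_gpt.py \
  --tensor-model-parallel-size 2 \
  --expert-model-parallel-size 4 \
  --sequence-parallel
  # ...plus MoE arguments such as --num-experts
```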
Performance Optimizations
- `--attention-backend`
- `--fp8-hybrid`
- `--recompute-activations`
- `--overlap-grad-reduce`
- `--use-distributed-optimizer`

→ NVIDIA NeMo Framework Performance Tuning Guide - Comprehensive performance optimization guide covering advanced tuning techniques, communication overlaps, memory optimizations, and profiling options.
FlashAttention
FlashAttention is a fast and memory-efficient attention algorithm. We recommend the default usage, which uses cuDNN attention via Transformer Engine and provides up to 50% speedups on forward and 84% on backward propagation with FP8 kernels. The `flash-attn` package is also supported via `--use-flash-attn`.

Mixed Precision Training
Activation Checkpointing and Recomputation
Data Parallelism Communication Overlap
Distributed Optimizer
Roadmaps
Stay up-to-date with our development roadmaps and planned features:
More roadmap trackers will be added soon.
Community & Support
Getting Help
Contributing
We ❤️ contributions! Ways to contribute:
→ Contributing Guide
Citation