

    Extreme Speed and Scale for DL Training

    DeepSpeed enabled the world's most powerful language models (at the time of this writing) such as MT-530B and BLOOM. DeepSpeed offers a confluence of system innovations that have made large-scale DL training effective and efficient, greatly improved ease of use, and redefined the DL training landscape in terms of the scale that is possible. These innovations include ZeRO, ZeRO-Infinity, 3D-Parallelism, Ulysses Sequence Parallelism, DeepSpeed-MoE, and more.


    DeepSpeed Adoption

    DeepSpeed was an important part of Microsoft's AI at Scale initiative to enable next-generation AI capabilities at scale; you can find more information here.

    DeepSpeed has been used to train many different large-scale models; below is a list of several examples that we are aware of (if you'd like to include your model, please submit a PR):

    DeepSpeed has been integrated with several different popular open-source DL frameworks such as:

    • Transformers with DeepSpeed
    • Accelerate with DeepSpeed
    • Lightning with DeepSpeed
    • MosaicML with DeepSpeed
    • Determined with DeepSpeed
    • MMEngine with DeepSpeed

    Build Pipeline Status

    Description         Status
    NVIDIA              nv-pre-compile-ops, aws-torch-latest
    AMD                 amd-mi200
    CPU                 torch-latest-cpu
    Intel Gaudi         hpu-gaudi2
    Intel XPU           xpu-max1100
    Integrations        aws-accelerate
    Misc                Formatting, pages-build-deployment, Documentation Status, python
    Huawei Ascend NPU   Huawei Ascend NPU

    Installation

    The quickest way to get started with DeepSpeed is via pip; this will install the latest release of DeepSpeed, which is not tied to specific PyTorch or CUDA versions. DeepSpeed includes several C++/CUDA extensions that we commonly refer to as our 'ops'. By default, all of these extensions/ops are built just-in-time (JIT) using torch's JIT C++ extension loader, which relies on ninja to build and dynamically link them at runtime.

    Requirements

    • PyTorch must be installed before installing DeepSpeed.
    • For full feature support we recommend a version of PyTorch that is >= 1.9, and ideally the latest PyTorch stable release (a quick version check is sketched after this list).
    • A CUDA or ROCm compiler such as nvcc or hipcc is needed to compile the C++/CUDA/HIP extensions.
    • Specific GPUs we develop and test against are listed below. This doesn't mean your GPU will not work if it doesn't fall into this category; it's just that DeepSpeed is most well tested on the following:
      • NVIDIA: Pascal, Volta, Ampere, and Hopper architectures
      • AMD: MI100 and MI200
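
    A quick way to check the installed PyTorch version and its CUDA build before installing DeepSpeed:

    import torch

    print(torch.__version__)   # PyTorch version; >= 1.9 is recommended above
    print(torch.version.cuda)  # CUDA version PyTorch was built with (None for CPU-only builds)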

    Contributed HW support

    • DeepSpeed now supports various HW accelerators.

    Contributor  Hardware                             Accelerator Name  Contributor validated  Upstream validated
    Huawei       Huawei Ascend NPU                    npu               Yes                    No
    Intel        Intel(R) Gaudi(R) 2 AI accelerator   hpu               Yes                    Yes
    Intel        Intel(R) Xeon(R) Processors          cpu               Yes                    Yes
    Intel        Intel(R) Data Center GPU Max series  xpu               Yes                    Yes
    Tecorigin    Scalable Data Analytics Accelerator  sdaa              Yes                    No
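
    At runtime, the active accelerator can be queried through DeepSpeed's accelerator abstraction; a minimal sketch (assuming DeepSpeed is installed):

    from deepspeed.accelerator import get_accelerator

    # Prints the backend DeepSpeed selected for this machine,
    # e.g. "cuda", "npu", "hpu", "xpu", or "cpu".
    print(get_accelerator().device_name())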

    PyPI

    We regularly push releases to PyPI and encourage users to install from there in most cases.

    pip install deepspeed

    After installation, you can validate your install and see which extensions/ops your machine is compatible with via the DeepSpeed environment report.

    ds_report

    If you would like to pre-install any of the DeepSpeed extensions/ops (instead of JIT compiling), or install pre-compiled ops via PyPI, please see our advanced installation instructions.
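
    For example, setting the DS_BUILD_OPS environment variable at install time attempts to pre-build all compatible ops (assuming nvcc and ninja are available in the build environment):

    DS_BUILD_OPS=1 pip install deepspeed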

    Windows

    Many DeepSpeed features are supported on Windows for both training and inference. You can read more about this in the original blog post here. Features that are currently not supported include async I/O (AIO) and GDS (which does not support Windows).

    1. Install PyTorch, such as pytorch 2.3+cu121.
    2. Install Visual C++ build tools, such as VS2022 C++ x64/x86 build tools.
    3. Launch a Cmd console with administrator permissions (required for creating symlink folders) and ensure the MSVC tools are added to your PATH, or launch the Developer Command Prompt for Visual Studio 2022 with administrator permissions.
    4. Run build_win.bat to build a wheel in the dist folder (see the sketch after these steps).
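
    As a sketch of the last step, building and then installing the wheel (the wheel filename placeholder below is hypothetical; use the name actually produced in dist):

    build_win.bat
    pip install dist\deepspeed-<version>-<tags>.whl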

    Further Reading

    All DeepSpeed documentation, tutorials, and blogs can be found on our website: deepspeed.ai

    • Getting Started: First steps with DeepSpeed
    • DeepSpeed JSON Configuration: Configuring DeepSpeed
    • API Documentation: Generated DeepSpeed API documentation
    • Tutorials
    • Blogs
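
    As a minimal sketch of what those first steps look like in code (hypothetical model and dataloader defined elsewhere; the config values are illustrative, not recommendations; see the Getting Started guide and JSON configuration docs for real settings):

    import deepspeed

    # Illustrative config; real options are documented in the DeepSpeed JSON configuration docs.
    ds_config = {
        "train_batch_size": 8,
        "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
        "zero_optimization": {"stage": 2},
    }

    # Wrap an existing PyTorch model; the returned engine manages the distributed details.
    model_engine, optimizer, _, _ = deepspeed.initialize(
        model=model,                          # assumed: a torch.nn.Module defined elsewhere
        model_parameters=model.parameters(),
        config=ds_config,
    )

    for batch in dataloader:                  # assumed: a dataloader defined elsewhere
        loss = model_engine(batch)            # forward pass (assumes the model returns its loss)
        model_engine.backward(loss)           # backward with DeepSpeed's loss scaling
        model_engine.step()                   # optimizer step and gradient zeroing

    Such a script is typically launched with the deepspeed launcher, e.g. deepspeed train.py.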

    CI funding

    As an open-source project, we rely on others to provide resources for CI hardware. At this moment, Modal is kindly supporting our GPU CI runs by funding the hardware for us. Modal is an AI infrastructure platform for inference, fine-tuning, batch jobs, and more; get started with $30/mo in free credits at https://modal.com. We have received amazing support from Modal's team and gladly recommend them for your business.

    Contributing

    DeepSpeed welcomes your contributions! Please see our contributing guide for more details on formatting, testing, etc.
    Thanks so much to all of our amazing contributors!

    Developer Certificate of Origin

    This project welcomes contributions and suggestions. Most contributions require you to agree to a Developer Certificate of Origin (DCO) stating that you agree to the terms published at https://developercertificate.org for that particular contribution.

    DCOs are per-commit, so each commit needs to be signed off. Commits can be signed off by adding the -s flag to git commit. A DCO can also be signed off in the PR itself by clicking on the DCO enforcement check.
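
    For example (with an illustrative commit message):

    git commit -s -m "Fix typo in README"

    The -s flag appends a Signed-off-by line using your configured git name and email.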

    Code of Conduct

    This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

    Publications

    1. Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He. (2019) ZeRO: memory optimizations toward training trillion parameter models. arXiv:1910.02054 and In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC ‘20).

    2. Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. (2020) DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ‘20, Tutorial).

    3. Minjia Zhang, Yuxiong He. (2020) Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping. arXiv:2010.13369 and NeurIPS 2020.

    4. Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, Yuxiong He. (2021) ZeRO-Offload: Democratizing Billion-Scale Model Training. arXiv:2101.06840 and USENIX ATC 2021. [paper] [slides] [blog]

    5. Hanlin Tang, Shaoduo Gan, Ammar Ahmad Awan, Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, Yuxiong He. (2021) 1-bit Adam: Communication Efficient Large-Scale Training with Adam’s Convergence Speed. arXiv:2102.02888 and ICML 2021.

    6. Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, Yuxiong He. (2021) ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning. arXiv:2104.07857 and SC 2021. [paper] [slides] [blog]

    7. Conglong Li, Ammar Ahmad Awan, Hanlin Tang, Samyam Rajbhandari, Yuxiong He. (2021) 1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB’s Convergence Speed. arXiv:2104.06069 and HiPC 2022.

    8. Conglong Li, Minjia Zhang, Yuxiong He. (2021) The Stability-Efficiency Dilemma: Investigating Sequence Length Warmup for Training GPT Models. arXiv:2108.06084 and NeurIPS 2022.

    9. Yucheng Lu, Conglong Li, Minjia Zhang, Christopher De Sa, Yuxiong He. (2022) Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam. arXiv:2202.06009.

    10. Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, Yuxiong He. (2022) DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale arXiv:2201.05596 and ICML 2022. [pdf] [slides] [blog]

    11. Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, Elton Zhang, Rewon Child, Reza Yazdani Aminabadi, Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary, Bryan Catanzaro. (2022) Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model arXiv:2201.11990.

    12. Xiaoxia Wu, Zhewei Yao, Minjia Zhang, Conglong Li, Yuxiong He. (2022) Extreme Compression for Pre-trained Transformers Made Simple and Efficient. arXiv:2206.01859 and NeurIPS 2022.

    13. Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, Yuxiong He. (2022) ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers. arXiv:2206.01861 and NeurIPS 2022 [slides] [blog]

    14. Reza Yazdani Aminabadi, Samyam Rajbhandari, Minjia Zhang, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Jeff Rasley, Shaden Smith, Olatunji Ruwase, Yuxiong He. (2022) DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale. arXiv:2207.00032 and SC 2022. [paper] [slides] [blog]

    15. Zhewei Yao, Xiaoxia Wu, Conglong Li, Connor Holmes, Minjia Zhang, Cheng Li, Yuxiong He. (2022) Random-LTD: Random and Layerwise Token Dropping Brings Efficient Training for Large-scale Transformers. arXiv:2211.11586.

    16. Conglong Li, Zhewei Yao, Xiaoxia Wu, Minjia Zhang, Yuxiong He. (2022) DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing. arXiv:2212.03597 and ENLSP2023 Workshop at NeurIPS2023.

    17. Xiaoxia Wu, Cheng Li, Reza Yazdani Aminabadi, Zhewei Yao, Yuxiong He. (2023) Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases. arXiv:2301.12017 and ICML2023.

    18. Syed Zawad, Cheng Li, Zhewei Yao, Elton Zheng, Yuxiong He, Feng Yan. (2023) DySR: Adaptive Super-Resolution via Algorithm and System Co-design. ICLR:2023.

    19. Sheng Shen, Zhewei Yao, Chunyuan Li, Trevor Darrell, Kurt Keutzer, Yuxiong He. (2023) Scaling Vision-Language Models with Sparse Mixture of Experts. arXiv:2303.07226 and Finding at EMNLP2023.

    20. Quentin Anthony, Ammar Ahmad Awan, Jeff Rasley, Yuxiong He, Aamir Shafi, Mustafa Abduljabbar, Hari Subramoni, Dhabaleswar Panda. (2023) MCR-DL: Mix-and-Match Communication Runtime for Deep Learning arXiv:2303.08374 and will appear at IPDPS 2023.

    21. Siddharth Singh, Olatunji Ruwase, Ammar Ahmad Awan, Samyam Rajbhandari, Yuxiong He, Abhinav Bhatele. (2023) A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training arXiv:2303.06318 and ICS 2023.

    22. Guanhua Wang, Heyang Qin, Sam Ade Jacobs, Xiaoxia Wu, Connor Holmes, Zhewei Yao, Samyam Rajbhandari, Olatunji Ruwase, Feng Yan, Lei Yang, Yuxiong He. (2023) ZeRO++: Extremely Efficient Collective Communication for Giant Model Training arXiv:2306.10209 and ML for Sys Workshop at NeurIPS2023 [blog]

    23. Zhewei Yao, Xiaoxia Wu, Cheng Li, Stephen Youn, Yuxiong He. (2023) ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation arXiv:2303.08302 and ENLSP2023 Workshop at NeurIPS2023 [slides]

    24. Pareesa Ameneh Golnari, Zhewei Yao, Yuxiong He. (2023) Selective Guidance: Are All the Denoising Steps of Guided Diffusion Important? arXiv:2305.09847

    25. Zhewei Yao, Reza Yazdani Aminabadi, Olatunji Ruwase, Samyam Rajbhandari, Xiaoxia Wu, Ammar Ahmad Awan, Jeff Rasley, Minjia Zhang, Conglong Li, Connor Holmes, Zhongzhu Zhou, Michael Wyatt, Molly Smith, Lev Kurilenko, Heyang Qin, Masahiro Tanaka, Shuai Che, Shuaiwen Leon Song, Yuxiong He. (2023) DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales arXiv:2308.01320.

    26. Xiaoxia Wu, Zhewei Yao, Yuxiong He. (2023) ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats arXiv:2307.09782 and ENLSP2023 Workshop at NeurIPS2023 [slides]

    27. Zhewei Yao, Xiaoxia Wu, Conglong Li, Minjia Zhang, Heyang Qin, Olatunji Ruwase, Ammar Ahmad Awan, Samyam Rajbhandari, Yuxiong He. (2023) DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention arXiv:2309.14327

    28. Shuaiwen Leon Song, Bonnie Kruft, Minjia Zhang, Conglong Li, Shiyang Chen, Chengming Zhang, Masahiro Tanaka, Xiaoxia Wu, Jeff Rasley, Ammar Ahmad Awan, Connor Holmes, Martin Cai, Adam Ghanem, Zhongzhu Zhou, Yuxiong He, et al. (2023) DeepSpeed4Science Initiative: Enabling Large-Scale Scientific Discovery through Sophisticated AI System Technologies arXiv:2310.04610 [blog]

    29. Zhewei Yao, Reza Yazdani Aminabadi, Stephen Youn, Xiaoxia Wu, Elton Zheng, Yuxiong He. (2023) ZeroQuant-HERO: Hardware-Enhanced Robust Optimized Post-Training Quantization Framework for W8A8 Transformers arXiv:2310.17723

    30. Xiaoxia Wu, Haojun Xia, Stephen Youn, Zhen Zheng, Shiyang Chen, Arash Bakhtiari, Michael Wyatt, Reza Yazdani Aminabadi, Yuxiong He, Olatunji Ruwase, Leon Song, Zhewei Yao (2023) ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks arXiv:2312.08583

    31. Haojun Xia, Zhen Zheng, Xiaoxia Wu, Shiyang Chen, Zhewei Yao, Stephen Youn, Arash Bakhtiari, Michael Wyatt, Donglin Zhuang, Zhongzhu Zhou, Olatunji Ruwase, Yuxiong He, Shuaiwen Leon Song. (2024) FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design arXiv:2401.14112

    32. Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Reza Yazdani Aminadabi, Shuaiwen Leon Song, Samyam Rajbhandari, Yuxiong He. (2024) System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

    33. Xinyu Lian, Sam Ade Jacobs, Lev Kurilenko, Masahiro Tanaka, Stas Bekman, Olatunji Ruwase, Minjia Zhang. (2024) Universal Checkpointing: Efficient and Flexible Checkpointing for Large Scale Distributed Training arXiv:2406.18820

    34. Stas Bekman, Samyam Rajbhandari, Michael Wyatt, Jeff Rasley, Tunji Ruwase, Zhewei Yao, Aurick Qiao, Yuxiong He. (2025) Arctic Long Sequence Training: Scalable And Efficient Training For Multi-Million Token Sequences arXiv:2506.13996

    35. Tingfeng Lan, Yusen Wu, Bin Ma, Zhaoyuan Su, Rui Yang, Tekin Bicer, Masahiro Tanaka, Olatunji Ruwase, Dong Li, Yue Cheng. (2025) ZenFlow: Enabling Stall-Free Offloading Training via Asynchronous Updates arXiv:2505.12242

    36. Xinyu Lian, Masahiro Tanaka, Olatunji Ruwase, Minjia Zhang. (2026) SuperOffload: Unleashing the Power of Large-Scale LLM Training on Superchips. arXiv and ASPLOS 2026.

    Videos

    1. DeepSpeed KDD 2020 Tutorial
      1. Overview
      2. ZeRO + large model training
      3. 17B T-NLG demo
      4. Fastest BERT training + RScan tuning
      5. DeepSpeed hands on deep dive: part 1, part 2, part 3
      6. FAQ
    2. Microsoft Research Webinar
    3. DeepSpeed on AzureML
    4. Large Model Training and Inference with DeepSpeed // Samyam Rajbhandari // LLMs in Prod Conference [slides]
    5. Community Tutorials