Supports MLLM, mmDIT, and ViT (#10)
huge refactor of subgraph; separate the compute and optimize strategies
different optimize & compute strategies run successfully; rethinking the use of the comm stream; subgraph & concat memory optimization still needs to be done for the grad-update sub-topo
refactor pre & post run, support subgraph run with a callback function, fix shape-mismatch bugs, support a single communicator, deprecate the placement-group-related attributes of op (all moved to tensor), new comm type judgement (WIP: different union sizes)
e2e run, workaround: non-integer splitting
tencent env
change planning to multi-process instead of multi-thread, support fixed-token data loader, packing seqs in descending order
[merge from elastic] change planning to multi-process instead of multi-thread, support fixed-token data loader, packing seqs in descending order
add scripts
tencent hydraulis done
zhiyuan hydraulis done
merge precision alignment branch
hydraulis runs successfully, precision seems normal
merge elastic engine into hetu python and refactor
support kv store via grpc and change layernorm to rmsnorm
support producer consumer model on top of kv store
refactor hydraulis planning procedure (integrate with producer-consumer model and kv store) and add hetu python logging
remove figure drawing .py
remove env path
remove more env path
support mllm and mmdit
rename & refactor python files, fix inplace op precision bugs
fix homo cp bwd bugs, no nan
Some amendments to the MLLM.
Improve the implementation of MLLM and DiT.
Add some files missed during the merge process.
Reverted some unnecessary changes.
Reverted some unnecessary changes.
Reverted some unnecessary changes.
Reverted some unnecessary changes.
Fix path and Chinese log issues.
Ensure cub.h is only compiled once.
Ensure CUDABlas.h is only compiled once.
Co-authored-by: lhy101 <2000012918@stu.pku.edu.cn>
Co-authored-by: Fizzmy <fizzmyc@gmail.com>
Co-authored-by: User <user@example.com>
HETU
Documentation | Examples
Hetu is a high-performance distributed deep learning system targeting trillion-parameter DL model training, developed by the DAIR Lab at Peking University. It takes into account both high availability in industry and innovation in academia.
This is a preview of Hetu 2.0, which is still under rapid development. Please raise an issue if you need any help.
We welcome everyone interested in machine learning or graph computing to contribute code, create issues, or open pull requests. Please refer to the Contribution Guide for more details.
Key Features
Installation
1. Clone the repository.
2. Prepare the environment. We use Anaconda to manage packages; please prepare the CUDA toolkit, cuDNN, and gRPC in advance. The following command creates the conda environment to be used: conda env create -f environment.yml
3. We use CMake to compile Hetu. Copy the example configuration for compilation: cp cmake/config.example.cmake cmake/config.cmake. Users can modify the configuration file to enable or disable the compilation of each module. For advanced users (who are not using the provided conda environment), the prerequisites for the different modules of Hetu are listed in the appendix.
4. Run source hetu.exp to load the Hetu environment variables.
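The end-to-end flow roughly looks like the sketch below. The repository URL placeholder, the conda environment name, and the CMake build commands are assumptions based on a standard workflow rather than instructions from this README; adjust them to your setup.

```bash
# Rough sketch of the installation steps above; the repository URL,
# environment name, and build commands are assumptions, not taken
# verbatim from the Hetu documentation.
git clone <hetu-repo-url> hetu
cd hetu

# CUDA toolkit, cuDNN, and gRPC must already be installed.
conda env create -f environment.yml
conda activate hetu            # assumed environment name

# Copy and optionally edit the compilation configuration
# (enable/disable individual modules here).
cp cmake/config.example.cmake cmake/config.cmake

# Assumed standard out-of-source CMake build.
mkdir -p build && cd build
cmake ..
make -j"$(nproc)"
cd ..

# Load Hetu's environment variables.
source hetu.exp
```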
Community
Enterprise Users
If you are an enterprise user and find Hetu useful in your work, please let us know, and we will be glad to add your company's logo here.
License
The entire codebase is under license
Papers
We have proposed numerous innovative optimization techniques around the Hetu system and published several papers covering a variety of model workloads and hardware environments.
Transformer Model & Large Language Model
Mixture-of-experts Model
Embedding Model
Diffusion Model
Graph Neural Network
Decentralized Heterogeneous Resources
GPU Kernel
Memory Management
coming soon…
Cite
If you use Hetu in a scientific publication, we would appreciate citations to the following papers:
Acknowledgements
We learned and borrowed insights from a few open-source projects, including TinyFlow, autodist, tf.distribute, FlexFlow, and Angel.
Appendix
The prerequisites for the different modules in Hetu are listed as follows: