BytePS
BytePS is a high-performance, general-purpose distributed training framework. It supports TensorFlow, Keras, PyTorch, and MXNet, and can run over either TCP or RDMA networks.
BytePS outperforms existing open-source distributed training frameworks by a large margin. For example, on BERT-large training, BytePS achieves ~90% scaling efficiency with 256 GPUs (see below), much higher than Horovod+NCCL. In certain scenarios, BytePS can double the training speed compared with Horovod+NCCL.
News
Use bpslaunch as the command to launch tasks.
BytePS can now be installed via pip: pip3 install byteps
Performance
We show our experiments on BERT-large training, based on the GluonNLP toolkit. The model uses mixed precision.
We use Tesla V100 32GB GPUs with a batch size of 64 per GPU. Each machine has 8 V100 GPUs (32GB memory) with NVLink enabled. Machines are interconnected by a 100 Gbps RDMA network. This is the same hardware setup you can get on AWS.
BytePS achieves ~90% scaling efficiency for BERT-large with 256 GPUs. The code is available here. As a comparison, Horovod+NCCL reaches only ~70% scaling efficiency even after expert parameter tuning.
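For reference, scaling efficiency here means measured aggregate throughput divided by the ideal linear-scaling throughput (single-GPU throughput times GPU count). A minimal sketch of that calculation, with throughput numbers that are purely hypothetical and not from this benchmark:

```python
# Scaling efficiency = observed aggregate throughput / ideal linear throughput.
# The sample numbers below are hypothetical, for illustration only.
def scaling_efficiency(throughput_n_gpus, n_gpus, throughput_1_gpu):
    return throughput_n_gpus / (n_gpus * throughput_1_gpu)

# e.g. if one GPU sustains 50 samples/s and 256 GPUs together sustain
# 11520 samples/s, the scaling efficiency is 0.9, i.e. ~90%.
print(scaling_efficiency(11520.0, 256, 50.0))
```

An efficiency of 1.0 would mean perfectly linear scaling; communication overhead is what pushes real systems below that.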
With slower networks, BytePS offers even larger performance advantages, up to 2x over Horovod+NCCL. You can find more evaluation results in performance.md.
Goodbye MPI, Hello Cloud
How can BytePS outperform Horovod by so much? One of the main reasons is that BytePS is designed for cloud and shared clusters, and throws away MPI.
MPI was born in the HPC world and works well for a cluster built with homogeneous hardware that runs a single job. However, the cloud (or an in-house shared cluster) is different.
This leads us to rethink the best communication strategy, as explained here. In short, BytePS uses NCCL only inside a machine, and re-implements the inter-machine communication.
BytePS also incorporates many acceleration techniques such as hierarchical strategy, pipelining, tensor partitioning, NUMA-aware local communication, priority-based scheduling, etc.
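As a rough illustration of two of these ideas, tensor partitioning and the push-pull pattern can be sketched in plain Python. This is a toy model under our own assumptions, not BytePS internals; the function names are ours:

```python
# Toy sketch (not BytePS internals): partition a flat gradient so each
# chunk can be handled by a different parameter server, then sum each
# chunk across workers independently -- the essence of push_pull.
def partition(grad, num_parts):
    # split into nearly equal contiguous chunks
    size = (len(grad) + num_parts - 1) // num_parts
    return [grad[i:i + size] for i in range(0, len(grad), size)]

def push_pull(worker_grads, num_servers):
    # each "server" sums its chunk over all workers independently,
    # then the summed chunks are concatenated into the full gradient
    parts = [partition(g, num_servers) for g in worker_grads]
    summed = [[sum(vals) for vals in zip(*chunk)] for chunk in zip(*parts)]
    return [x for chunk in summed for x in chunk]

# two workers, each with a 4-element gradient, split over 2 servers
print(push_pull([[1, 2, 3, 4], [10, 20, 30, 40]], 2))  # [11, 22, 33, 44]
```

Partitioning lets the chunks be summed in parallel on different servers, which is also what enables priority-based scheduling of the most urgent tensors first.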
Quick Start
We provide a step-by-step tutorial for running benchmark training tasks. The simplest way to start is with our docker images. Refer to Documentations for how to launch distributed jobs and for more detailed configuration. Once you can run BytePS, read best practice to get the best performance.
Below, we explain how to install BytePS by yourself. There are two options.
Install by pip
Build from source code
You can try out the latest features by directly installing from master branch:
Notes for the above two options:
If NCCL is not in the default location, set export BYTEPS_NCCL_HOME=/path/to/nccl. By default it points to /usr/local/nccl.
If your compiler is too old, run yum install devtoolset-7 before everything else. In general, we recommend using gcc 4.9 for best compatibility (how to pin gcc).
Examples
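For example, a build environment that uses a custom NCCL location might be configured like this. The path is illustrative; adjust it for your machine:

```shell
# Illustrative environment setup; adjust the path for your installation.
export BYTEPS_NCCL_HOME=/usr/local/nccl   # where NCCL headers and libs live
echo "$BYTEPS_NCCL_HOME"                  # prints /usr/local/nccl
```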
Basic examples are provided under the example folder.
To reproduce the end-to-end evaluation in our OSDI’20 paper, find the code at this repo.
Use BytePS in Your Code
Though totally different at its core, BytePS is highly compatible with the Horovod interface (thank you, Horovod community!). We chose the Horovod interface to minimize the effort needed to try BytePS.
If your tasks only rely on Horovod’s allreduce and broadcast, you should be able to switch to BytePS in one minute. Simply replace import horovod.tensorflow as hvd with import byteps.tensorflow as bps, and then replace all hvd in your code with bps. If your code invokes hvd.allreduce directly, you should also replace it with bps.push_pull.
Many of our examples were copied from Horovod and modified in this way. For instance, compare the MNIST example for BytePS and Horovod.
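The substitutions above are purely mechanical. As a sketch, they could be expressed as a small helper that rewrites source text; the helper is hypothetical and not part of BytePS:

```python
# Hypothetical helper illustrating the mechanical Horovod -> BytePS
# substitutions described above; real porting is usually done by hand.
def horovod_to_byteps(src):
    src = src.replace("import horovod.tensorflow as hvd",
                      "import byteps.tensorflow as bps")
    src = src.replace("hvd.allreduce", "bps.push_pull")  # renamed API
    return src.replace("hvd.", "bps.")                   # remaining calls

print(horovod_to_byteps("import horovod.tensorflow as hvd\nhvd.init()"))
# import byteps.tensorflow as bps
# bps.init()
```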
BytePS also supports other native APIs, e.g., PyTorch Distributed Data Parallel and TensorFlow Mirrored Strategy. See DistributedDataParallel.md and MirroredStrategy.md for usage.
Limitations and Future Plans
BytePS does not support pure CPU training for now. One reason is that the cheap-PS assumption of BytePS does not hold for CPU training. Consequently, you need CUDA and NCCL to build and run BytePS.
There are additional features we would like to have, and there is no fundamental difficulty in implementing them in the BytePS architecture. However, they are not implemented yet.
Publications
[OSDI’20] “A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters”. Yimin Jiang, Yibo Zhu, Chang Lan, Bairen Yi, Yong Cui, Chuanxiong Guo.
[SOSP’19] “A Generic Communication Scheduler for Distributed DNN Training Acceleration”. Yanghua Peng, Yibo Zhu, Yangrui Chen, Yixin Bao, Bairen Yi, Chang Lan, Chuan Wu, Chuanxiong Guo. (Code is at the bytescheduler branch.)