Tensorpipe based Van implementation for TCP (#111)
tensorpipe based van impl
cross-process memory channel for intra-machine communication
add an option to use another tp context for receiving
add ZPushPull
Revert “add ZPushPull”
This reverts commit 29bf91f8f0b896f0972fcc4fc03e26e028dcd37f.
improve build & always use recv_ctx & re-enable cma channel
fix use without tp by adding ‘CFLAGS += -DDMLC_USE_TP’
add DMLC_USE_RECVCTX back
Co-authored-by: ziyuehuang <ziyuehuang@tencent.com>
This is the communication library for BytePS. It is designed for high-performance RDMA, but it also supports TCP.
Build
Omit USE_RDMA=1 from the build command if you don’t want to build with RDMA ibverbs support.
Add USE_FABRIC=1 if you want to build with RDMA libfabric support for AWS Elastic Fabric Adapter.
To build ps-lite with UCX:
BytePS relies on UCXVan for GPU-related communication, such as intra-node CUDA IPC and inter-node GPU-to-GPU / GPU-to-CPU communication with GPUDirect RDMA. For the list of transports UCX supports, see link.
Concepts
In ps-lite, there are three roles: worker, server and scheduler. Each role is an independent process.
The scheduler is responsible for setting up the connections between workers and servers at initialization. There should be only 1 scheduler process.
A worker process only communicates with server processes, and vice versa. There is no worker-to-worker or server-to-server traffic.
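The communication rule above can be sketched as a small model. This is only an illustration of the topology described here, not code from ps-lite; the `Role` enum and both helper functions are hypothetical names introduced for the example.

```python
from enum import Enum


class Role(Enum):
    """The three process roles described above (hypothetical helper type)."""
    WORKER = "worker"
    SERVER = "server"
    SCHEDULER = "scheduler"


def may_exchange_data(a: Role, b: Role) -> bool:
    """Return True if data traffic is expected between roles a and b.

    Only worker<->server pairs carry data traffic. The scheduler takes part
    in connection setup only, which this helper does not model.
    """
    return {a, b} == {Role.WORKER, Role.SERVER}


def check_cluster(roles: list) -> None:
    """Raise ValueError unless the role list contains exactly one scheduler."""
    schedulers = sum(1 for r in roles if r is Role.SCHEDULER)
    if schedulers != 1:
        raise ValueError(f"expected exactly 1 scheduler, got {schedulers}")
```

For example, `may_exchange_data(Role.WORKER, Role.SERVER)` is true, while a worker paired with another worker, or a server with another server, is rejected.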
Tutorial
After the build, you will have two test applications under the tests/ directory, namely test_benchmark and test_ipc_benchmark. Below we explain how to run them.
To debug, set PS_VERBOSE=1 to see important logs during connection setup, and PS_VERBOSE=2 to see a log of each message.
1. Basic benchmark
Suppose you want to run with 1 worker and 1 server on different machines. You then need to launch 3 processes in total (including the scheduler). You can launch the scheduler process on any machine, as its placement does not affect performance.
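The process layout described above can be sketched as a small helper that builds a per-role environment. This is a sketch under assumptions, not the project's launch script: the variable names DMLC_ROLE, DMLC_NUM_WORKER, DMLC_NUM_SERVER, DMLC_PS_ROOT_URI and DMLC_PS_ROOT_PORT are the ones ps-lite conventionally reads but are not stated in this section, and the address and port defaults are placeholders you must replace with your scheduler's values.

```python
import os

VALID_ROLES = ("scheduler", "server", "worker")


def total_processes(num_workers: int, num_servers: int) -> int:
    """Workers plus servers plus the single scheduler."""
    return num_workers + num_servers + 1


def role_env(role, num_workers=1, num_servers=1,
             root_uri="10.0.0.2", root_port=8123, base_env=None):
    """Build the environment for one process of the given role.

    root_uri and root_port are placeholder values for the scheduler's
    address; the DMLC_* names are assumptions (see the text above).
    """
    if role not in VALID_ROLES:
        raise ValueError(f"unknown role: {role!r}")
    # Start from the caller's environment unless an explicit base is given.
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "DMLC_ROLE": role,
        "DMLC_NUM_WORKER": str(num_workers),
        "DMLC_NUM_SERVER": str(num_servers),
        "DMLC_PS_ROOT_URI": root_uri,
        "DMLC_PS_ROOT_PORT": str(root_port),
    })
    return env
```

Each of the three processes would then be started with its own environment, for instance with `subprocess.run(["./tests/test_benchmark"], env=role_env("worker"))` on the worker machine. All processes must agree on the worker and server counts and on the scheduler address.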
For the scheduler:
For the server:
For the worker:
If you want to use libfabric with Amazon Elastic Fabric Adapter, make sure to set DMLC_ENABLE_RDMA=fabric for all processes. If you are using libfabric < 1.10, please also set FI_EFA_ENABLE_SHM_TRANSFER=0 to avoid a bug in the EFA shm provider.
If you just want to use TCP, make sure to unset DMLC_ENABLE_RDMA for all processes.
2. Benchmark with IPC support
The test_ipc_benchmark demonstrates how inter-process communication (IPC) helps improve RDMA performance when the server is co-located with the worker.
Suppose you have two machines. Each machine should launch a worker and a server process.
For the scheduler: (you can launch it on either machine-0 or machine-1)
For machine-0 and machine-1:
Note: This benchmark is only valid for RDMA.
3. Other GPU-related benchmarks