gin: add abstract base classes and rename EP
Add abstract GIN base classes in nccl_ofi_gin_base.h: nccl_ofi_gin_ep_t, nccl_ofi_gin_listen_comm_t, nccl_ofi_gin_put_comm_t, nccl_ofi_gin_symm_mr_handle_t, and nccl_ofi_gin_req_t with a default test() that asserts.
Rename the concrete GIN EP from nccl_ofi_gin_ep_t to nccl_ofi_rdma_gin_ep_t and have it inherit from the abstract base. Concrete listen_comm, put_comm, symm_mr_handle, and base_req also inherit from their respective abstract bases.
Remove gin_cq_process_max_iter global variable and cache the value on the GIN EP at construction time.
Remove gin/nccl_ofi_gin_types.h include from nccl_ofi.h and add it to gin_reqs.h for self-containment.
Signed-off-by: Hershel Shah <hershys@amazon.com>
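The pattern this commit describes can be sketched roughly as follows. This is not the plugin's code: every name in the sketch namespace, the process_cq signature, and its pending parameter are hypothetical stand-ins chosen for illustration. The sketch only shows the three ideas from the commit message: an abstract base with a concrete RDMA subclass, a request base whose default test() asserts, and a per-endpoint iteration cap cached at construction instead of read from a global.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

namespace sketch {  // hypothetical names, not the plugin's real types

// Abstract request: the default test() asserts, so a request type that is
// never meant to be polled fails loudly if someone polls it anyway.
class gin_req {
public:
    virtual ~gin_req() = default;

    virtual int test(bool &done)
    {
        assert(false && "test() is not supported by this request type");
        done = false;
        return -1;
    }
};

// Abstract endpoint: callers hold this type and never name the transport.
class gin_ep {
public:
    virtual ~gin_ep() = default;

    // Process at most an implementation-defined number of the `pending`
    // completions and return how many were processed.
    virtual std::size_t process_cq(std::size_t pending) = 0;
};

// Concrete request that overrides test() and reports completion.
class rdma_gin_req : public gin_req {
public:
    int test(bool &done) override
    {
        done = true;
        return 0;
    }
};

// Concrete endpoint: the iteration cap is passed in once at construction
// and cached, rather than being looked up in a global on every poll.
class rdma_gin_ep : public gin_ep {
public:
    explicit rdma_gin_ep(std::size_t cq_process_max_iter)
        : cq_process_max_iter_(cq_process_max_iter) {}

    std::size_t process_cq(std::size_t pending) override
    {
        std::size_t processed = 0;
        while (processed < pending && processed < cq_process_max_iter_) {
            ++processed;  // stand-in for handling one completion
        }
        return processed;
    }

private:
    const std::size_t cq_process_max_iter_;
};

}  // namespace sketch
```

Code that holds a sketch::gin_ep pointer works with any transport that derives from it, and different endpoints can carry different caps without touching shared state, which is one plausible motivation for dropping the global.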
AWS OFI NCCL
AWS OFI NCCL is a plug-in which enables EC2 developers to use libfabric as a network provider while running NVIDIA’s NCCL based applications.
This plug-in also has support for libfabric as a network provider while running AMD’s RCCL based applications.
Overview
Machine learning frameworks running on top of NVIDIA GPUs use a library called NCCL which provides standard collective communication routines for an arbitrary number of GPUs installed across single or multiple nodes.
This project implements a plug-in which maps NCCL’s connection-oriented transport APIs to libfabric’s connection-less reliable interface. This allows NCCL applications to take advantage of libfabric’s transport layer services, such as reliable message support and operating system bypass.
Getting Started
The best way to build the plugin is to start with the latest release package. The plugin developers highly discourage customers from building directly from the HEAD of a GitHub branch, as releases go through more extensive testing than the pre-commit testing on git branches. More information about installing the plugin from a released tarball can be found in INSTALL.md.
Version numbers that end in -aws have only been tested on Amazon Web Services Elastic Compute Cloud (EC2) instances and the Elastic Fabric Adapter (EFA) network transport. Customers using other networks may experience unexpected issues with these releases, but we welcome bug reports if that is the case.
Basic Requirements
The plugin is regularly tested on the following operating systems:
Other operating systems are likely to work, but are not included in our regular regression testing. If you find an issue unique to another operating system, GitHub issues or (better yet) patches are appreciated.
To build the plugin, you need Libfabric and HWLOC installed. If you want to run the included multi-node tests, you also need an MPI implementation installed. Each release of the plugin has a list of dependency versions in the top-level README.md file.
The plugin does not require NCCL to be pre-installed, but an NCCL installation is required to use the plugin. As of NCCL 2.4.8, it is possible to use the same plugin build across multiple versions of NCCL (such as those installed per-package with Conda-like environments).
Most Libfabric providers should work with the plugin, possibly through a utility provider. The plugin generally requires reliable datagram endpoints (FI_EP_RDM) with tagged messaging (FI_TAGGED, FI_MSG). This is similar to the requirements of most MPI implementations and a generally tested path in Libfabric. For GPUDirect RDMA support, the plugin also requires FI_HMEM support, as well as RDMA support.
Getting Help
If you have any issues in building or using the package or if you think you may have found a bug, please open an issue.
Contributing
Reporting issues and sending pull requests are always welcome. To learn how you can contribute, please look at our contributing guidelines.
License
This library is licensed under the Apache 2.0 License.