NVIDIA CUDA communications
NVIDIA addresses this interconnect issue with the NVIDIA Collective Communications Library (NCCL). NCCL provides routines such as all-gather, all-reduce, broadcast, reduce, and reduce-scatter, implemented as multi-GPU and multi-node collective communication primitives that are performance-optimized for NVIDIA GPUs and tuned to achieve high bandwidth over PCIe and NVLink high-speed interconnects. The library is topology-aware and easily integrated into your application. Initially developed as an open-source research project, NCCL is lightweight, depending only on the usual C++ and CUDA libraries.
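To make the collective primitives concrete, here is a minimal, hedged sketch of a single-process all-reduce across all visible GPUs using NCCL's host API (`ncclCommInitAll`, `ncclAllReduce`). Buffer sizes and the compile command are illustrative assumptions; error checking is omitted for brevity, and the program requires a machine with CUDA, NCCL, and at least one GPU.

```cpp
// Sketch: single-process NCCL all-reduce over all visible GPUs.
// Assumes CUDA and NCCL are installed; compile with something like:
//   nvcc nccl_allreduce.cu -lnccl -o nccl_allreduce
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <nccl.h>

int main() {
  int ndev = 0;
  cudaGetDeviceCount(&ndev);

  std::vector<ncclComm_t> comms(ndev);
  std::vector<float*> sendbuf(ndev), recvbuf(ndev);
  std::vector<cudaStream_t> streams(ndev);
  const size_t count = 1 << 20;  // elements per GPU (illustrative size)

  // Allocate buffers and a stream on each device.
  for (int i = 0; i < ndev; ++i) {
    cudaSetDevice(i);
    cudaMalloc(&sendbuf[i], count * sizeof(float));
    cudaMalloc(&recvbuf[i], count * sizeof(float));
    cudaStreamCreate(&streams[i]);
  }

  // One call creates a communicator per GPU within this process;
  // NCCL discovers the PCIe/NVLink topology internally.
  ncclCommInitAll(comms.data(), ndev, nullptr);

  // Group the per-GPU calls so NCCL can launch them as one collective.
  ncclGroupStart();
  for (int i = 0; i < ndev; ++i)
    ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                  comms[i], streams[i]);
  ncclGroupEnd();

  // Wait for completion, then clean up.
  for (int i = 0; i < ndev; ++i) {
    cudaSetDevice(i);
    cudaStreamSynchronize(streams[i]);
    cudaFree(sendbuf[i]);
    cudaFree(recvbuf[i]);
    ncclCommDestroy(comms[i]);
  }
  printf("all-reduce complete on %d GPU(s)\n", ndev);
  return 0;
}
```

Multi-node runs follow the same pattern but exchange a `ncclUniqueId` (e.g. via MPI) and call `ncclCommInitRank` per rank instead of `ncclCommInitAll`.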
- NCCL: ACCELERATED MULTI-GPU COLLECTIVE COMMUNICATIONS
- NCCL Collective Communication Pattern Basics
- Fusing Communication and Compute with New Device API and Copy Engine Collectives in NVIDIA NCCL 2.28
- Tutorial: GPU Communication Libraries for Accelerating HPC and AI Applications (YouTube)
- NVIDIA-NCCL-EXAMPLE (GitHub)
- Sample Codes using NCCL on Multi-GPU (GitHub)
- NCCL Tests (GitHub)
- Design and Implementation of MPI-Native GPU-Initiated MPI Partitioned Communication
- The GPU Communication Stack: What Happens Between GPUs When You Scale AI