Scaling Deep Learning Training with NCCL

Originally published at:

The NVIDIA Collective Communications Library (NCCL) provides optimized implementations of inter-GPU communication operations, such as all-reduce and its variants. Developers using deep learning frameworks can rely on NCCL’s highly optimized, MPI-compatible, topology-aware routines to take full advantage of all available GPUs within and across multiple nodes. NCCL is optimized for high bandwidth and low latency…
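To make the all-reduce operation concrete, here is a minimal pure-Python sketch of the ring all-reduce communication pattern that bandwidth-optimal libraries like NCCL build on. This is an illustration only, not NCCL's actual GPU implementation: `ring_allreduce` and its chunk layout are hypothetical names introduced here. Each of the `n` ranks contributes one value per chunk, and after a reduce-scatter phase followed by an all-gather phase, every rank holds the elementwise sum.

```python
def ring_allreduce(data):
    """Simulate ring all-reduce: data[r][c] is rank r's value for chunk c.

    Returns the per-rank buffers after the reduction; every rank ends up
    with the elementwise sum across ranks. Illustrative sketch only.
    """
    n = len(data)
    chunks = [list(d) for d in data]

    # Reduce-scatter phase: in step s, rank r sends chunk (r - s) % n to its
    # right neighbor, which accumulates it. Sends are snapshotted so all
    # ranks logically communicate simultaneously, as on a real ring.
    for s in range(n - 1):
        sends = [(r, (r - s) % n, chunks[r][(r - s) % n]) for r in range(n)]
        for r, c, val in sends:
            chunks[(r + 1) % n][c] += val
    # Now rank r holds the fully reduced chunk (r + 1) % n.

    # All-gather phase: circulate the reduced chunks around the ring so
    # every rank receives every fully reduced chunk.
    for s in range(n - 1):
        sends = [(r, (r + 1 - s) % n, chunks[r][(r + 1 - s) % n]) for r in range(n)]
        for r, c, val in sends:
            chunks[(r + 1) % n][c] = val
    return chunks
```

Each rank sends and receives only `2 * (n - 1)` chunks regardless of ring size, which is why the ring pattern keeps per-GPU bandwidth roughly constant as more GPUs are added.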


My cluster is equipped with both a PCIe switch and NVLink. As I understand it, NCCL automatically chooses the reduction method according to the topology. How can I explicitly choose the interconnect, PCIe or NVLink, using a knob?
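One way the knobs in question can be exercised is through NCCL's documented environment variables, which must be set before the framework initializes NCCL. A hedged sketch, assuming a framework such as PyTorch picks NCCL up afterward; the launch step itself is omitted:

```python
# Sketch: steer NCCL's transport selection via documented environment
# variables. These must be set in the process environment before NCCL
# initializes (e.g. before torch.distributed.init_process_group).
import os

# NCCL_P2P_DISABLE=1 disables peer-to-peer transports (both NVLink and
# PCIe P2P), forcing NCCL to fall back to shared-host-memory copies.
os.environ["NCCL_P2P_DISABLE"] = "1"

# Alternatively, NCCL_P2P_LEVEL caps how "far" P2P may be used, e.g.
# "NVL" restricts P2P to GPU pairs connected by NVLink.
# os.environ["NCCL_P2P_LEVEL"] = "NVL"

# NCCL_DEBUG=INFO makes NCCL log the rings and paths it actually selected,
# so you can verify which interconnect is in use.
os.environ["NCCL_DEBUG"] = "INFO"
```

Checking the `NCCL_DEBUG=INFO` output after changing a knob is the practical way to confirm which interconnect NCCL actually selected.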

J su