Originally published at: Enabling Fast Inference and Resilient Training with NCCL 2.27 | NVIDIA Technical Blog
As AI workloads scale, fast and reliable GPU communication becomes vital, not just for training but increasingly for inference at scale. The NVIDIA Collective Communications Library (NCCL) delivers high-performance, topology-aware collective operations (AllReduce, Broadcast, Reduce, AllGather, and ReduceScatter) optimized for NVIDIA GPUs and a variety of interconnects, including PCIe, NVLink, Ethernet (RoCE), and InfiniBand (IB).…
Nice trick to lower latency. Also nice to see an effort around reliability at scale (NCCL shrink); I am calling that challenge the next “scalability wall”. It will be interesting to see how that fault tolerance API gets leveraged by AI/HPC frameworks from an algorithm and programming standpoint… We are developing a complementary solution to that challenge that should make it easier for everyone.
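For context, the post describes a new `ncclCommShrink` call for excluding failed ranks from a communicator. Below is a minimal sketch of how a framework might wrap it; the exact signature and the `NCCL_SHRINK_ABORT` flag are my reading of the 2.27 announcement, so treat the details as assumptions rather than the definitive API:

```c
#include <nccl.h>
#include <stdio.h>
#include <stdlib.h>

// Minimal sketch: after detecting a failed rank, the surviving ranks
// collectively shrink the communicator and continue on the remaining GPUs.
// Error handling is trimmed for brevity; signature per my reading of the
// NCCL 2.27 announcement.
static ncclComm_t shrink_after_failure(ncclComm_t comm, int failedRank) {
    int excluded[1] = { failedRank };
    ncclComm_t newComm;
    // NCCL_SHRINK_ABORT is meant for the case where the old communicator
    // has outstanding/broken operations; the default mode assumes it is
    // still healthy.
    ncclResult_t rc = ncclCommShrink(comm, excluded, 1, &newComm,
                                     /*config=*/NULL, NCCL_SHRINK_ABORT);
    if (rc != ncclSuccess) {
        fprintf(stderr, "ncclCommShrink failed: %s\n", ncclGetErrorString(rc));
        exit(1);
    }
    ncclCommDestroy(comm);  // old communicator is no longer usable
    return newComm;         // surviving ranks presumably get dense new ranks
}
```

If the surviving ranks are renumbered densely in the new communicator, the framework still has to remap its rank-dependent state (data shards, parameter partitions), which is exactly the algorithm/programming question above.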
Going back to the NCCL perf numbers, it would be great to see a chart of effective allreduce bandwidth for different message sizes (64 KB, 128 KB, … 1 GB, 2 GB) at power-of-two GPU counts from 8, 16, 32, 64, … all the way to 2048 GPUs. How much does the effective bandwidth per link drop when you go from 1K to 2K GPUs at an 8 MB allreduce? And what message size is needed to flood the link (reach max bandwidth per link) at 2K GPUs for allreduce? This matters when sizing the network infrastructure at large scale, to avoid overprovisioning the network.
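For reference on how such a chart is usually normalized: the nccl-tests benchmarks report a “bus bandwidth” that scales the measured algorithm bandwidth by 2*(n-1)/n for allreduce, so numbers stay comparable across GPU counts. A small sketch of that arithmetic follows; the timings are made-up placeholders, not measurements:

```c
#include <stdio.h>

// Sketch of the "effective bus bandwidth" arithmetic used by nccl-tests:
// for a ring-style AllReduce each rank moves 2*(n-1)/n of the message
// across its link, so busBw = (bytes / seconds) * 2*(n-1)/n.
// Plotting busBw per link across message sizes and GPU counts gives
// exactly the chart asked for above.
int main(void) {
    double bytes = 8.0 * 1024 * 1024;    // 8 MB allreduce, per the question
    int nranks[] = { 1024, 2048 };
    double time_s[] = { 1.0e-4, 1.2e-4 }; // hypothetical measured times
    for (int i = 0; i < 2; i++) {
        int n = nranks[i];
        double algBw = bytes / time_s[i] / 1e9;    // GB/s (algorithm bw)
        double busBw = algBw * 2.0 * (n - 1) / n;  // GB/s per link
        printf("%d GPUs: algBw=%.1f GB/s, busBw=%.1f GB/s\n", n, algBw, busBw);
    }
    return 0;
}
```

Sweeping that busBw number over message size would also answer the flooding question: the smallest size at which busBw plateaus at the link rate is the point where the network is saturated.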