Originally published at: Enabling Fast Inference and Resilient Training with NCCL 2.27 | NVIDIA Technical Blog
As AI workloads scale, fast and reliable GPU communication becomes vital, not just for training but increasingly for inference at scale. The NVIDIA Collective Communications Library (NCCL) delivers high-performance, topology-aware collective operations (AllReduce, Broadcast, Reduce, AllGather, and ReduceScatter) optimized for NVIDIA GPUs and a variety of interconnects, including PCIe, NVLink, Ethernet (RoCE), and InfiniBand (IB).…
Nice trick to lower latency. Also nice to see an effort around reliability at scale (NCCL shrink); I am calling that challenge the next “scalability wall”. It will be interesting to see how that fault tolerance API gets leveraged by AI/HPC frameworks from an algorithm and programming standpoint… We are developing a complementary solution to that challenge that should make it easier for everyone.
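For context, the post describes a new `ncclCommShrink` call for excluding failed ranks from a communicator. Below is a minimal sketch of how a framework might wrap it; the exact signature and the `NCCL_SHRINK_ABORT` flag are my reading of the 2.27 announcement, so treat the details as assumptions rather than the definitive API:

```c
#include <nccl.h>
#include <stdio.h>
#include <stdlib.h>

// Minimal sketch: after detecting a failed rank, the surviving ranks
// collectively shrink the communicator and continue on the remaining GPUs.
// Error handling is trimmed for brevity; signature per my reading of the
// NCCL 2.27 announcement.
static ncclComm_t shrink_after_failure(ncclComm_t comm, int failedRank) {
    int excluded[1] = { failedRank };
    ncclComm_t newComm;
    // NCCL_SHRINK_ABORT is meant for the case where the old communicator
    // has outstanding/broken operations; the default mode assumes it is
    // still healthy.
    ncclResult_t rc = ncclCommShrink(comm, excluded, 1, &newComm,
                                     /*config=*/NULL, NCCL_SHRINK_ABORT);
    if (rc != ncclSuccess) {
        fprintf(stderr, "ncclCommShrink failed: %s\n", ncclGetErrorString(rc));
        exit(1);
    }
    ncclCommDestroy(comm);  // old communicator is no longer usable
    return newComm;         // surviving ranks presumably get dense new ranks
}
```

If the surviving ranks are renumbered densely in the new communicator, the framework still has to remap its rank-dependent state (data shards, parameter partitions), which is exactly the algorithm/programming question above.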
Going back to the NCCL perf numbers, it would be great to see a chart of effective allreduce bandwidth for different message sizes (64 KB, 128 KB, … 1 GB, 2 GB) at power-of-two GPU counts from 8, 16, 32, 64, … all the way to 2048 GPUs. How much does the effective bandwidth per link drop when you go from 1K to 2K GPUs at an 8 MB allreduce? And what message size is needed to flood the link (reach max bandwidth per link) at 2K GPUs for allreduce? This matters when sizing the network infrastructure at large scale, to avoid overprovisioning the network.
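For reference on how such a chart is usually normalized: the nccl-tests benchmarks report a “bus bandwidth” that scales the measured algorithm bandwidth by 2*(n-1)/n for allreduce, so numbers stay comparable across GPU counts. A small sketch of that arithmetic follows; the timings are made-up placeholders, not measurements:

```c
#include <stdio.h>

// Sketch of the "effective bus bandwidth" arithmetic used by nccl-tests:
// for a ring-style AllReduce each rank moves 2*(n-1)/n of the message
// across its link, so busBw = (bytes / seconds) * 2*(n-1)/n.
// Plotting busBw per link across message sizes and GPU counts gives
// exactly the chart asked for above.
int main(void) {
    double bytes = 8.0 * 1024 * 1024;    // 8 MB allreduce, per the question
    int nranks[] = { 1024, 2048 };
    double time_s[] = { 1.0e-4, 1.2e-4 }; // hypothetical measured times
    for (int i = 0; i < 2; i++) {
        int n = nranks[i];
        double algBw = bytes / time_s[i] / 1e9;    // GB/s (algorithm bw)
        double busBw = algBw * 2.0 * (n - 1) / n;  // GB/s per link
        printf("%d GPUs: algBw=%.1f GB/s, busBw=%.1f GB/s\n", n, algBw, busBw);
    }
    return 0;
}
```

Sweeping that busBw number over message size would also answer the flooding question: the smallest size at which busBw plateaus at the link rate is the point where the network is saturated.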