Gaps get bigger and computation gets slower when overlapped with NCCL communication

Hello NVIDIA engineers, I am writing a distributed ResNet-50 training program, but I cannot get linear speedup on 2 GPUs. So I used nvprof to dump the timeline and found some strange behavior:


The gaps in the timeline get bigger and the computation kernels run slower when they overlap with NCCL communication.
Because of these gaps, my distributed training program only achieves about a 1.8x speedup on 2 GPUs.
The NCCL version is 2.6 and the CUDA version is 10.2.
How can I improve this? Am I missing some configuration?
I found a similar issue on GitHub, but no one has replied to it yet:
https://github.com/NVIDIA/nccl/issues/357
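
For reference, here is a minimal sketch of the kind of program I am running. It assumes PyTorch DistributedDataParallel on the NCCL backend with one process per GPU and a synthetic batch; my real program follows this general pattern but is larger, and the file name train.py below is just a placeholder.

import os
import torch
import torch.distributed as dist
import torchvision

def main():
    # One process per GPU; torch.distributed.launch --use_env sets LOCAL_RANK.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    model = torchvision.models.resnet50().cuda(local_rank)
    # DDP overlaps the gradient all-reduce (NCCL) with the backward pass.
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])

    criterion = torch.nn.CrossEntropyLoss().cuda(local_rank)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

    # Synthetic batch, just to profile the compute/communication overlap.
    images = torch.randn(64, 3, 224, 224, device=local_rank)
    labels = torch.randint(0, 1000, (64,), device=local_rank)

    for _ in range(100):
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()   # NCCL all-reduce kernels run here, overlapped with compute
        optimizer.step()

if __name__ == "__main__":
    main()

I collect one timeline per process with a command like "nvprof --profile-child-processes -o timeline_%p.nvvp python -m torch.distributed.launch --use_env --nproc_per_node=2 train.py" and then open the resulting .nvvp files in the Visual Profiler, which is where I see the gaps described above.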