Hi,
I am working on an iterative solver that uses CUDA for the local computations and CUDA-aware MPI for communication. After noticing that MPI's Reduce and ReduceScatter collectives perform poorly with large message sizes on Perlmutter, I switched to NCCL (2.18.3) and obtained a decent speedup over Cray-MPICH (8.1.28 with OFI).
Recently, however, I observed something interesting: placing cudaDeviceSynchronize() before and after the NCCL collectives consistently improves performance even further, across different inputs and varying numbers of GPUs. For example, a run that takes about 7 seconds on 190 GPUs finishes in around 3 seconds with the synchronization added.
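Concretely, the two variants look roughly like the sketch below (not my actual code; the buffer names, the double datatype, and the use of a single stream per collective are assumptions for illustration):

#include <cuda_runtime.h>
#include <nccl.h>

// Variant 1: collective enqueued asynchronously on a stream, so local
// kernels issued on other streams can in principle overlap with it.
void reduce_scatter_async(const double* d_send, double* d_recv, size_t recvcount,
                          ncclComm_t comm, cudaStream_t stream) {
  ncclReduceScatter(d_send, d_recv, recvcount, ncclDouble, ncclSum, comm, stream);
}

// Variant 2: the same collective bracketed by cudaDeviceSynchronize(),
// which is the version that runs roughly 2x faster in my experiments.
void reduce_scatter_synced(const double* d_send, double* d_recv, size_t recvcount,
                           ncclComm_t comm, cudaStream_t stream) {
  cudaDeviceSynchronize();  // drain all previously launched work on the device
  ncclReduceScatter(d_send, d_recv, recvcount, ncclDouble, ncclSum, comm, stream);
  cudaDeviceSynchronize();  // block until the collective has completed
}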
To help visualize the flow, I am attaching below the computational graph without any synchronization:
and with synchronization before and after each collective:
I would expect the first version to perform better, since it allows communication to overlap with computation. I cannot explain why that is not the case, and I was hoping you could help me understand what I am observing.
Why do you think adding synchronization improves the time to solution?
Thanks!