NCCL performing better with synchronization

Hi,

I am working on an iterative solver that uses CUDA for local computations and CUDA-aware MPI for communication. After noticing that MPI performs poorly on Reduce and ReduceScatter collectives with large message sizes on Perlmutter, I switched to NCCL (2.18.3) and obtained a decent speedup over Cray-MPICH (8.1.28 with OFI).

But recently I noticed something interesting: placing cudaDeviceSynchronize() before and after the NCCL collectives consistently improves performance even further, across different inputs and varying numbers of GPUs. For example, a run that takes about 7 seconds on 190 GPUs finishes in around 3 seconds with synchronization.
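
To make the two variants concrete, here is a minimal sketch of what I mean, using ncclReduceScatter as an example. The function name, buffer names, and the `sync` flag are just for illustration; the communicator, stream, and device buffers are assumed to be set up elsewhere, and the actual code interleaves these collectives with the solver's kernels.

```cpp
#include <cuda_runtime.h>
#include <nccl.h>

// Illustrative helper: one collective step, with or without the extra
// device-wide synchronization around it. `comm` is an initialized
// ncclComm_t and `stream` is the stream the local compute kernels use.
void reduce_scatter_step(const float* sendbuf, float* recvbuf, size_t recvcount,
                         ncclComm_t comm, cudaStream_t stream, bool sync)
{
    if (sync) {
        // Variant with synchronization: drain all prior device work
        // before the collective is enqueued.
        cudaDeviceSynchronize();
    }

    // The collective itself is enqueued on `stream`; without the surrounding
    // syncs it can overlap with independent compute on other streams.
    ncclReduceScatter(sendbuf, recvbuf, recvcount, ncclFloat, ncclSum, comm, stream);

    if (sync) {
        // ...and wait for the collective to finish before launching the
        // next kernels.
        cudaDeviceSynchronize();
    }
}
```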

Just to help visualize the flow, I am attaching below the computational graph without any synchronization:

and with synchronization before and after each collective:

I would expect the first version to perform better, since it allows overlapping communication with computation. I cannot explain why that is not the case and was hoping you could help me understand what I am observing.

Why do you think adding synchronization improves the time to solution?

Thanks!
