Hi,
I am working on an iterative solver that uses CUDA for the local computations and CUDA-aware MPI for communication. After noticing that MPI's Reduce and ReduceScatter collectives perform poorly with large message sizes on Perlmutter, I switched to NCCL (2.18.3) and obtained a decent speedup over Cray-MPICH (8.1.28 with OFI).
Recently, however, I observed something interesting: placing cudaDeviceSynchronize() before and after the NCCL collectives consistently improves performance even further, across different inputs and varying numbers of GPUs. For example, a run that takes about 7 seconds on 190 GPUs finishes in around 3 seconds with the synchronization added.
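Concretely, the two variants look roughly like the sketch below (not my actual code; the buffer names, the double datatype, and the use of a single stream per collective are assumptions for illustration):

#include <cuda_runtime.h>
#include <nccl.h>

// Variant 1: collective enqueued asynchronously on a stream, so local
// kernels issued on other streams can in principle overlap with it.
void reduce_scatter_async(const double* d_send, double* d_recv, size_t recvcount,
                          ncclComm_t comm, cudaStream_t stream) {
  ncclReduceScatter(d_send, d_recv, recvcount, ncclDouble, ncclSum, comm, stream);
}

// Variant 2: the same collective bracketed by cudaDeviceSynchronize(),
// which is the version that runs roughly 2x faster in my experiments.
void reduce_scatter_synced(const double* d_send, double* d_recv, size_t recvcount,
                           ncclComm_t comm, cudaStream_t stream) {
  cudaDeviceSynchronize();  // drain all previously launched work on the device
  ncclReduceScatter(d_send, d_recv, recvcount, ncclDouble, ncclSum, comm, stream);
  cudaDeviceSynchronize();  // block until the collective has completed
}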
To help visualize the flow, I am attaching below the computational graph without any synchronization:
and with synchronization before and after each collective:
I would expect the first version to perform better, since it allows communication to overlap with computation. I cannot explain why that is not the case, and I was hoping you could help me understand what I am observing.
Why do you think adding synchronization improves the time to solution?
Thanks!