I took a trace of part of a CUDA-aware MPI application, and cuStreamSynchronize appears to have a 10x overhead. What is odd is that the memcpy finished long before the synchronize call returned (please see the attached snapshot). What could the synchronization (driver) be doing that keeps it from returning after the transfer is complete?
The screenshot looks like it was taken from the Nsight Systems tool, while this forum branch is dedicated to cuda-gdb support. Could you move the topic to the Nsight Systems forum branch: Nsight Systems - NVIDIA Developer Forums