Long overhead with cuStreamSynchronize with OMPI

Hi guys,
I took a trace of part of a CUDA-aware MPI application, and cuStreamSynchronize appears to have a 10x overhead. What is odd is that the memcpy finished long before the synchronize call returned (please see the attached snapshot). What could the synchronization (driver) be doing that keeps it from returning after the transfer is complete?
Thanks, Noob-Noob

Hi @N00b-N00b
The screenshot looks like it was taken from the Nsight Systems tool, while this forum branch is dedicated to cuda-gdb support. Could you move the topic to the Nsight Systems forum branch (Nsight Systems - NVIDIA Developer Forums)?

Done: Long overhead with cuStreamSynchronize with OMPI