cudaStreamSynchronize blocking for a long time after kernels have finished

In the following screenshot, why does cudaStreamSynchronize block the host for so long after the kernels have finished?

Both kernels read from zero-copy memory, do some processing, then write to different zero-copy memory.
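
For reference, here is a minimal sketch of this kind of setup. It is not the actual application code: the kernel `scale`, the buffer names, and the sizes are hypothetical stand-ins. It launches two kernels on one stream that read from mapped (zero-copy) host memory and write to separate mapped buffers, then compares the GPU time measured with CUDA events against the host wall-clock time up to the return of cudaStreamSynchronize. A large difference between the two numbers would roughly correspond to the gap seen in the timeline.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Abort main() with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Hypothetical stand-in for the real processing kernels.
__global__ void scale(const float* in, float* out, int n, float k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * k;
}

int main() {
    const int n = 1 << 20;                 // assumed problem size
    const size_t bytes = n * sizeof(float);

    // Allow mapping of pinned host memory into the device address space.
    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));

    // Zero-copy buffers: one input, two separate outputs.
    float *hIn, *hOut1, *hOut2;
    CHECK(cudaHostAlloc((void**)&hIn,   bytes, cudaHostAllocMapped));
    CHECK(cudaHostAlloc((void**)&hOut1, bytes, cudaHostAllocMapped));
    CHECK(cudaHostAlloc((void**)&hOut2, bytes, cudaHostAllocMapped));
    for (int i = 0; i < n; ++i) hIn[i] = 1.0f;

    // Device-side aliases of the mapped host buffers.
    float *dIn, *dOut1, *dOut2;
    CHECK(cudaHostGetDevicePointer((void**)&dIn,   hIn,   0));
    CHECK(cudaHostGetDevicePointer((void**)&dOut1, hOut1, 0));
    CHECK(cudaHostGetDevicePointer((void**)&dOut2, hOut2, 0));

    cudaStream_t stream;
    cudaEvent_t start, stop;
    CHECK(cudaStreamCreate(&stream));
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    auto t0 = std::chrono::steady_clock::now();
    CHECK(cudaEventRecord(start, stream));
    scale<<<blocks, threads, 0, stream>>>(dIn, dOut1, n, 2.0f);
    scale<<<blocks, threads, 0, stream>>>(dIn, dOut2, n, 3.0f);
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(stop, stream));

    // Host blocks here until all work queued on the stream has completed.
    CHECK(cudaStreamSynchronize(stream));
    auto t1 = std::chrono::steady_clock::now();

    float gpuMs = 0.0f;
    CHECK(cudaEventElapsedTime(&gpuMs, start, stop));
    double hostMs =
        std::chrono::duration<double, std::milli>(t1 - t0).count();

    // hostMs also includes launch overhead, so it is only a rough bound.
    printf("GPU time between events: %.3f ms\n", gpuMs);
    printf("Host time until sync returned: %.3f ms\n", hostMs);

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFreeHost(hIn));
    CHECK(cudaFreeHost(hOut1));
    CHECK(cudaFreeHost(hOut2));
    return 0;
}
```

Built with something like `nvcc -o zerocopy_sync zerocopy_sync.cu` (file name assumed), this only reports the two timings; it does not by itself explain where the extra time goes.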


Would you mind sharing the profiling file with us so we can look into it further?

