cudaStreamSynchronize blocking for long time after kernels finished

In the following screenshot, why does cudaStreamSynchronize block the host for so long after the kernels have finished?

Both kernels read from zero-copy memory, do some processing, then write to a different zero-copy buffer.
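For reference, here is a minimal sketch of this kind of setup, which might help narrow down where the time goes. It is not the actual application code: the kernel `process`, the buffer size, and the launch configuration are all hypothetical stand-ins. It compares the GPU time measured with CUDA events against the host wall-clock time spent inside cudaStreamSynchronize.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Hypothetical stand-in kernel: reads one zero-copy buffer, writes another.
__global__ void process(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

int main() {
    const int n = 1 << 20;  // assumed size, for illustration only
    const size_t bytes = n * sizeof(float);

    // Allow mapped (zero-copy) host allocations; must precede context creation.
    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));

    float *hIn = nullptr, *hOut = nullptr, *dIn = nullptr, *dOut = nullptr;
    CHECK(cudaHostAlloc((void**)&hIn, bytes, cudaHostAllocMapped));
    CHECK(cudaHostAlloc((void**)&hOut, bytes, cudaHostAllocMapped));
    CHECK(cudaHostGetDevicePointer((void**)&dIn, hIn, 0));
    CHECK(cudaHostGetDevicePointer((void**)&dOut, hOut, 0));
    for (int i = 0; i < n; ++i) hIn[i] = 1.0f;

    cudaStream_t stream;
    cudaEvent_t start, stop;
    CHECK(cudaStreamCreate(&stream));
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // Bracket the kernel with events to get the GPU-side duration.
    CHECK(cudaEventRecord(start, stream));
    process<<<(n + 255) / 256, 256, 0, stream>>>(dIn, dOut, n);
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(stop, stream));

    // Time only the host-side wait inside cudaStreamSynchronize.
    auto t0 = std::chrono::steady_clock::now();
    CHECK(cudaStreamSynchronize(stream));
    auto t1 = std::chrono::steady_clock::now();

    float gpuMs = 0.0f;
    CHECK(cudaEventElapsedTime(&gpuMs, start, stop));
    double hostMs = std::chrono::duration<double, std::milli>(t1 - t0).count();
    printf("GPU time between events: %.3f ms\n", gpuMs);
    printf("Host time in cudaStreamSynchronize: %.3f ms\n", hostMs);

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFreeHost(hIn));
    CHECK(cudaFreeHost(hOut));
    return 0;
}
```

If the host wait is much longer than the GPU time between the events, the extra time is being spent outside kernel execution. This sketch does not identify the cause; it only helps confirm whether the gap seen in the profiler also appears without it.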


Would you mind sharing the profiling file with us so we can look into it further?

There has been no update from you for a while, so we assume this is no longer an issue.
Hence we are closing this topic. If you need further support, please open a new one. Thanks.