90% of time taken by cuStreamSynchronize

I have parallelized my application using OpenACC, and I ran it under nvprof to generate an application profile with the goal of optimizing it further.

The profile shows the time taken by user-level functions as expected. However, as far as the CUDA API calls are concerned, it shows that 90% of the time was spent in cuStreamSynchronize.

Is this indicative of a typical bottleneck? Based on this information, can you suggest a possible optimization? I believe cuStreamSynchronize indicates a large overhead from synchronizing vector threads. Maybe loop fusion would help? Or combining kernel regions, if possible?
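To make the "combine kernel regions" idea concrete, here is a minimal sketch (hypothetical arrays and functions, not from my actual code) of two separate kernels regions versus a fused one. Each separate region typically means its own kernel launch and its own synchronization point:

```c
#include <stddef.h>

/* Two separate kernels regions: each typically launches its own
   kernel(s), and the host blocks once per region. */
void two_regions(size_t n, float *restrict a, float *restrict b)
{
    #pragma acc kernels
    for (size_t i = 0; i < n; ++i)
        a[i] = a[i] * 2.0f;

    #pragma acc kernels
    for (size_t i = 0; i < n; ++i)
        b[i] = b[i] + a[i];
}

/* Fused version: one region, so fewer launches and one
   synchronization point for the same work. */
void fused_region(size_t n, float *restrict a, float *restrict b)
{
    #pragma acc kernels
    for (size_t i = 0; i < n; ++i) {
        a[i] = a[i] * 2.0f;
        b[i] = b[i] + a[i];
    }
}
```

Both versions compute the same result; the question is whether the launch/sync overhead is what actually dominates.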

Thanks for your help.

Hi Kshtij,

cuStreamSynchronize is where the host blocks waiting for the kernels to return. The time isn't really being spent in cuStreamSynchronize itself; rather, this is the time the host waits while the GPU is doing useful work.

Can you look at a timeline view in NVVP (the NVIDIA Visual Profiler) to see whether the "cuStreamSynchronize" spans line up with your compute kernels? That should give you a better idea of where the time is actually being spent.

Hope this helps,

Yep, I can see in the timeline view that cuStreamSynchronize is recorded while a kernel is executing on the GPU.

Thanks Mat.