I have parallelized my application using OpenACC and profiled it with nvprof, with the goal of optimizing it further.
The profile shows the time taken by user-level functions as expected. However, in the CUDA API section, it shows that 90% of the time was spent in cuStreamSynchronize.
Is this indicative of a typical bottleneck? Based on this information, can you suggest a possible optimization? My guess is that cuStreamSynchronize indicates a large overhead from synchronizing vector threads. Would loop fusion help, or combining kernel regions where possible?
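For concreteness, the loop fusion idea might look like the following sketch. The function names, arrays, and arithmetic are hypothetical placeholders rather than the actual application code, and whether this would reduce the time attributed to cuStreamSynchronize is exactly the open question.

```c
/* Hypothetical example: names and arithmetic are placeholders,
 * not taken from the real application. */

/* Unfused version: two parallel loops, so two separate kernels.
 * tmp is scratch space that only needs to exist on the device. */
void scale_then_add_separate(int n, const float *a, float *tmp, float *out)
{
    #pragma acc data copyin(a[0:n]) create(tmp[0:n]) copyout(out[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            tmp[i] = 2.0f * a[i];

        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            out[i] = tmp[i] + 1.0f;
    }
}

/* Fused version: one loop, one kernel, no intermediate array. */
void scale_then_add_fused(int n, const float *a, float *out)
{
    #pragma acc parallel loop copyin(a[0:n]) copyout(out[0:n])
    for (int i = 0; i < n; ++i)
        out[i] = 2.0f * a[i] + 1.0f;
}
```

Both functions compute the same result; without an OpenACC compiler the pragmas are ignored and the loops simply run on the host.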
Thanks for your help.