I am working with CUDA code that runs correctly but shows some strange behavior: two or three function calls are shown as taking an extremely long time in the profiler (see the picture provided).
It is often a ‘cudaLaunch’ that takes 110 ms according to the profiler, and sometimes a ‘cudaMemcpy’…
I am using some Thrust combined with homemade kernels, and I want to stress that the kernels before and after the cudaLaunch ‘gap’ are the same.
I do not expect a direct solution on this forum without the code (which I can’t provide), but has anyone seen this?
-> Could it be an NVVP display bug?
-> Judging by the speed of the release build outside NVVP, I suspect this phenomenon does not happen there.
-> How can I be sure the profiler is telling me the truth?
-> Can kernel launches stall if a great many kernels are launched asynchronously?
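To make the last two questions concrete, here is a minimal sketch of the kind of cross-check I have in mind: time each launch call on the host with `std::chrono`, independently of the profiler, and see whether any single call takes unusually long. Everything in it is a placeholder (`dummyKernel`, the buffer size, the launch count and the 1 ms threshold are all hypothetical, since I can't share the real code). If launches can indeed stall when many are queued, I would expect that to show up here as occasional large values.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; my real kernels are not shown in this post.
__global__ void dummyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main()
{
    const int n = 1 << 20;          // assumed buffer size
    const int launches = 10000;     // assumed number of back-to-back launches
    const double thresholdMs = 1.0; // arbitrary cutoff for a "slow" launch call

    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    // Warm-up launch so one-time initialization cost is not measured.
    dummyKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();

    double worstMs = 0.0;
    int worstIdx = -1;
    int slowCount = 0;

    for (int k = 0; k < launches; ++k) {
        // Host-side wall-clock time of the launch call itself,
        // not of the kernel's execution on the GPU.
        auto t0 = std::chrono::steady_clock::now();
        dummyKernel<<<(n + 255) / 256, 256>>>(d_data, n);
        auto t1 = std::chrono::steady_clock::now();

        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        if (ms > worstMs) { worstMs = ms; worstIdx = k; }
        if (ms > thresholdMs) ++slowCount;
    }

    // Wait for all queued kernels and report any launch or execution error.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        cudaFree(d_data);
        return 1;
    }

    std::printf("slowest launch call: %.3f ms (launch #%d)\n", worstMs, worstIdx);
    std::printf("launch calls slower than %.1f ms: %d of %d\n",
                thresholdMs, slowCount, launches);

    cudaFree(d_data);
    return 0;
}
```

Running this both standalone and under NVVP should show whether the long calls only appear when the profiler is attached.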
Feel free to share your insight on this one!