Strange cudaLaunch stall in NV Visual Profiler

Hi all,

I am working with CUDA code that works well, but it shows some strange behavior: 2 or 3 function calls are shown as taking an extremely long time in the profiler (see the picture provided).
It is often a ‘cudaLaunch’ that takes 110 ms according to the profiler, and sometimes a ‘cudaMemcpy’…

I am using some Thrust combined with homemade kernels, and I want to stress that the kernels before and after the cudaLaunch ‘gap’ are the same.

I do not expect a direct solution in this forum without the code (which I can’t provide), but has anyone seen this?
-> Could it be an NVVP display bug?
-> Judging by the speed of the release code outside NVVP, I suppose this phenomenon does not happen there.
-> How can I be sure the profiler is telling me the truth?
-> Can kernel launches stall if very many kernels are launched asynchronously?

Feel free to give me your insight on this one!


Replying to myself, because I have found the reason: this is due to nvvp itself, when it is flushing its profiling data (or doing something similar).
This can artificially create this “stall” on any function, but with the new version of nvvp (CUDA 5.0) this time is marked in red as “non accountable in real execution time”.

So much clearer now.
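
For anyone who wants to cross-check the profiler, here is a minimal sketch of how the launches could be timed without nvvp attached. It is not my real code: `dummyKernel`, the problem size and the launch count are hypothetical placeholders, and it assumes an nvcc that accepts C++11 (for `std::chrono`). It records the host-side duration of each launch call and the total GPU time via CUDA events. If a 110 ms launch never shows up here, the gap seen in nvvp is most likely profiler overhead. As far as I understand, a launch call can also block on the host once too many asynchronous launches are queued, and that case would show up here as a large worst-case launch time.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernels, which are not shown in this thread.
__global__ void dummyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main()
{
    const int n = 1 << 20;       // assumed problem size
    const int launches = 1000;   // assumed number of back-to-back launches
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d, 0, n * sizeof(float));

    // Warm-up launch so that one-time initialization cost is not measured.
    dummyKernel<<<blocks, threads>>>(d, n);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    double worstLaunchMs = 0.0;  // longest host-side launch call
    cudaEventRecord(start);
    for (int k = 0; k < launches; ++k) {
        auto t0 = std::chrono::steady_clock::now();
        dummyKernel<<<blocks, threads>>>(d, n);  // asynchronous launch
        auto t1 = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        if (ms > worstLaunchMs) worstLaunchMs = ms;
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until all queued kernels have finished

    float gpuMs = 0.0f;
    cudaEventElapsedTime(&gpuMs, start, stop);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("total GPU time for %d launches: %.3f ms\n", launches, gpuMs);
    printf("worst host-side launch call:    %.3f ms\n", worstLaunchMs);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}
```

Running the same binary once standalone and once under nvvp, then comparing the two numbers, should show whether the gap exists in real execution or only in the profiled run.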