CUDA Profiler Cost? How much time is added and where?

Hi,

When running the CUDA profiler on code, can anything be said about the cost of the profiling itself? I notice that an execution time of about 20 ms becomes around 25-30 ms with the profiler enabled.

  • Is this extra time included in either the gpu_time or the cpu_time field of the profiler output?

(I imagine that it works like this: The GPU time is unaffected, but there’s a small delay after each kernel execution while the profiler data is gathered up.)

I'm not sure whether the profiler still prevents overlap as of 2.2, but I think the way you describe it is basically correct for a single kernel.
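
One way to check where the extra time lands is to time a batch of launches yourself, once without the profiler and once with it, and compare both runs against the profiler's per-kernel gpu_time. The sketch below is only an illustration under a few assumptions: busyKernel is a hypothetical stand-in for your own kernel, the problem size and launch count are arbitrary, and cudaDeviceSynchronize assumes a toolkit newer than 2.2 (older toolkits used cudaThreadSynchronize). Note that the event time covers the whole batch, so any gap the profiler inserts between kernels would show up in it too.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: replace with the kernel you are profiling.
__global__ void busyKernel(float *data, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < iters; ++k)
            v = v * 1.0001f + 0.5f;
        data[i] = v;
    }
}

struct BatchTiming {
    float  gpuMs;   // time between the two events on the GPU timeline
    double wallMs;  // host wall-clock time for the same batch
};

// Launches busyKernel `launches` times and fills `out`.
// Returns false if any CUDA call fails (for example, no device is present).
bool timeBatch(int launches, BatchTiming *out)
{
    const int n = 1 << 20;                      // arbitrary problem size
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    const size_t bytes = n * sizeof(float);

    float *d = nullptr;
    cudaEvent_t start = nullptr, stop = nullptr;
    bool ok = cudaMalloc(&d, bytes) == cudaSuccess
           && cudaMemset(d, 0, bytes) == cudaSuccess
           && cudaEventCreate(&start) == cudaSuccess
           && cudaEventCreate(&stop) == cudaSuccess;

    if (ok) {
        // Warm-up launch so one-time initialization is not measured.
        busyKernel<<<blocks, threads>>>(d, n, 100);
        ok = cudaDeviceSynchronize() == cudaSuccess;
    }

    if (ok) {
        auto t0 = std::chrono::steady_clock::now();
        cudaEventRecord(start);
        for (int k = 0; k < launches; ++k)
            busyKernel<<<blocks, threads>>>(d, n, 100);
        cudaEventRecord(stop);
        ok = cudaEventSynchronize(stop) == cudaSuccess;  // wait for the batch
        auto t1 = std::chrono::steady_clock::now();

        if (ok) {
            ok = cudaEventElapsedTime(&out->gpuMs, start, stop) == cudaSuccess;
            out->wallMs =
                std::chrono::duration<double, std::milli>(t1 - t0).count();
        }
    }

    if (start) cudaEventDestroy(start);
    if (stop)  cudaEventDestroy(stop);
    cudaFree(d);  // freeing a null pointer is a no-op
    return ok;
}

int main()
{
    const int launches = 100;  // arbitrary batch size
    BatchTiming t = {0.0f, 0.0};
    if (!timeBatch(launches, &t)) {
        std::fprintf(stderr, "CUDA call failed: %s\n",
                     cudaGetErrorString(cudaGetLastError()));
        return 1;
    }
    std::printf("event time: %.3f ms total, %.4f ms per launch\n",
                t.gpuMs, t.gpuMs / launches);
    std::printf("wall time:  %.3f ms total, %.4f ms per launch\n",
                t.wallMs, t.wallMs / launches);
    return 0;
}
```

If the per-launch wall time grows under the profiler while the gpu_time the profiler reports for each kernel stays close to the unprofiled per-launch figure, that would be consistent with the overhead sitting between kernels rather than inside them.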