Kernel Overhead/Profiler Accuracy

I have a kernel that takes ~4ms to execute based on timing it with QueryPerformanceCounter and QueryPerformanceFrequency on Windows. According to CUDA's visual profiler, the GPU execution time is 70us. So, can kernels really have overhead in the ms range, or is there a problem with the profiler?
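
A minimal sketch of the kind of host-side timing being described, assuming a cudaThreadSynchronize() after the launch (the kernel name and launch configuration are just placeholders):

```cpp
// QueryPerformanceCounter measures wall-clock time on the CPU, so this interval
// includes launch overhead, driver work, and the cost of waiting for the GPU,
// not just the kernel's execution time reported by the profiler.
#include <windows.h>
#include <cuda_runtime.h>
#include <cstdio>

__global__ void myKernel(float *data) { /* ... */ }

int main()
{
    float *d_data;
    cudaMalloc((void**)&d_data, 1024 * sizeof(float));

    LARGE_INTEGER freq, start, stop;
    QueryPerformanceFrequency(&freq);

    QueryPerformanceCounter(&start);
    myKernel<<<64, 256>>>(d_data);
    cudaThreadSynchronize();          // block until the kernel has finished
    QueryPerformanceCounter(&stop);

    double ms = 1000.0 * (stop.QuadPart - start.QuadPart) / (double)freq.QuadPart;
    printf("launch + kernel + sync: %.3f ms\n", ms);

    cudaFree(d_data);
    return 0;
}
```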

Version 0.2 of the visual profiler also shows the CPU time; does that value also differ that much from the QueryPerformanceCounter measurement?

For me the output of the visual profiler has been very stable over time, so I have the feeling it can be trusted.

Well, if you give Nvidia the benefit of the doubt on the profiler, the next question is why some kernels have an overhead of ~40us while others have an overhead of ~4ms. What could possibly increase the CPU latency and overhead for that kernel by two orders of magnitude beyond normal?

Do you have other threads or background tasks running? When profiling, there is an implicit thread synchronize after every kernel call which spin-waits with a thread yield (at least in CUDA 1.1, I’m not sure if this changed in 2.0 beta). If other threads are vying for CPU time, this can introduce a significant delay.

That said, 4ms is a bit excessive. I've personally never seen that kind of kernel launch overhead.
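
If you want a timing number that doesn't depend on the host thread getting scheduled promptly, CUDA events might be worth a cross-check, since they are timestamped by the GPU itself. A rough sketch (kernel name and launch configuration are placeholders):

```cpp
// The interval between the two events is measured on the GPU, so it is
// insensitive to other CPU threads stealing the time slice while the host
// waits for the result.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void myKernel(float *data) { /* ... */ }

int main()
{
    float *d_data;
    cudaMalloc((void**)&d_data, 1024 * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    myKernel<<<64, 256>>>(d_data);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);              // wait until the stop event is recorded

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds
    printf("GPU time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```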

I had a cudaThreadSynchronize() after the kernel invocation to prevent the QueryPerformanceCounter call from executing before the kernel finished. Unlike cudaStreamQuery, that yields the processor. Now that I see what the problem is, it should have been really obvious to me.
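
For completeness, a sketch of what the alternative might look like: busy-polling cudaStreamQuery() on the default stream instead of calling cudaThreadSynchronize(), so the timing thread keeps its time slice while it waits (the kernel name, launch configuration, and helper function are placeholders):

```cpp
// Spin-waiting burns a CPU core, but the waiting thread is not descheduled,
// so other threads or background tasks cannot inflate the measured latency.
#include <cuda_runtime.h>

__global__ void myKernel(float *data) { /* ... */ }

void launchAndSpinWait(float *d_data)
{
    myKernel<<<64, 256>>>(d_data);

    // cudaStreamQuery returns cudaErrorNotReady while work in the stream is
    // still pending, and cudaSuccess once it has completed.
    while (cudaStreamQuery(0) == cudaErrorNotReady)
    {
        /* spin */
    }
}
```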