I have a loop that runs a series of kernels 1000 times. Putting a cudaEventRecord/cudaEventSynchronize pair before and after the loop gives an elapsed time of 620 ms. According to the Visual Profiler, the total GPU execution time is only 170 ms. Why am I seeing such a large discrepancy?
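For reference, a timing setup like the one described might look like the following sketch. The kernel name, launch configuration, and data are placeholders, not the poster's actual code; the point is where the event pair sits relative to the loop.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real ones in the loop.
__global__ void myKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);            // event before the loop
    for (int iter = 0; iter < 1000; ++iter) {
        myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    }
    cudaEventRecord(stop, 0);             // event after the loop
    cudaEventSynchronize(stop);           // wait for everything to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("elapsed: %.1f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

Because the events are recorded on the same stream as the kernels, the measured interval covers the whole span from the first launch to the last completion, including any gaps between kernels where the GPU sits idle waiting for the CPU to issue the next launch.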
You are seeing the overhead of launching the kernels. The profiler tells you how long the GPU was busy; the event timing also includes the time the CPU spends preparing input for the kernels and launching them.
170 ms over 1000 iterations is only ~170 µs of GPU work per iteration, which is not a lot of time, so the launch overhead is relatively high. If you have 10 µs of overhead on a kernel call that keeps the GPU busy for 1 s, it is noise.
Well, I call four kernels in each iteration of the loop. Attributing the difference between the GPU time and the elapsed time to overhead gives 450 µs of overhead per iteration, or ~110 µs per kernel invocation. That seems quite severe, but I'll try to combine multiple small kernel invocations into a single large kernel.
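Combining small launches into one can be sketched as follows. These are hypothetical elementwise kernels, not the poster's four; the idea is simply that two (or four) back-to-back passes over the same data can often be fused into a single kernel, paying the launch overhead once instead of per pass.

```cuda
// Two separate small kernels: two launches, two passes over memory.
__global__ void scaleKernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}
__global__ void offsetKernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

// Fused version: one launch, one pass, same result.
__global__ void scaleOffsetKernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}
```

Besides saving the launch overhead, fusing also removes a full read and write of the array between the two steps, so it can help even when the launches themselves are cheap.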
When I comment out all the code in the kernels, the cudaEventRecord/cudaEventSynchronize pair reports 100 ms of execution time. The GPU execution time plus the overhead of running dummy kernels = 170 ms + 100 ms = 270 ms, which is still quite a bit less than the 620 ms I see when running the real kernel code. Why would the whole be so much more than the sum of the parts?
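The empty-kernel experiment described above can be isolated into a small benchmark that measures pure launch cost. This is a sketch under the assumption that a single warm-up launch is enough to exclude one-time initialization from the measurement.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Does no work at all, so the elapsed time is almost entirely launch overhead.
__global__ void emptyKernel() {}

int main() {
    const int launches = 1000;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    emptyKernel<<<1, 1>>>();      // warm-up: first launch pays one-time costs
    cudaDeviceSynchronize();

    cudaEventRecord(start, 0);
    for (int i = 0; i < launches; ++i)
        emptyKernel<<<1, 1>>>();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%.1f us per launch\n", 1000.0f * ms / launches);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

Note that this measures overhead for a trivial kernel with no arguments; as the next reply points out, the real kernels' argument sizes and resource usage can make each launch more expensive than this lower bound.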
Overhead depends on the amount of input passed to the kernel, the size of the kernel itself (which you changed by commenting out its body), use of textures, etc.
I found that selecting more counters for the profiler to capture can add significantly to the execution time, so some of what I have seen can be attributed to the software overhead of the profiler. I also read that the clock speed is lowered when certain hardware counters are used, so that may have been an additional factor.
Okay, now that is interesting; I will have to rerun my code with the profiler recording only timing. Normally I watch all signals, just to make sure I can find out what happened when I see something strange.