I have a loop that runs a series of kernels 1000 times. Putting a cudaEventRecord/cudaEventSynchronize pair before and after the loop gives an elapsed time of 620 ms. According to the Visual Profiler, the total GPU execution time is only 170 ms. Why am I seeing such a large discrepancy?
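For reference, a timing setup like the one described might look like the following sketch. The kernel name, launch configuration, and data are placeholders, not the poster's actual code; the point is where the event pair sits relative to the loop.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real ones in the loop.
__global__ void myKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);            // event before the loop
    for (int iter = 0; iter < 1000; ++iter) {
        myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    }
    cudaEventRecord(stop, 0);             // event after the loop
    cudaEventSynchronize(stop);           // wait for everything to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("elapsed: %.1f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

Because the events are recorded on the same stream as the kernels, the measured interval covers the whole span from the first launch to the last completion, including any gaps between kernels where the GPU sits idle waiting for the CPU to issue the next launch.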
You are seeing the overhead of launching the kernels. The profiler tells you how long the GPU was busy; the event timing also includes the time the CPU spends preparing input for the kernels and launching them.
170 ms over 1000 iterations is only ~170 µs of GPU work per iteration, which is not a lot of time, so the launch overhead is relatively high. If you have 10 µs of overhead on a kernel call that keeps the GPU busy for 1 s, it is noise.
Well, I call four kernels in each iteration of the loop. Attributing the difference between the GPU time and the elapsed time to overhead gives 450 µs of overhead per iteration, or ~110 µs per kernel invocation. That seems quite severe, but I'll try to combine multiple small kernel invocations into a single large kernel.
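Combining small launches into one can be sketched as follows. These are hypothetical elementwise kernels, not the poster's four; the idea is simply that two (or four) back-to-back passes over the same data can often be fused into a single kernel, paying the launch overhead once instead of per pass.

```cuda
// Two separate small kernels: two launches, two passes over memory.
__global__ void scaleKernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}
__global__ void offsetKernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

// Fused version: one launch, one pass, same result.
__global__ void scaleOffsetKernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}
```

Besides saving the launch overhead, fusing also removes a full read and write of the array between the two steps, so it can help even when the launches themselves are cheap.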
When I comment out all the code in the kernels, the cudaEventRecord/cudaEventSynchronize pair reports 100 ms of execution time. The GPU execution time plus the overhead of running dummy kernels = 170 ms + 100 ms = 270 ms, which is still quite a bit less than the 620 ms I see when running the real kernel code. Why would the whole be so much more than the sum of the parts?
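The empty-kernel experiment described above can be isolated into a small benchmark that measures pure launch cost. This is a sketch under the assumption that a single warm-up launch is enough to exclude one-time initialization from the measurement.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Does no work at all, so the elapsed time is almost entirely launch overhead.
__global__ void emptyKernel() {}

int main() {
    const int launches = 1000;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    emptyKernel<<<1, 1>>>();      // warm-up: first launch pays one-time costs
    cudaDeviceSynchronize();

    cudaEventRecord(start, 0);
    for (int i = 0; i < launches; ++i)
        emptyKernel<<<1, 1>>>();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%.1f us per launch\n", 1000.0f * ms / launches);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

Note that this measures overhead for a trivial kernel with no arguments; as the next reply points out, the real kernels' argument sizes and resource usage can make each launch more expensive than this lower bound.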
Overhead depends on the amount of input passed to the kernel, the size of the kernel itself (which you changed by commenting out its body), use of textures, etc.
I found that selecting more counters for the profiler to capture can add significantly to the execution time, so some of what I have seen can be attributed to the software overhead of the profiler. I also read that the clock speed is lowered when certain hardware counters are used, so that may have been an additional factor.
Okay, now that is interesting; I will have to rerun my code with the profiler recording only timing. Normally I watch all signals, just to make sure I can find out what happened when I see something strange.