What determines the amount of time spent in my `cudaDeviceSynchronize` call?

I have some trouble understanding the CUDA synchronization call. From my reading of the nvprof output, the runtime of a GPU program consists of two parts: GPU kernel runtime and CUDA API runtime. These parts are complementary to each other, so we have

Total Runtime = GPU Activities Runtime + CUDA API Runtime
// Assuming that the application is GPU-intensive.

First question: in common use cases, is this assumption true? (Put another way: is it true that, in the nvprof report, the GPU activity of kernel A does not overlap with the CUDA API calls (especially the synchronization call) associated with kernel A?)

Imagine that we have a large kernel A and a small kernel B. It is obvious that the GPU kernel time of A will be greater than that of B. But what about the time for the “cudaDeviceSynchronize” calls? Is it always guaranteed that A will spend more time synchronizing than B? What factors determine the length of cudaDeviceSynchronize calls?

Suppose that we have the following program:

float *a, *b, *c;  time_T tic, toc, t_A, t_B;

tic = time();
kernel_A <<< ... >>> (a, b, c);
toc = time(); t_A = toc - tic;

tic = time();
kernel_B <<< ... >>> (a, b, c);
toc = time(); t_B = toc - tic;

Let us assume that kernel_B does the elementwise computation c = a + b and kernel_A does the same thing, except for 10 iterations.
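For concreteness, the two kernels described could be sketched like this (a sketch only: the array length n, grid/block configuration, and the exact loop structure of kernel_A are my assumptions, not given in the question):

```cuda
// kernel_B: one elementwise addition, c = a + b
__global__ void kernel_B(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

// kernel_A: the same computation repeated for 10 iterations,
// so it performs roughly 10x the work of kernel_B
__global__ void kernel_A(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        for (int k = 0; k < 10; k++) {
            c[i] = a[i] + b[i];
        }
    }
}
```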

Obviously, from our perspective, kernel_A should take longer to execute than kernel_B (i.e. t_A > t_B). The question is: why does kernel_A take longer to execute?

According to the runtime formula given by nvprof, which states that Total Runtime = GPU Kernel Runtime + CUDA API Runtime, there are three possible explanations:

  • kernel_A has longer GPU Kernel Runtime.
  • kernel_A has longer CUDA API Runtime (i.e. cudaDeviceSynchronize).
  • kernel_A is longer in both components.

Second question, which one of the above explanations is right and why?

This seems to be a contradiction in itself. Your premise is that kernel A does ten times the work of kernel B, so “obviously it should take longer” than kernel B, as you state. So how can it now be a “problem” when “kernel A takes longer to execute”? That is what you expected to begin with.

If we define cudaDeviceSynchronize() overhead as the time taken between the GPU going idle at the end of a kernel’s execution to the time control returns to the caller of cudaDeviceSynchronize(), that overhead should be approximately the same regardless of the kind of kernel that was running.

There may be some variation in this overhead because the host-side code and data involved in the synchronization may be inside or outside caches, or there may be some contention in the PCIe subsystem that returns the “all idle” notification from the GPU. And the overall delay may differ somewhat based on the general speed of the host system and the configuration of the PCIe link.

Note that your timing methodology is flawed if your intention is to isolate the kernel run time as seen by the host. The GPU may be busy at the time you invoke the kernel, so the kernel might not be able to run for quite some time, yet you have already started the timer. You would want something like this:

cudaDeviceSynchronize();         // make sure all previous GPU work has finished
tic = time();                    // start the timer
kernel_A <<< ... >>> (a, b, c);  // run the kernel
cudaDeviceSynchronize();         // make sure kernel is completely done
toc = time(); t_A = toc - tic;   // stop timer and compute elapsed wall clock time for executing kernel

The elapsed time you measure this way includes cudaDeviceSynchronize() overhead, which you may want to remove by calibration, i.e. running the same framework without invoking a kernel. To eliminate cold-start effects, you would want to stick everything into a loop and repeat a few times. E.g.

#define NUM_REPEATS (3)
for (int i = 0; i < NUM_REPEATS; i++) {
    // calibration: same framework, but without invoking a kernel
    cudaDeviceSynchronize();
    tic = time();
    cudaDeviceSynchronize();
    toc = time(); overhead = toc - tic;

    // timed kernel invocation
    cudaDeviceSynchronize();
    tic = time();
    kernel_A <<< ... >>> (a, b, c);
    cudaDeviceSynchronize();
    toc = time(); elapsed = (toc - tic) - overhead;
    printf ("kernel_A executes in %lld ticktocks\n", elapsed);
}
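As an aside, if you only need the GPU-side execution time of the kernel (rather than the time as seen by the host), CUDA events avoid the host timer and the synchronization-overhead calibration entirely. A minimal sketch, assuming the same placeholder launch configuration as above:

```cuda
// Time kernel_A on the GPU itself using CUDA events.
// Grid/block dimensions and kernel arguments are placeholders.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);           // timestamp before the kernel, in stream 0
kernel_A <<< grid, block >>> (a, b, c);
cudaEventRecord(stop, 0);            // timestamp after the kernel completes
cudaEventSynchronize(stop);          // wait until the stop event has been recorded

float elapsed_ms = 0.0f;
cudaEventElapsedTime(&elapsed_ms, start, stop);  // elapsed time in milliseconds
printf("kernel_A executed in %f ms\n", elapsed_ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Because the events are recorded by the GPU as it reaches them in the stream, this measurement excludes launch latency and host-side synchronization overhead, though it still benefits from a warm-up run to eliminate cold-start effects.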