What determines the amount of time spent in my `cudaDeviceSynchronize` call?

I have some trouble understanding the CUDA synchronization call. From my reading of the nvprof output, the runtime of a GPU program consists of two parts: GPU kernel runtime and CUDA API runtime. These parts are complementary to each other, so we have

Total Runtime = GPU Activities Runtime + CUDA API Runtime
// Assuming that the application is GPU-intensive.

First question: in common use cases, is this assumption true? (Put another way: is it true that, in the nvprof report, the GPU activity of kernel A does not overlap with the CUDA API calls (especially the synchronization call) associated with kernel A?)

Imagine that we have a large kernel A and a small kernel B. It is obvious that the GPU kernel time of A will be greater than that of B. But what about the time for the “cudaDeviceSynchronize” calls? Is it always guaranteed that A will spend more time synchronizing than B? What factors determine the length of cudaDeviceSynchronize calls?

Suppose that we have the following program:

float *a, *b, *c;  time_T tic, toc, t_A, t_B;

tic = time();
kernel_A <<< ... >>> (a, b, c);
toc = time(); t_A = toc - tic;

tic = time();
kernel_B <<< ... >>> (a, b, c);
toc = time(); t_B = toc - tic;

Let us assume that kernel_B does the elementwise computation c = a + b and kernel_A does the same thing, except for 10 iterations.
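For concreteness, the two kernels described could be sketched like this (a sketch only: the array length n, grid/block configuration, and the exact loop structure of kernel_A are my assumptions, not given in the question):

```cuda
// kernel_B: one elementwise addition, c = a + b
__global__ void kernel_B(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

// kernel_A: the same computation repeated for 10 iterations,
// so it performs roughly 10x the work of kernel_B
__global__ void kernel_A(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        for (int k = 0; k < 10; k++) {
            c[i] = a[i] + b[i];
        }
    }
}
```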

Obviously, from our perspective, kernel_A should take longer to execute than kernel_B (i.e. t_A > t_B). The question is: why does kernel_A take longer to execute?

According to the runtime formula given by nvprof, which states that Total Runtime = GPU Kernel Runtime + CUDA API Runtime, there are three possible explanations:

  • kernel_A has longer GPU Kernel Runtime.
  • kernel_A has longer CUDA API Runtime (i.e. cudaDeviceSynchronize).
  • kernel_A is longer in both components.

Second question, which one of the above explanations is right and why?

This seems to be a contradiction in itself. Your premise is that kernel A does ten times the work of kernel B, so “obviously it should take longer” than kernel B, as you state. So how can it now be a “problem” when “kernel A takes longer to execute”? That is what you expected to begin with.

If we define cudaDeviceSynchronize() overhead as the time taken between the GPU going idle at the end of a kernel’s execution to the time control returns to the caller of cudaDeviceSynchronize(), that overhead should be approximately the same regardless of the kind of kernel that was running.

There may be some variation in this overhead because the host-side code and data involved in the synchronization may be inside or outside caches, or there may be some contention in the PCIe subsystem that returns the “all idle” notification from the GPU. And the overall delay may differ somewhat based on the general speed of the host system and the configuration of the PCIe link.

Note that your timing methodology is flawed if your intention is to isolate the kernel run time as seen by the host. The GPU may be busy at the time you invoke the kernel, so the kernel might not be able to run for quite some time, yet you have already started the timer. You would want something like this:

cudaDeviceSynchronize();         // make sure all previous GPU work has finished
tic = time();                    // start the timer
kernel_A <<< ... >>> (a, b, c);  // run the kernel
cudaDeviceSynchronize();         // make sure kernel is completely done
toc = time(); t_A = toc - tic;   // stop timer and compute elapsed wall clock time for executing kernel

The elapsed time you measure this way includes cudaDeviceSynchronize() overhead, which you may want to remove by calibration, i.e. running the same framework without invoking a kernel. To eliminate cold-start effects, you would want to stick everything into a loop and repeat a few times. E.g.

#define NUM_REPEATS (3)
for (int i = 0; i < NUM_REPEATS; i++) {
    // calibration: same framework, but without invoking a kernel
    cudaDeviceSynchronize();
    tic = time();
    cudaDeviceSynchronize();
    toc = time(); overhead = toc - tic;

    // timed kernel invocation
    cudaDeviceSynchronize();
    tic = time();
    kernel_A <<< ... >>> (a, b, c);
    cudaDeviceSynchronize();
    toc = time(); elapsed = (toc - tic) - overhead;
    printf ("kernel_A executes in %lld ticktocks\n", elapsed);
}
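As an aside, if you only need the GPU-side execution time of the kernel (rather than the time as seen by the host), CUDA events avoid the host timer and the synchronization-overhead calibration entirely. A minimal sketch, assuming the same placeholder launch configuration as above:

```cuda
// Time kernel_A on the GPU itself using CUDA events.
// Grid/block dimensions and kernel arguments are placeholders.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);           // timestamp before the kernel, in stream 0
kernel_A <<< grid, block >>> (a, b, c);
cudaEventRecord(stop, 0);            // timestamp after the kernel completes
cudaEventSynchronize(stop);          // wait until the stop event has been recorded

float elapsed_ms = 0.0f;
cudaEventElapsedTime(&elapsed_ms, start, stop);  // elapsed time in milliseconds
printf("kernel_A executed in %f ms\n", elapsed_ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Because the events are recorded by the GPU as it reaches them in the stream, this measurement excludes launch latency and host-side synchronization overhead, though it still benefits from a warm-up run to eliminate cold-start effects.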