I have some trouble understanding the CUDA synchronization call. From my understanding of the nvprof output, the runtime of a GPU program consists of two parts, the GPU kernel runtime and the CUDA API runtime, and these two parts are complementary, so we have
Total Runtime = GPU Activities Runtime + CUDA API Runtime
// Assuming that the application is GPU-intensive.
First question: in common use cases, is this assumption true? (Another way of asking: is it true that, in the nvprof report, the GPU activity of kernel A does not overlap with the CUDA API calls (especially the synchronization call) issued for kernel A?)
Imagine that we have a large kernel A and a small kernel B. Obviously the GPU kernel time of A will be greater than that of B. But what about the time spent in the cudaDeviceSynchronize calls? Is it always guaranteed that A will spend more time synchronizing than B? What factors determine the length of a cudaDeviceSynchronize call?
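To make this concrete, here is a minimal sketch of how I would time the launch call and the cudaDeviceSynchronize call separately (the kernel, the problem size, the launch configuration, and the std::chrono timer are my own placeholders, not part of my actual program):

#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

// Placeholder elementwise kernel, just so there is something to launch.
__global__ void dummy_kernel(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMalloc(&c, n * sizeof(float));

    auto t0 = std::chrono::steady_clock::now();
    dummy_kernel<<<(n + 255) / 256, 256>>>(a, b, c, n);  // the launch call returns to the host quickly
    auto t1 = std::chrono::steady_clock::now();
    cudaDeviceSynchronize();                             // blocks the host until the kernel has finished
    auto t2 = std::chrono::steady_clock::now();

    printf("launch call: %lld us, sync call: %lld us\n",
           (long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count(),
           (long long)std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count());

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}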
Suppose that we have the following program:
float *a, *b, *c;           // device pointers (allocation omitted)
time_T tic, toc, t_A, t_B;  // generic host-side wall-clock timer

tic = time();
kernel_A <<< ... >>> (a, b, c);
cudaDeviceSynchronize();    // wait until kernel_A has finished
toc = time(); t_A = toc - tic;

tic = time();
kernel_B <<< ... >>> (a, b, c);
cudaDeviceSynchronize();    // wait until kernel_B has finished
toc = time(); t_B = toc - tic;
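For reference, the GPU-only execution time of each kernel could also be measured with CUDA events. The following is just a sketch of how I would do that for kernel_A, continuing the program above (numBlocks and blockSize are placeholders for whatever launch configuration is used):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);                      // enqueued before the kernel
kernel_A <<< numBlocks, blockSize >>> (a, b, c);
cudaEventRecord(stop);                       // enqueued after the kernel
cudaEventSynchronize(stop);                  // wait until 'stop' has actually been reached on the GPU

float ms_A = 0.0f;
cudaEventElapsedTime(&ms_A, start, stop);    // GPU execution time of kernel_A, in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);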
Let us assume that kernel_B does the elementwise computation c = a + b and kernel_A does the same thing, except that it repeats the computation 10 times.
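In other words, the two kernels could look roughly like this (a sketch; the one-thread-per-element indexing and the extra parameter n are my own assumptions):

__global__ void kernel_B(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];           // single elementwise pass
}

__global__ void kernel_A(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        for (int k = 0; k < 10; ++k) {       // same computation, repeated 10 times
            c[i] = a[i] + b[i];
        }
    }
}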
Obviously, from our perspective, kernel_A should take longer to execute than kernel_B (i.e. t_A > t_B). The question is: why does it take longer to execute kernel_A?
According to the runtime formula derived from the nvprof output above, which states that Total Runtime = GPU Activities Runtime + CUDA API Runtime, there are three possible explanations:
- kernel_A has longer GPU Kernel Runtime.
- kernel_A has longer CUDA API Runtime (i.e. cudaDeviceSynchronize).
- kernel_A is longer in both components.
Second question: which of the above explanations is correct, and why?