Kernel Launch Time (CPU Time) Reported in Visual Profiler: how to optimize kernel launch


Kernel launch time is an old topic, but it has been haunting me for a long time as I try to optimize my CUDA code for the best performance.

The NVIDIA Visual Profiler is a very useful tool for profiling my code. However, the discussion/help on “GPU and CPU time” reported in the profiler is very brief. Often, you get a very small GPU time and a very large CPU time for a kernel call. The profiler help says that the CPU time is associated with kernel launch time. How can I optimize the code to minimize this CPU time?

As an example, let’s run the Visual Profiler on the “SimpleGL” sample application in the CUDA SDK. You will find that the CPU time ranges from 600 to 1200 for each kernel call (on my GTX 280 with CUDA 3.2, Win7), while the GPU time is always 27.8. So the GPU execution is really fast, but the CPU time for the kernel launch is very large.

Some people may think this is because the Visual Profiler launches kernels in blocking mode. Actually, when I use a CUDA timer in my code to record the performance (in non-blocking mode), I get similar results. (I can provide my code if you would like to see it.)
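To make the comparison concrete, here is a minimal sketch of how host-side launch time and GPU time can be measured separately for one kernel. It is an illustration under assumptions, not the original application code: `scaleKernel`, `LaunchTiming`, and `timeLaunch` are hypothetical names, the buffer size is arbitrary, and it assumes a toolkit newer than CUDA 3.2 (it uses `cudaDeviceSynchronize` and C++11 `std::chrono`).

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel standing in for the real workload.
__global__ void scaleKernel(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

struct LaunchTiming {
    double cpuLaunchMs;  // host time spent inside the asynchronous launch call
    float gpuMs;         // device-side time between the two recorded events
};

LaunchTiming timeLaunch(float *dData, int n) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    cudaDeviceSynchronize();  // drain earlier work so it is not counted

    cudaEventRecord(start, 0);
    auto t0 = std::chrono::steady_clock::now();
    scaleKernel<<<blocks, threads>>>(dData, n, 2.0f);  // non-blocking launch
    auto t1 = std::chrono::steady_clock::now();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // wait until the kernel and stop event finish

    LaunchTiming t;
    t.cpuLaunchMs = std::chrono::duration<double, std::milli>(t1 - t0).count();
    // Event interval: kernel execution plus any gap before it starts.
    cudaEventElapsedTime(&t.gpuMs, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return t;
}

int main() {
    const int n = 1 << 20;  // arbitrary size chosen for illustration
    float *dData = nullptr;
    if (cudaMalloc(&dData, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(dData, 0, n * sizeof(float));

    for (int i = 0; i < 5; ++i) {
        LaunchTiming t = timeLaunch(dData, n);
        printf("run %d: CPU launch %.3f ms, GPU %.3f ms\n",
               i, t.cpuLaunchMs, t.gpuMs);
    }

    cudaFree(dData);
    return 0;
}
```

Timing the launch call with a host clock and the kernel with events keeps the two numbers separate, which is the same split the profiler reports as CPU time and GPU time.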

Another confusing finding is that when I call three kernels in a loop, it is the first kernel that has a significant CPU time. The other two are fine.
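A minimal sketch of this three-kernel pattern, with each launch timed on the host, could look like the following. The kernels `kernelA`, `kernelB`, `kernelC` and the helper `timeThreeLaunches` are hypothetical stand-ins, not the kernels from the application, and the sketch again assumes a toolkit with `cudaDeviceSynchronize` and C++11 support. It only reproduces the measurement; it does not explain or fix the difference.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Three hypothetical kernels; the real ones would do the actual work.
__global__ void kernelA(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}
__global__ void kernelB(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}
__global__ void kernelC(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] -= 1.0f;
}

typedef std::chrono::steady_clock Clock;

// Milliseconds elapsed on the host since t0.
static double msSince(Clock::time_point t0) {
    return std::chrono::duration<double, std::milli>(Clock::now() - t0).count();
}

// Launches the three kernels back to back and stores the host time spent
// in each launch call in outMs[0..2].
void timeThreeLaunches(float *dData, int n, double outMs[3]) {
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    Clock::time_point t0 = Clock::now();
    kernelA<<<blocks, threads>>>(dData, n);
    outMs[0] = msSince(t0);

    t0 = Clock::now();
    kernelB<<<blocks, threads>>>(dData, n);
    outMs[1] = msSince(t0);

    t0 = Clock::now();
    kernelC<<<blocks, threads>>>(dData, n);
    outMs[2] = msSince(t0);

    cudaDeviceSynchronize();  // finish this iteration before the next one
}

int main() {
    const int n = 1 << 20;  // arbitrary size chosen for illustration
    float *dData = nullptr;
    if (cudaMalloc(&dData, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(dData, 0, n * sizeof(float));

    for (int iter = 0; iter < 5; ++iter) {
        double ms[3];
        timeThreeLaunches(dData, n, ms);
        printf("iter %d: A %.3f ms, B %.3f ms, C %.3f ms (host launch time)\n",
               iter, ms[0], ms[1], ms[2]);
    }

    cudaFree(dData);
    return 0;
}
```

Comparing the three columns across iterations shows whether the extra host time is tied to the first launch in each iteration or only to the very first launch of the program.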

So, my questions are:

  1. What is really happening during this CPU time period?
  2. How can I optimize CUDA code to minimize this CPU time?

Hope I can get some constructive suggestions. Thank you!