cudaLaunch overheads

Hello! I’m new to CUDA and I need some help.

I’m profiling my CUDA program with nvprof. (The program uses multiple OpenMP threads.)

nvprof reports that cudaLaunch takes between 280 ns (min) and 58 ms (max).

Can you explain what makes the difference between the minimum time and the maximum time?

How can I reduce the cudaLaunch time?


(1) What’s the operating system?
(2) What’s the GPU?
(3) If Windows, WDDM driver or TCC driver?

The minimal overhead of a kernel launch is around 5 usec, which is achievable on Linux and on Windows with the TCC driver. On Windows with the WDDM driver (the default), launches are batched to mitigate the massive overhead imposed by the WDDM driver model. That causes significant fluctuations in overhead and can drive the overhead of some launches to 50 usec or so.

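Launch times of 58 milliseconds would seem to indicate saturation of the launch queue (which is quite deep): once the queue is full, the launch call blocks on the host until the GPU has worked off an earlier kernel and a slot frees up. At least that’s the best idea I have right now, without access to your data.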
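
If you want to check this on your own machine, here is a minimal sketch of how one might time the host side of the launch call. It is my own illustration, not taken from your program or from an NVIDIA sample: the names spin_kernel, time_launches and LaunchStats are made up, and the launch count (5000) and spin length (1,000,000 cycles) are assumptions chosen only so that the back-to-back case likely exceeds the queue depth. The first run synchronizes after every launch, so the queue never fills; the second run launches back to back, so some launch calls should stall.

```cuda
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper kernel: spins for roughly `cycles` GPU clock ticks.
// With cycles == 0 it returns immediately, so only launch overhead remains.
__global__ void spin_kernel(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

struct LaunchStats { double min_us; double max_us; };

// Times only the host-side launch call (not the kernel execution) for n launches.
// sync_each == true drains the GPU after every launch, so the queue stays empty.
LaunchStats time_launches(int n, long long cycles, bool sync_each)
{
    using clk = std::chrono::steady_clock;
    LaunchStats s = {1e30, 0.0};
    for (int i = 0; i < n; ++i) {
        auto t0 = clk::now();
        spin_kernel<<<1, 1>>>(cycles);
        auto t1 = clk::now();
        double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
        s.min_us = std::min(s.min_us, us);
        s.max_us = std::max(s.max_us, us);
        if (sync_each) cudaDeviceSynchronize();
    }
    cudaDeviceSynchronize();  // wait for all outstanding kernels before returning
    return s;
}

int main()
{
    cudaFree(0);                 // force context creation before timing anything
    time_launches(10, 0, true);  // warm-up so one-time setup is not measured

    // Case 1: empty queue, trivial kernel.
    LaunchStats idle = time_launches(1000, 0, true);
    // Case 2: many long-running kernels launched back to back (assumed to
    // exceed the launch queue depth, so some launch calls block on the host).
    LaunchStats full = time_launches(5000, 1000000, false);

    printf("empty queue: min %.2f us, max %.2f us\n", idle.min_us, idle.max_us);
    printf("full queue:  min %.2f us, max %.2f us\n", full.min_us, full.max_us);
    return cudaGetLastError() == cudaSuccess ? 0 : 1;
}
```

Build with nvcc and compare the two max values. If the second one is orders of magnitude larger, the large cudaLaunch times in your profile are probably queue back-pressure rather than launch overhead proper.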

To minimize launch overhead:

(1) Use Linux, or Windows with the TCC driver (only some GPUs are supported!)
(2) Use a CPU with high single-thread performance, as the software portion of the launch overhead is serial CPU work. CPUs with a base frequency >= 3.5 GHz will work well.

(1) The operating system is Linux.
(2) The GPU is a GeForce GTX 1080.

I think the overhead is inevitable, because a single-threaded version of the program incurs even more overhead.

Thanks for your reply.