cost of launching a new grid

What is the cost of launching a new grid, and what factors determine that cost? Is it significant?

The minimal launch overhead for a null kernel (a kernel with no arguments that performs no work) is about 5 microseconds. At 5 microseconds per launch, you will not be able to launch more than about 200,000 kernels per second, and that is the best case. If you use a Windows platform with a WDDM driver, you will likely observe launch latency that varies widely and is also higher on average (say, 10-15 microseconds). This is due to the high overhead inherent in WDDM (a driver model defined by Microsoft).

On top of the 5 microseconds there is a small amount of additional software overhead that depends on the complexity of the kernel launch. Basically, the CUDA runtime and the driver need to translate the launch configuration and kernel arguments from the application into commands for the GPU that get stuffed into a command buffer. This is host-side overhead, and it is primarily a function of single-thread CPU performance, which is why CPUs with a high base clock are useful for minimizing the overhead from this serial code.
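If you want to check these numbers on your own system, here is a rough sketch of how one might measure them. It times a long run of back-to-back launches of a null kernel and of a kernel with a larger argument list, then reports the average time per launch. The names `null_kernel`, `args_kernel`, `Payload`, and `avg_launch_us`, as well as the iteration counts, are made up for illustration. Note that this measures sustained launch throughput (launches are asynchronous and queue up), not the latency of a single isolated launch, so treat the result as an estimate.

```cuda
// Build (assumed): nvcc -O2 -o launch_overhead launch_overhead.cu
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical payload whose only purpose is to enlarge the argument list.
struct Payload { float values[64]; };  // 256 bytes, passed by value

// Null kernel: no arguments, no work.
__global__ void null_kernel() {}

// Same empty body, but the host must still marshal these arguments per launch.
__global__ void args_kernel(Payload p, int n, float *out) {}

static void check(cudaError_t err) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// Average wall-clock time per launch in microseconds over `iters` launches.
template <typename Launch>
double avg_launch_us(Launch launch, int iters) {
    for (int i = 0; i < 1000; ++i) launch();  // warm-up, excluded from timing
    check(cudaDeviceSynchronize());

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) launch();
    check(cudaDeviceSynchronize());           // wait until all launches finish
    auto t1 = std::chrono::steady_clock::now();

    check(cudaGetLastError());                // surface any launch failure
    return std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
}

int main() {
    const int iters = 100000;  // arbitrary; larger values smooth out noise
    Payload p{};
    float *out = nullptr;      // never dereferenced by the kernel

    double t_null = avg_launch_us([] { null_kernel<<<1, 1>>>(); }, iters);
    double t_args = avg_launch_us([&] { args_kernel<<<1, 1>>>(p, 42, out); }, iters);

    std::printf("null kernel:      %.2f us/launch (%.0f launches/s)\n",
                t_null, 1e6 / t_null);
    std::printf("kernel with args: %.2f us/launch (%.0f launches/s)\n",
                t_args, 1e6 / t_args);
    return 0;
}
```

Dividing one second by the measured time per launch gives the launch-rate ceiling, which is the same arithmetic as 1 s / 5 microseconds = 200,000 launches per second. On a WDDM system, expect the numbers to fluctuate noticeably from run to run.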

By observation, kernel launches from device code (“dynamic parallelism”) are no faster than kernel launches from host code.

Wow, that is a lot of substantial detail. Thanks so much!

Top of Google for ‘cuda when to launch a new grid’, and exactly the information I wanted. Thanks!