What happens when the CPU keeps launching kernels without waiting, overloading the GPU?

I have a use case where all kernels aggregate their results into the same memory address. The CPU keeps launching kernels without waiting in between; only once all kernels have finished does it read the result address.

Since kernel launches are non-blocking, the CPU will inevitably overload the GPU when used this way. What is the expected consequence on the GPU and CPU side?

What I currently observe is that the CPU seems to be somewhat busy, and the overall run time gets longer.

Thanks for the help!

The command queue into which the host emits kernel launches has a finite length. If the queue is full, a kernel launch becomes blocking, as the host has to wait for space in the queue to free up so the next kernel launch can be placed into the queue.
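If you want to see this on your own system, a rough sketch along the following lines might help. It times each launch call on the host while the GPU is kept busy; once the queue fills, the per-launch host time should jump from a few microseconds to roughly the kernel duration. The names `spin_kernel`, `probe_launch_latency` and `first_slow_launch` are made up for this illustration, and the cycle count and threshold are assumptions you would need to tune for your GPU and driver.

```cuda
#include <chrono>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: busy-waits for about `cycles` GPU clock ticks so that
// the host can easily issue launches faster than the GPU retires them.
__global__ void spin_kernel(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

// Launch `n` kernels back to back without synchronizing in between and
// record how long each launch call takes on the host, in microseconds.
std::vector<double> probe_launch_latency(int n, long long cycles)
{
    using host_clock = std::chrono::steady_clock;
    std::vector<double> host_us(static_cast<size_t>(n));

    // Warm-up launch so context creation is not counted in the measurements.
    spin_kernel<<<1, 1>>>(cycles);
    cudaDeviceSynchronize();

    for (int i = 0; i < n; ++i) {
        auto t0 = host_clock::now();
        spin_kernel<<<1, 1>>>(cycles);   // normally returns right away
        auto t1 = host_clock::now();
        host_us[i] = std::chrono::duration<double, std::micro>(t1 - t0).count();
    }
    cudaDeviceSynchronize();             // drain everything still queued
    return host_us;
}

// Index of the first launch whose host-side time exceeds `threshold_us`,
// or -1 if no launch was that slow.
int first_slow_launch(const std::vector<double>& host_us, double threshold_us)
{
    for (size_t i = 0; i < host_us.size(); ++i)
        if (host_us[i] > threshold_us) return static_cast<int>(i);
    return -1;
}
```

Calling `probe_launch_latency(4096, 150000)` and then `first_slow_launch(times, 50.0)` from `main` would give a crude estimate of how many launches were accepted before the host started to stall. Treat the number as indicative only: OS scheduling noise can produce isolated slow launches, and drivers that batch launches (for example WDDM on Windows) can behave differently.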

In the past, people have reported in these forums on experiments they created to determine the depth of the GPU command queue. I seem to recall that they observed a depth on the order of 1000 kernel launches. Experiments over the years have shown that the maximum launch rate of null kernels holds steady at around 200,000 per second across multiple GPU generations, i.e. this is the maximum rate at which the GPU can drain the command queue.

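The GPU command queue can, like any finite-sized buffer, only absorb a temporary imbalance between the producing and consuming sides. If the host keeps producing launches faster than the GPU can retire them, the queue stays full and the host ends up running at the GPU's pace. Generally one would want to strive for kernel execution times in the millisecond rather than microsecond range, so that the launch overhead (about 5 microseconds per launch at 200,000 launches per second) is negligible compared to the work each kernel does.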
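For the use case in the question (many kernels adding into one address), one way to get longer-running kernels is to fold the work of many tiny launches into a single launch. The following is only a sketch under assumptions: it supposes the per-kernel work can be expressed as summing an array of integers, and the names `accumulate` and `sum_with_one_launch` as well as the 256 x 256 launch configuration are invented for illustration. Error checking is omitted for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Each thread walks the input with a grid-stride loop, keeps a private
// partial sum, and issues one atomicAdd to the shared result address.
__global__ void accumulate(const unsigned int* data, size_t n,
                           unsigned long long* result)
{
    unsigned long long local = 0;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += stride)
        local += data[i];
    atomicAdd(result, local);
}

// Sum all elements of `host` on the GPU using a single kernel launch.
unsigned long long sum_with_one_launch(const std::vector<unsigned int>& host)
{
    size_t n = host.size();
    unsigned int* d_data = nullptr;
    unsigned long long* d_result = nullptr;
    unsigned long long h_result = 0;

    cudaMalloc(&d_data, n * sizeof(unsigned int));
    cudaMalloc(&d_result, sizeof(unsigned long long));
    cudaMemcpy(d_data, host.data(), n * sizeof(unsigned int),
               cudaMemcpyHostToDevice);
    cudaMemset(d_result, 0, sizeof(unsigned long long));

    // One launch covers the whole input instead of one launch per chunk.
    accumulate<<<256, 256>>>(d_data, n, d_result);

    // This copy waits for the kernel to finish before reading the result.
    cudaMemcpy(&h_result, d_result, sizeof(unsigned long long),
               cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    cudaFree(d_result);
    return h_result;
}
```

With this structure the host issues one launch and one blocking copy, so the command queue never comes close to filling up, and the number of atomic operations on the result address equals the number of threads rather than the number of input elements.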


Thanks! Very accurate description.