I was wondering how kernels are actually scheduled, and whether it is possible for a GPU to have multiple kernels running on it at the same time.
Also, in the case of dynamic parallelism (threads individually launching kernels), how are those launches scheduled, and wouldn't it blow up? (For example, if a kernel with 1024 threads each called a kernel in its code.)
Thanks.
Yes, it's possible. Keeping things fairly simple, a GPU kernel gets scheduled (block by block) when there are sufficient resources for it at the SM level, and assuming the block scheduler doesn't have other work that it chooses to do first. Nothing in that statement precludes blocks from different kernels being scheduled on the same GPU, even on the same SM.
Beyond that, there are numerous writeups of various questions and aspects of GPU block scheduling on various forums; here is an example. Furthermore, CUDA provides a concurrentKernels sample code that allows basic inspection.
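If you want a quick experiment along those lines, here is a minimal sketch (not the actual sample, just the idea it demonstrates): two deliberately small kernels launched into separate non-default streams, so the block scheduler is free to interleave their blocks if SM resources allow. The kernel names and sizes here are just illustrative.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Two tiny kernels; neither fills the GPU by itself, leaving room for
// blocks of the other kernel to be resident at the same time.
__global__ void kernelA(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

__global__ void kernelB(float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += 1.0f;
}

int main() {
    const int n = 1 << 16;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    // Launches in different non-default streams may run concurrently;
    // launches in the same stream are serialized.
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    kernelA<<<(n + 255) / 256, 256, 0, s1>>>(x, n);
    kernelB<<<(n + 255) / 256, 256, 0, s2>>>(y, n);

    cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

Whether the two kernels actually overlap depends on the resources each one needs; a profiler such as Nsight Systems will show you the timeline.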
Regarding what happens in CDP, to a first-order approximation, launching 1024 kernels on the device, one from each of 1024 threads, is no different than attempting to launch 1024 kernels on the host (perhaps one from each of 1024 host threads, if you wish). The GPU maintains work queues, and kernels go into those queues for processing.
There is one difference. On the host side, when the queues are "full", the kernel launch process switches from asynchronous to synchronous (each launch effectively waits for a queue slot), so the process becomes self-throttling. AFAIK there is no equivalent self-throttling process on the device side. It is up to the device code programmer to make sure they don't exceed the available queue depth, and CUDA provides both an indication of this as well as an explicit error type. For additional background, I suggest reading the section on CDP in the programming guide. You may specifically wish to pay attention to pending launch limits.
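To make that last point concrete, here is a rough sketch of the pattern described in the CDP section of the programming guide (kernel names and the limit value are just illustrative): the host raises the pending launch count before running the parent kernel, and the device code checks for an error after each child launch, since an over-full launch buffer is reported as a launch error (typically cudaErrorLaunchPendingCountExceeded) rather than by blocking.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void childKernel(int parentThread) {
    // trivial child work for illustration
    (void)parentThread;
}

// Each thread of the parent launches one child kernel. With 1024 parent
// threads this can enqueue up to 1024 pending launches, so the pending
// launch limit must be large enough, or the launch fails with an error.
__global__ void parentKernel() {
    childKernel<<<1, 32>>>(threadIdx.x);
    cudaError_t err = cudaGetLastError();   // device-side error check
    if (err != cudaSuccess) {
        printf("child launch failed (block %d, thread %d): %d\n",
               blockIdx.x, threadIdx.x, (int)err);
    }
}

int main() {
    // The default pending launch count is 2048 per the programming guide;
    // raise it if the parent grid may enqueue more than that at once.
    cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 4096);

    parentKernel<<<1, 1024>>>();
    cudaDeviceSynchronize();
    printf("host-side status: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}
```

Note that CDP code has to be compiled with relocatable device code enabled (nvcc -rdc=true) and linked against the device runtime.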