Deep dive in concurrent kernel launches

isaaclee2313 · February 2, 2019, 2:14pm

How does a GPU determine whether two kernels “A” and “B” can run concurrently (assuming they are computationally independent)? That is, which resources does the GPU inspect when deciding whether to allow concurrent kernel execution (e.g. SMs, required shared memory, …)? I know that it doesn’t look at global memory, since what often happens is that two kernels end up trying to allocate more than the available DRAM and a fatal error is raised.
For example, let’s assume process A requires 6 SMs, process B requires 15 SMs, and we only have a GTX 1080, which has 20 SMs. However, because process B was very poorly designed, only 10 SMs do computation while the other 5 SMs wait idly for those 10 to finish. If we first launch process A and then attempt to concurrently launch process B, is our GPU smart enough to realize that there is no computation in the blocks that occupy those 5 SMs, and therefore allow both processes to run concurrently?
For nested kernel launches, if process A tries to launch a process B, how does the GPU determine whether this is viable? Is the same scheme applied as in the answer to question 1?

Robert_Crovella · February 2, 2019, 2:57pm

The thought process here is similar to one of occupancy. I suggest you study occupancy and how it determines and limits the number of blocks that can be simultaneously resident on a SM. This sort of capacity consideration is one of the requirements for kernel concurrency. The availability of “room” on a SM for more blocks to be scheduled is one of the factors to determine the possibility of kernel concurrency.
A process can’t “require” a certain number of SMs. That’s not how the CUDA execution model works. A block (that is scheduled on a SM) uses a relatively fixed set of resources (registers per thread times number of threads, static + dynamic shared memory allocated, block slots, warp slots, etc.) regardless of what it is doing or not doing.
For nested kernel launches i.e. CUDA Dynamic Parallelism (CDP), the number of nested launches outstanding adheres to a specific limit. An outstanding launch does not necessarily mean it is executing - i.e. it does not necessarily mean that the GPU block scheduler has scheduled one or more of its blocks on specific SMs. The GPU will go to special lengths to ensure the completion of child kernel launches, so that parent kernels that launched them (and are therefore dependent on their completion) can also complete. This includes the possibility of preemption - the removal of a block executing on a SM to make room for a child kernel block. Preemption is not typical on GPUs but does happen under some circumstances, one of those being CDP. I suggest you read the CDP section in the programming guide.
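To make the occupancy reasoning in the first two points concrete, here is a rough host-side sketch in plain C++ (no GPU required). It estimates how many blocks of a kernel fit on one SM and whether a block of a second kernel still fits in the leftover room. The struct and function names are hypothetical, and the per-SM limits are assumed values for a compute capability 6.1 part such as the GTX 1080; verify them for your device. The model ignores register and shared memory allocation granularity, so treat the result as an upper bound, not as what the hardware scheduler will actually do.

```cpp
#include <algorithm>

// Per-SM limits. Defaults are ASSUMED values for compute capability 6.1
// (e.g. GTX 1080); check them against your device's properties.
struct SmLimits {
    int maxThreads  = 2048;   // resident threads per SM
    int maxBlocks   = 32;     // resident blocks (block slots) per SM
    int registers   = 65536;  // 32-bit registers per SM
    int sharedBytes = 98304;  // shared memory per SM (96 KB)
};

// Fixed per-block footprint of a kernel (hypothetical type for illustration).
struct BlockFootprint {
    int threads;        // threads per block
    int regsPerThread;  // registers per thread
    int sharedBytes;    // static + dynamic shared memory per block
};

// Upper bound on how many blocks of kernel k fit into the free resources.
// Simplification: allocation granularity is ignored.
int blocksThatFit(const SmLimits& freeRes, const BlockFootprint& k) {
    int byThreads = freeRes.maxThreads / k.threads;
    int byRegs    = freeRes.registers / (k.regsPerThread * k.threads);
    int byShared  = k.sharedBytes > 0 ? freeRes.sharedBytes / k.sharedBytes
                                      : freeRes.maxBlocks;
    return std::min({freeRes.maxBlocks, byThreads, byRegs, byShared});
}

// Resources left on an SM once n blocks of kernel k are resident.
// A resident block holds these resources whether or not it is doing work.
SmLimits remaining(SmLimits sm, const BlockFootprint& k, int n) {
    sm.maxThreads  -= n * k.threads;
    sm.maxBlocks   -= n;
    sm.registers   -= n * k.regsPerThread * k.threads;
    sm.sharedBytes -= n * k.sharedBytes;
    return sm;
}
```

For example, with a kernel A of 256 threads per block and 32 registers per thread, `blocksThatFit(SmLimits{}, A)` gives 8, and once 8 such blocks are resident, `remaining(...)` leaves no threads or registers for any block of another kernel on that SM, idle or not. On a real device, `cudaOccupancyMaxActiveBlocksPerMultiprocessor` reports the per-kernel figure directly.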

My own opinions:
In practice, kernel concurrency is hard to witness. It requires a carefully controlled set of conditions which are not typical of efficient CUDA kernel launches. I consider aiming for kernel concurrency to be mostly a misguided idea and a fool’s errand, unless you are well beyond the exploratory stages of CUDA programming, and have the concepts you are asking about mastered. Even then, designing for kernel concurrency only makes sense in certain kinds of work-issuance scenarios.
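One way to see why concurrency is hard to witness is simple arithmetic, sketched below in host-only C++. The function name is hypothetical and the model is an assumption on my part: it compares the grid size against the number of blocks the whole device can hold at once. A typical efficient launch uses a grid far larger than that capacity, so no room is left for another kernel's blocks until the first kernel starts to drain.

```cpp
#include <algorithm>

// Block slots left idle across the device by one kernel launch.
//   gridBlocks  : number of blocks in the launched grid
//   blocksPerSm : max resident blocks per SM for this kernel (its occupancy)
//   numSms      : number of SMs on the device (20 on a GTX 1080)
// Simplification: slots are counted in units of this kernel's own blocks;
// a second kernel with a different footprint may fit more or fewer blocks.
int idleBlockSlots(int gridBlocks, int blocksPerSm, int numSms) {
    int capacity = blocksPerSm * numSms;  // blocks resident at the same time
    return std::max(0, capacity - gridBlocks);
}
```

With 8 blocks per SM on 20 SMs, the device holds 160 blocks at once. A 4096-block grid leaves `idleBlockSlots(4096, 8, 20) == 0`, while a small 40-block grid leaves 120 slots that another kernel could in principle use, provided the other conditions for concurrency are met as well.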

njuffa · February 2, 2019, 3:44pm

I second those opinions.

isaaclee2313 · February 3, 2019, 4:52am

Thanks for the helpful answers, njuffa and Robert_Crovella.

So the gist seems to be: focus on how to utilize all the GPU’s resources efficiently with a single kernel launch instead of trying to have multiple kernels up at the same time.
