About the allocation of resources to possible parallel kernels

Hi,

I had a program with 4 possible kernels (one of them was a large GEMM). I expected to see parallel execution of those kernels, but only in some rare cases were they running in parallel. Now, by decomposing the large GEMM into 3 micro GEMMs and one large GEMM, I am seeing that in most cases the kernels are running in parallel.

I have two questions.
First, what is the name of the manager, scheduler, or run-time system that allocates resources to a kernel?

Second, what criteria are used to allocate resources to kernels? Why was parallel execution rare before, and why does it now happen in most cases, but with a drop in performance? I want to understand the process of allocating resources and assigning them to a kernel.

I would be grateful if you could point me to some documents or papers on this matter.

This system isn’t exposed to the CUDA programmer, and there are no documented mechanisms for fine-grained control over it. The criteria used to schedule blocks are not published.

The CUDA block scheduler will deposit a block of its choosing, from amongst those that are waiting, when it finds an SM that has enough available resources (e.g. warp slots, registers, shared memory, etc.). The CUDA programmer doesn’t have any direct control over this.
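If you want a rough picture of the resources a particular kernel's blocks require, the occupancy API reports how many blocks of that kernel can be resident on one SM. A minimal sketch (the kernel `busy_kernel` and the block size of 256 below are placeholders, not the GEMM from the question):

```cpp
// Minimal sketch: query how many blocks of a given kernel fit on one SM,
// based on its register/shared-memory/thread requirements.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void busy_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int blocksPerSM = 0;
    // 256 threads per block, no dynamic shared memory in this example
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, busy_kernel, 256, 0);

    printf("SMs: %d, resident blocks of this kernel per SM: %d\n",
           prop.multiProcessorCount, blocksPerSM);
    printf("A launch with more than %d blocks can fill the whole GPU by itself.\n",
           prop.multiProcessorCount * blocksPerSM);
    return 0;
}
```

Once a kernel has more waiting blocks than that total, the block scheduler has nowhere to place blocks from another kernel until some blocks of the first one retire.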

The key thing to keep in mind here is that the GPU has limited resources. When you launch a large (threads, shared memory, registers, etc.) kernel, it “occupies” the GPU, and the reason you don’t see overlapping execution is that there is no “space” for additional work.
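As a rough illustration (not the original program), two kernels launched into separate non-default streams are allowed to overlap, but they will only actually run concurrently if the first launch leaves enough free resources on the SMs for blocks of the second one. A minimal sketch, assuming two deliberately small kernels:

```cpp
// Minimal sketch: whether the two launches overlap depends on the free SM
// resources left by the first one, not on anything the programmer sets.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void busy_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < 10000; ++k)   // artificial work so the kernel runs long enough to observe overlap
            v = v * 1.0000001f + 0.0000001f;
        data[i] = v;
    }
}

int main()
{
    const int n = 1 << 16;                // small grid: leaves room on the GPU
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Different streams make concurrent execution *possible*; the block
    // scheduler decides whether it actually happens.
    busy_kernel<<<(n + 255) / 256, 256, 0, s1>>>(a, n);
    busy_kernel<<<(n + 255) / 256, 256, 0, s2>>>(b, n);

    cudaDeviceSynchronize();
    printf("done: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

Profiling such a run (e.g. in Nsight Systems) will show on the timeline whether the two launches actually overlapped.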

Breaking a large kernel up into pieces, without other considerations, is usually not a strategy to increase performance.