Parallel execution of GEMM with other operations

Here I am using 2 separate CUDA streams. It seems that only during the initialisation of GEMM, and again as GEMM is finishing, do I have the opportunity to run another kernel in parallel. Why is that? Why doesn't the runtime keep running the other light kernel as well? And why do we only see parallel execution during initialisation?

Even after reducing the grid size of the other kernel to 1, I am not seeing parallel execution with GEMM.

The large kernel is GEMM.
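
For reference, here is a minimal sketch of the kind of setup being described (not the poster's actual code): a cuBLAS GEMM issued on one stream and a small "light" kernel issued on a second stream. The kernel name, sizes, and launch configuration are illustrative placeholders.

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Trivial placeholder for the "light" kernel in the other stream.
__global__ void lightKernel(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

void launchBoth(cublasHandle_t handle,
                const float* A, const float* B, float* C, int N,
                float* x, int n)
{
    cudaStream_t sGemm, sLight;
    cudaStreamCreate(&sGemm);
    cudaStreamCreate(&sLight);

    const float alpha = 1.0f, beta = 0.0f;

    // GEMM issued on its own stream.
    cublasSetStream(handle, sGemm);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                N, N, N, &alpha, A, N, B, N, &beta, C, N);

    // Light kernel issued on the other stream. Even with a grid of 1 block,
    // it can only run concurrently while the GEMM leaves SM resources free,
    // i.e. at the very start and very end of the GEMM execution.
    lightKernel<<<1, 256, 0, sLight>>>(x, n);

    cudaDeviceSynchronize();
    cudaStreamDestroy(sGemm);
    cudaStreamDestroy(sLight);
}
```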

Once the GEMM kernel “gets going” it is using all the resources of the GPU, preventing any other kernels from executing concurrently. There is effectively no “space” on the GPU to run other kernels at those points in time.

At the very beginning of the GEMM kernel execution, it has not started using all the resources yet, so there is space for other activity. Likewise, at the very end, the GEMM kernel begins releasing resources, opening up "space" on the GPU for another kernel to execute.

You can find many questions on various forums that discuss concurrent kernel execution.

Thanks Robert. So letting GEMM use all of the GPU's resources and running the other operations sequentially will give better performance than dedicating a share of the resources (e.g. 10%, or 1 SM) to the other stream and kernel and running them in parallel?

How can I state this argument mathematically and precisely? Something like a proof.

Is there any option for GEMM to limit the resources it uses?

I didn’t say that. I’m trying to help you explain the behavior you witnessed. The GPU block scheduler chooses work according to its own, unpublished heuristic, and as far as I know there is no requirement or specification that it choose new blocks to schedule from any particular kernel, when multiple kernel launches are outstanding. So your behavior is plausible, and, in my experience, typical. We could presume that the GPU designers are generally desiring to create a machine that delivers high performance in many scenarios, but extrapolating those very general statements to a specific claim like this about a specific, but loosely defined workflow, is not sensible in my opinion.

Not sure what you mean. In any event, I can't make any definitive statements about proving anything about a workload that you haven't defined beyond "a GEMM kernel and another kernel", with an undefined definition of "performance". Anyway, I don't think I would be able to respond to further questions about this.

You could investigate:

  1. Stream priorities - this essentially gives you some control over block scheduler behavior
  2. CUDA MPS (That would require you to launch the work in two separate processes, and to use the active thread percentage feature - probably not very convenient.)

I do not know if or how well either of the above mechanisms would work for your case; I have not tried them with it.
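
For what it's worth, here is a rough sketch of how option 1 (stream priorities) might be set up, assuming the light kernel is the one you want the block scheduler to prefer. Note that priorities are only a hint to the scheduler, not a hard resource partition.

```cpp
#include <cuda_runtime.h>

void createPrioritizedStreams(cudaStream_t* gemmStream, cudaStream_t* lightStream)
{
    // Numerically lower values mean higher priority; the "greatest" priority
    // returned here is the highest-priority value the device supports.
    int leastPriority = 0, greatestPriority = 0;
    cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);

    // Give the light kernel's stream the highest available priority and the
    // GEMM stream the lowest, so freed SMs tend to pick up the light kernel.
    cudaStreamCreateWithPriority(lightStream, cudaStreamNonBlocking, greatestPriority);
    cudaStreamCreateWithPriority(gemmStream,  cudaStreamNonBlocking, leastPriority);
}
```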

There isn’t anything I know of that pertains to the GEMM function call itself that allows you to arbitrarily limit the resources it will use. You have access to the BLAS reference manual, so I’m pretty sure you can discover this for yourself with a bit of research. You may find something there that serves your purpose that I hadn’t considered.

A BLAS GEMM operation can be decomposed into a set of smaller operations. You could do that manually, yourself, if you wished. In so doing, by reducing each operation to a very small level, you could eventually end up with small kernels that don’t saturate the GPU (albeit having to launch potentially many of them). In such a way you could “indirectly limit” the resources used. Such an approach strikes me as foolishness/madness, but I mention it because it is perhaps another possible way to limit the resource consumption of the BLAS operation.
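
As a rough illustration of that decomposition idea only (a sketch, with panelWidth as an arbitrary tuning parameter, not a recommended value): split C into column panels (cuBLAS is column-major) and issue one smaller GEMM per panel. Each panel GEMM uses fewer blocks and so may leave room for other kernels, at the cost of many launches and likely lower overall GEMM throughput.

```cpp
#include <algorithm>
#include <cstddef>
#include <cublas_v2.h>

// Compute C = A * B (column-major, C is m x n, A is m x k, B is k x n)
// as a sequence of smaller GEMMs over column panels of B and C.
void panelledGemm(cublasHandle_t handle, cudaStream_t stream,
                  const float* A, const float* B, float* C,
                  int m, int n, int k, int panelWidth)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasSetStream(handle, stream);

    for (int j = 0; j < n; j += panelWidth) {
        int w = std::min(panelWidth, n - j);
        // C[:, j:j+w] = A * B[:, j:j+w]; in column-major storage the panel
        // starts j * (leading dimension) elements into B and C.
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    m, w, k, &alpha,
                    A, m,
                    B + static_cast<std::size_t>(j) * k, k,
                    &beta,
                    C + static_cast<std::size_t>(j) * m, m);
    }
}
```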