Regarding GEMM Behavior

Hi,
I need some information on the behavior of GEMM. From what I see, it occupies all of the shared memory and register files during execution, but this does not happen immediately. It takes a short time, and the GPU can start and run another small kernel during this time.

This is just my observation, and I would like to be able to explain it. Could you please give me some technical justification?

GEMM isn’t a single, unique thing: there is more than one GEMM implementation, and a GEMM kernel is not fundamentally unlike other CUDA code. So imagining that there is only one possible behavior to describe might not make sense.

It might be useful for you to explain the context of your observation. What code were you running? How did you observe the kernel launch behavior?

Having said that, I could make a few general statements. Any non-naive GEMM implementation will probably use both shared memory and registers significantly. But more importantly, a large enough GEMM would probably use all the available warp slots in the SMs.
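One way to check whether a particular launch is "large enough" in this sense is to ask the runtime how many blocks of the kernel fit on one SM and compare the grid size against the total. The following is only a minimal sketch: `gemm_tile_kernel` is a hypothetical stand-in for whatever GEMM you are actually running (it is not what cuBLAS launches internally), and `TILE`, `N` and `resident_block_capacity` are names made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int TILE = 16;

// Hypothetical tiled GEMM (C = A * B, square N x N, row-major).
// Assumes N is a multiple of TILE. It uses shared memory and registers,
// like any non-naive GEMM, but is not tuned in any way.
__global__ void gemm_tile_kernel(const float *A, const float *B, float *C, int N)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < N / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}

// Host-only helper: number of blocks that can be resident on the whole GPU at once.
long long resident_block_capacity(int num_sms, int blocks_per_sm)
{
    return static_cast<long long>(num_sms) * blocks_per_sm;
}

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::printf("no usable CUDA device\n");
        return 0;
    }

    // Ask the runtime how many blocks of this kernel fit on one SM,
    // given its register and shared memory usage (no dynamic shared memory here).
    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, gemm_tile_kernel,
                                                  TILE * TILE, 0);

    const int N = 4096;  // assumed problem size, for illustration only
    long long grid_blocks = static_cast<long long>(N / TILE) * (N / TILE);
    long long capacity = resident_block_capacity(prop.multiProcessorCount, blocks_per_sm);

    std::printf("grid blocks: %lld, resident capacity: %lld\n", grid_blocks, capacity);
    std::printf("%s\n", grid_blocks >= capacity
                            ? "grid can fill the machine"
                            : "grid leaves room for other kernels");
    return 0;
}
```

If the grid is much larger than the resident capacity, the kernel falls into the "fills the machine" case discussed here. If it is smaller, some SM resources stay free for the whole run.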

In that case, the only opportunity for anything else to happen, before the machine has become saturated or “filled”, would be during the time in which the CWD (compute work distributor), or block scheduler, is distributing blocks from the GEMM kernel.

  1. In my experience, this is a very short period of time.
  2. To posit that another kernel could begin and run in that time is to say that the other kernel is somehow getting priority attention from the CWD. Possible, but generally unspecified (yes, you could use stream priorities. Let’s leave that aside.)
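For reference only, the stream-priority mechanism set aside in point 2 looks roughly like the sketch below. The kernels `big_kernel` and `small_kernel` and the helper `priorities_supported` are hypothetical placeholders. As far as I understand it, a priority only influences which pending blocks are picked as resources free up. It does not evict blocks that are already resident.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical long-running kernel standing in for a large GEMM.
__global__ void big_kernel(float *x, size_t n, int iters)
{
    size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (i < n) {
        float v = x[i];
        for (int k = 0; k < iters; ++k) v = v * 1.0001f + 0.5f;
        x[i] = v;
    }
}

// Hypothetical tiny kernel.
__global__ void small_kernel(int *flag) { *flag = 1; }

// Host-only helper: the runtime reports 0 for both values when the device
// does not support stream priorities.
bool priorities_supported(int least, int greatest) { return least != greatest; }

int main()
{
    int least = 0, greatest = 0;
    if (cudaDeviceGetStreamPriorityRange(&least, &greatest) != cudaSuccess) {
        std::printf("no usable CUDA device\n");
        return 0;
    }
    std::printf("priorities supported: %s\n",
                priorities_supported(least, greatest) ? "yes" : "no");

    // Numerically lower values mean higher priority.
    cudaStream_t low_prio, high_prio;
    cudaStreamCreateWithPriority(&low_prio, cudaStreamNonBlocking, least);
    cudaStreamCreateWithPriority(&high_prio, cudaStreamNonBlocking, greatest);

    const size_t n = static_cast<size_t>(1) << 24;
    float *x = nullptr;
    int *flag = nullptr;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&flag, sizeof(int));
    cudaMemset(x, 0, n * sizeof(float));

    // Big kernel in the low-priority stream, small kernel in the high-priority one.
    big_kernel<<<static_cast<unsigned>((n + 255) / 256), 256, 0, low_prio>>>(x, n, 1000);
    small_kernel<<<1, 1, 0, high_prio>>>(flag);
    cudaDeviceSynchronize();

    cudaFree(x);
    cudaFree(flag);
    cudaStreamDestroy(low_prio);
    cudaStreamDestroy(high_prio);
    return 0;
}
```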

In summary, I am skeptical of your observation for the case where the GEMM is large enough to “fill” the machine. And if you are referring to a GEMM small enough that it does not fill the machine, then it is certainly possible to run something else concurrently.

The wording of your first paragraph makes it seem like you are suggesting a kernel launch sequence where the GEMM kernel gets launched, and it somehow takes some time to “get started”, such that “the GPU can start and run another small kernel during this time”, in which case I am skeptical as I described above, for the case where the GEMM kernel is large enough.

If, on the other hand, you mean another kernel that is launched before the GEMM kernel launch, then of course it is possible for that kernel to run. For a period of time it has nothing to compete with, and after the GEMM kernel launch, the behavior of kernel block deposition is subject to the CWD, whose behavior is largely unspecified in this area.
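If you want to test the two launch orders yourself, something along these lines could serve as a starting point. It is a rough sketch under assumptions: `big_kernel` is a hypothetical stand-in for a machine-filling GEMM, `small_kernel` and `small_finished_first` are made-up names, and event timestamps taken across different streams are only approximate, so a profiler timeline (for example Nsight Systems) is the more trustworthy way to see what actually overlapped.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical long-running kernel standing in for a large GEMM.
__global__ void big_kernel(float *x, size_t n, int iters)
{
    size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (i < n) {
        float v = x[i];
        for (int k = 0; k < iters; ++k) v = v * 1.0001f + 0.5f;
        x[i] = v;
    }
}

// Hypothetical tiny kernel.
__global__ void small_kernel(int *flag) { *flag = 1; }

// Host-only helper: compares completion times measured from a common start event.
bool small_finished_first(float ms_small_end, float ms_big_end)
{
    return ms_small_end < ms_big_end;
}

int main()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::printf("no usable CUDA device\n");
        return 0;
    }

    const size_t n = static_cast<size_t>(1) << 26;  // many more blocks than can be resident
    float *x = nullptr;
    int *flag = nullptr;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&flag, sizeof(int));
    cudaMemset(x, 0, n * sizeof(float));

    cudaStream_t s_big, s_small;
    cudaStreamCreateWithFlags(&s_big, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&s_small, cudaStreamNonBlocking);

    cudaEvent_t start, big_end, small_end;
    cudaEventCreate(&start);
    cudaEventCreate(&big_end);
    cudaEventCreate(&small_end);

    // Order under test: big kernel first, then the small kernel in another stream.
    cudaEventRecord(start, s_big);
    big_kernel<<<static_cast<unsigned>((n + 255) / 256), 256, 0, s_big>>>(x, n, 2000);
    cudaEventRecord(big_end, s_big);
    small_kernel<<<1, 1, 0, s_small>>>(flag);
    cudaEventRecord(small_end, s_small);
    cudaDeviceSynchronize();

    float ms_big = 0.0f, ms_small = 0.0f;
    cudaEventElapsedTime(&ms_big, start, big_end);
    cudaEventElapsedTime(&ms_small, start, small_end);
    std::printf("big kernel done at ~%.3f ms, small kernel done at ~%.3f ms\n",
                ms_big, ms_small);
    std::printf("small finished first: %s\n",
                small_finished_first(ms_small, ms_big) ? "yes" : "no");

    cudaFree(x);
    cudaFree(flag);
    cudaEventDestroy(start);
    cudaEventDestroy(big_end);
    cudaEventDestroy(small_end);
    cudaStreamDestroy(s_big);
    cudaStreamDestroy(s_small);
    return 0;
}
```

Swapping the two launches gives the other case described above, where the small kernel is already running before the big one is launched. Whatever the result, it reflects one GPU, driver and problem size, not a guaranteed scheduling rule.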