What does a grid size greater than one in the z dimension mean for cuBLAS GEMMs?


When I profile cuBLAS GEMM kernels on V100 GPUs, I see that the grid size in the z dimension is larger than one. Here is an example:

gemm shape = 512, 8192, 8192

I am trying to understand what grid.size.z = 3 means. Does it mean that the K dimension is tiled across different thread blocks?
If so, how is the reduction across those thread blocks performed? I don't see a separate reduction kernel, which is what I would usually expect when a GEMM is optimized for k >> m, n.

How can I investigate this further using ncu?

I am running cuBLAS GEMMs on a V100 with the f16 data type (tensor core ops).



I have the same question. Have you figured out the reason?

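From my digging, there are two main possible reasons:

  1. K is tiled across different thread blocks (split-K), and some of those blocks then run an epilogue that includes the reduction along the K dimension, so no separate reduction kernel shows up.
  2. Batched GEMMs are being run, in which case the z dimension of the grid corresponds to the batch index.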
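
To make the first case concrete, here is a minimal CPU-side sketch in plain Python of what split-K computes. It is only an illustration of the idea, not what cuBLAS actually runs: the function names `split_k_gemm` and `reference_gemm` and the `grid_z` parameter are made up for this example, and the loop over `z` stands in for `blockIdx.z`. Each z-slice produces a partial result over its own chunk of K, and the partial results are summed at the end.

```python
def reference_gemm(A, B):
    """Plain triple-loop GEMM, used only to check the split-K result."""
    M, K, N = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)]
            for i in range(M)]


def split_k_gemm(A, B, grid_z):
    """Hypothetical split-K sketch: K is divided into grid_z contiguous chunks."""
    M, K, N = len(A), len(B), len(B[0])
    chunk = (K + grid_z - 1) // grid_z  # ceil-divide K across the z-slices

    partials = []
    for z in range(grid_z):  # stands in for blockIdx.z on the GPU
        k_begin = z * chunk
        k_end = min(k_begin + chunk, K)
        # Partial product over this slice's range of K only.
        partials.append([[sum(A[i][k] * B[k][j] for k in range(k_begin, k_end))
                          for j in range(N)] for i in range(M)])

    # Reduction along K: sum the partial results from all z-slices.
    # On a GPU this could be a separate kernel or part of an epilogue.
    C = [[sum(p[i][j] for p in partials) for j in range(N)] for i in range(M)]
    return C, partials
```

With `grid_z = 3`, calling `split_k_gemm(A, B, 3)` returns the full product together with the three partial results, and the full product should match `reference_gemm(A, B)`. In the batched case (reason 2) there would be no final summation, because each z index would own an independent output matrix.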

Let me know if these make sense to you.
