What does it mean when the grid size in the z dimension is greater than one in cuBLAS GEMMs?

mustafaali · July 13, 2023, 9:36pm

Hi,

When I profile GEMM kernels on V100 GPUs using cuBLAS, I see that the grid size is larger than one in the z dimension. Here is an example:

gemm shape = 512, 8192, 8192

I am trying to understand what it means to have grid.size.z = 3. Does that mean tiling is happening in the K dimension across different thread blocks?
If so, how is the reduction happening across those thread blocks? I don't see a separate reduction kernel, which is what you usually see when a GEMM with k >> m, n is optimized this way.

How can I investigate this further using ncu?

I am running a cuBLAS GEMM on a V100 using the f16 data type (tensor core op).

Thanks,
Mustafa.

hjonz · July 18, 2023, 4:22am

I have the same question. Do you understand the reason now?

mustafaali · August 24, 2023, 11:26pm

There are two main reasons, as far as I could find by digging:

1. K is tiled across different thread blocks, and some of those blocks then run an epilogue that includes the K-dimension reduction.
2. Batched GEMMs are being run, so the z dimension of the grid corresponds to the batch size.
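
The first reason (split-K) can be sketched as a plain Python emulation on the host. This is not cuBLAS code, and cuBLAS's actual kernels are not public; it only illustrates the general technique. It assumes each z-slice of the grid handles one contiguous chunk of K and adds its partial result straight into C, which is why no separate reduction kernel would show up in the profile. The names `split_k_gemm` and `reference_gemm` and the tiny matrices are hypothetical.

```python
# Host-side emulation of a split-K GEMM launch.
# Illustrative only: this is an assumption about the general technique,
# not a description of what cuBLAS does internally.

def split_k_gemm(A, B, split_k):
    """Compute C = A @ B with K partitioned across `split_k` z-slices."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    chunk = (K + split_k - 1) // split_k      # K elements per z-slice (ceil division)
    for block_z in range(split_k):            # stands in for blockIdx.z
        k_begin = block_z * chunk
        k_end = min(k_begin + chunk, K)
        for i in range(M):
            for j in range(N):
                partial = 0.0
                for k in range(k_begin, k_end):
                    partial += A[i][k] * B[k][j]
                # In-kernel reduction: on a GPU this step would be an atomic add
                # or a serialized epilogue, done inside the same kernel.
                C[i][j] += partial
    return C


def reference_gemm(A, B):
    """Plain triple-loop GEMM used to check the split-K result."""
    M, K, N = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)]
            for i in range(M)]


# Small integer-valued inputs so the floating-point sums are exact.
A = [[float(i + k) for k in range(8)] for i in range(2)]   # 2 x 8
B = [[float(k - j) for j in range(3)] for k in range(8)]   # 8 x 3

C_split = split_k_gemm(A, B, split_k=3)   # like grid.size.z = 3
C_ref = reference_gemm(A, B)
print(C_split == C_ref)
```

In the second reason (batched GEMM), the z index would instead select which matrix pair of the batch a block works on, and no reduction across z is needed.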

Let me know if these make sense to you.