When I profile cuBLAS GEMM kernels on V100 GPUs, I see that the grid size is larger than one in the z dimension. Here is an example:
gemm shape = 512, 8192, 8192
I am trying to understand what it means to have grid.size.z = 3. Does that mean tiling is happening in the K dimension across different thread blocks?
If so, how is the reduction performed across those thread blocks? I don’t see a separate reduction kernel, which is what I would usually expect when a GEMM kernel is optimized for k >> m, n.
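To make the question concrete, here is a minimal Python sketch of what I mean by tiling in K (split-K). It only illustrates the idea and is not a claim about how cuBLAS implements it. The function names `partial_gemm` and `split_k_gemm` and the tiny matrix sizes are made up for this example, and the loop index `z` stands in for what I assume blockIdx.z would select.

```python
def partial_gemm(a, b, k_begin, k_end):
    """Partial product of a (M x K) and b (K x N), summing only k in [k_begin, k_end)."""
    m, n = len(a), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(k_begin, k_end))
             for j in range(n)]
            for i in range(m)]


def split_k_gemm(a, b, splits):
    """Split K into `splits` slices, compute one partial product per slice, then sum them."""
    k_total = len(b)
    chunk = -(-k_total // splits)  # ceiling division, so K need not divide evenly
    m, n = len(a), len(b[0])
    c = [[0] * n for _ in range(m)]
    for z in range(splits):  # assumption: z plays the role of blockIdx.z
        k_begin = min(z * chunk, k_total)
        k_end = min(k_begin + chunk, k_total)
        partial = partial_gemm(a, b, k_begin, k_end)
        # This accumulation is the reduction step I am asking about.
        for i in range(m):
            for j in range(n):
                c[i][j] += partial[i][j]
    return c


# Hypothetical small example: M = 2, N = 2, K = 8, split three ways.
a = [[1, 2, 3, 4, 5, 6, 7, 8],
     [8, 7, 6, 5, 4, 3, 2, 1]]
b = [[k, k + 1] for k in range(8)]

# With integer inputs the split-K result matches the full product exactly.
assert split_k_gemm(a, b, splits=3) == partial_gemm(a, b, 0, len(b))
```

In this sketch the accumulation into `c` happens in the same loop that computes the partial products. On the GPU I would expect either a separate reduction kernel or some form of in-kernel accumulation across thread blocks, and I cannot tell from the profile which one applies.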
How can I investigate this further using ncu?
I am running cuBLAS GEMM on a V100 using the f16 data type (tensor core op).