What does it mean when the grid size in the z dimension is greater than one in cuBLAS GEMMs?


When I profile cuBLAS GEMM kernels on V100 GPUs, I can see that the grid size is larger than one in the z dimension. Here is an example:

gemm shape = 512, 8192, 8192

I am trying to understand what it means to have grid.size.z = 3. Does that mean tiling is happening in the K dimension across different thread blocks?
If so, how does the reduction happen across those thread blocks? I don't see a separate reduction kernel, which is what I would usually expect when a GEMM kernel is optimized for k >> m, n.

How can I investigate this further using ncu?

I am running cuBLAS GEMM on a V100 using the f16 data type (tensor core op).


The internal decomposition done by cuBLAS is specific to the way the library is implemented. How the reduction is done, and so on, is an implementation detail of the library. In general, CUDA grids can be 1-, 2-, or 3-dimensional. The choice is based on what offers the most convenient calculation for your indices. The cuBLAS forum (GPU-Accelerated Libraries - NVIDIA Developer Forums) may be able to provide details of the implementation or reduction, but I can't be sure.
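To make the index-calculation point concrete, here is a minimal CPU-side sketch in Python of how a 3D grid could map its z index to a slice of the K dimension (often called split-K). This is purely illustrative and is not cuBLAS's actual implementation, which this thread does not reveal. The function names, tile sizes, and the choice to accumulate partial sums directly into the output (standing in for something like atomic adds or a workspace reduction inside the same kernel) are all assumptions made for the example.

```python
def ceil_div(a, b):
    """Integer ceiling division, as used when sizing a launch grid."""
    return (a + b - 1) // b


def reference_gemm(A, B, m, n, k):
    """Plain triple-loop C = A @ B, used only to check the sketch."""
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def split_k_gemm(A, B, m, n, k, tile_m, tile_n, split_k):
    """Simulate a hypothetical split-K GEMM launch on the CPU.

    grid.x and grid.y tile the output matrix; grid.z selects a K slice.
    Every simulated block adds its partial result into C, which stands in
    for an in-kernel reduction (an assumption, not confirmed cuBLAS behavior).
    """
    grid = (ceil_div(n, tile_n), ceil_div(m, tile_m), split_k)  # (x, y, z)
    k_chunk = ceil_div(k, split_k)
    C = [[0.0] * n for _ in range(m)]

    for bz in range(grid[2]):              # plays the role of blockIdx.z
        k_lo = bz * k_chunk
        k_hi = min(k, k_lo + k_chunk)
        for by in range(grid[1]):          # plays the role of blockIdx.y
            for bx in range(grid[0]):      # plays the role of blockIdx.x
                for i in range(by * tile_m, min(m, (by + 1) * tile_m)):
                    for j in range(bx * tile_n, min(n, (bx + 1) * tile_n)):
                        partial = 0.0
                        for p in range(k_lo, k_hi):
                            partial += A[i][p] * B[p][j]
                        # Partial sums from different z slices meet here.
                        C[i][j] += partial
    return grid, C


if __name__ == "__main__":
    m, n, k = 4, 6, 9
    # Small integer-valued inputs keep the floating-point sums exact.
    A = [[float((i + p) % 5) for p in range(k)] for i in range(m)]
    B = [[float((p * 2 + j) % 7) for j in range(n)] for p in range(k)]
    grid, C = split_k_gemm(A, B, m, n, k, tile_m=2, tile_n=4, split_k=3)
    print("grid (x, y, z):", grid)
    print("matches reference:", C == reference_gemm(A, B, m, n, k))
```

If a real kernel followed a scheme like this, a profiler would show a single kernel with grid z greater than one and no separate reduction kernel. Whether that is what happens in the profiled cuBLAS kernel is something only the library's developers can confirm.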
