Strided batched GEMM is far slower when second matrix is transposed

I think it's expected behavior, in the sense that if I write "ordinary" CUDA C++/CUBLAS code to do the same operation, I witness the same thing for those matrix sizes (3x3): roughly the same performance ratio, and the same GPU kernels being called under the hood in each case.
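To be concrete about what I mean by "ordinary" CUDA C++/CUBLAS code, here is a minimal sketch (error checking and data initialization omitted; the 3x3 size and 1,000,000 batch count are taken from your test case) that issues the same strided batched GEMM with and without the transpose flag on the second matrix:

```cpp
// compile with: nvcc -o batched_test batched_test.cu -lcublas
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int n = 3;                 // 3x3 matrices, as in the original test case
    const int batch = 1000000;       // 1,000,000 matrices in the batch
    const long long stride = (long long)n * n;

    float *A, *B, *C;
    cudaMalloc(&A, sizeof(float) * stride * batch);
    cudaMalloc(&B, sizeof(float) * stride * batch);
    cudaMalloc(&C, sizeof(float) * stride * batch);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;

    // NN case: C = A * B for every matrix in the batch
    cublasSgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                              n, n, n, &alpha,
                              A, n, stride,
                              B, n, stride,
                              &beta,
                              C, n, stride, batch);

    // NT case: C = A * B^T -- same data, only the transpose flag changes,
    // yet CUBLAS may dispatch a different (slower) kernel for these sizes
    cublasSgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_T,
                              n, n, n, &alpha,
                              A, n, stride,
                              B, n, stride,
                              &beta,
                              C, n, stride, batch);

    cudaDeviceSynchronize();
    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```

Running something like that under a profiler (e.g. Nsight Systems) shows which kernels CUBLAS dispatches for each transpose setting, and the timing difference shows up there.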

Yes, it's far slower. I don't think calling it "non-batched gemm" is accurate. For your test case of 1,000,000 matrices, there are only 16 kernel calls being made, regardless of the choice of TN, TT, NT, or NN. So each kernel call is performing approximately 62,500 matrix-multiply operations, regardless of kernel or transpose settings.

You can file a bug if you wish. One observation I made is that the underlying behavior depends to some degree on matrix size (and perhaps batch size). If I change the matrix size to 1024x1024, for example, all transpose variants call an ampere_sgemm_128x128_XY kernel variant (and the differences in performance largely disappear). This suggests to me that the CUBLAS developers have broken the problem domain into segments based on dimensions, and are using different methods in different segments. It's possible there is a segment they didn't "optimize" for one reason or another.
