I have been studying CUTLASS 2.0, NVIDIA's newest collection of BLAS-like templates.
I think I understand the tile dimensions M and N: they determine how the final output matrix is partitioned into tiles, and that partitioning affects GEMM performance.
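To make my understanding of M and N concrete, here is a minimal CPU sketch in plain C++. It is not CUTLASS code: the tile sizes `kTileM` and `kTileN` and the helpers `tiled_gemm` and `num_output_tiles` are hypothetical names chosen only for illustration. Each iteration of the two outer loops owns one M-by-N tile of the output, and the loop over the reduction dimension is deliberately left untiled.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical tile sizes for illustration only; not CUTLASS defaults.
constexpr int kTileM = 4;
constexpr int kTileN = 8;

// Computes row-major C (M x N) = A (M x K) * B (K x N).
// Each (tile_m, tile_n) pair owns one kTileM x kTileN block of C.
std::vector<float> tiled_gemm(const std::vector<float>& A,
                              const std::vector<float>& B,
                              int M, int N, int K) {
    std::vector<float> C(static_cast<std::size_t>(M) * N, 0.0f);
    for (int tile_m = 0; tile_m < M; tile_m += kTileM) {
        for (int tile_n = 0; tile_n < N; tile_n += kTileN) {
            // Clamp the tile at the matrix edge when M or N is not a multiple
            // of the tile size.
            const int m_end = std::min(tile_m + kTileM, M);
            const int n_end = std::min(tile_n + kTileN, N);
            for (int i = tile_m; i < m_end; ++i) {
                for (int j = tile_n; j < n_end; ++j) {
                    float acc = 0.0f;
                    // Reduction over the full K extent, not tiled here.
                    for (int k = 0; k < K; ++k) {
                        acc += A[i * K + k] * B[k * N + j];
                    }
                    C[i * N + j] = acc;
                }
            }
        }
    }
    return C;
}

// Number of output tiles produced by the M/N partitioning (ceiling division).
int num_output_tiles(int M, int N) {
    return ((M + kTileM - 1) / kTileM) * ((N + kTileN - 1) / kTileN);
}
```

With these assumed sizes, a 10 x 10 output is split into 3 x 2 = 6 tiles, so changing the M and N tile sizes directly changes how many independent pieces of work exist.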
But I do not understand the role of the tile dimension K. Is it related to the software-pipelining depth in CUTLASS, or to its vectorization?