WMMA load_matrix_sync API

Following the example from Programming Tensor Cores (code here), if I lay out A and B in row-major order and run wmma_example (with alpha and beta set to 1 for simplicity), I get the right answer for C. However, the kernel declares A and B as col-major and also indexes A and B in column-major order here.

Also, the leading dimensions for A and B (lda and ldb) are M and K respectively. The documentation states that, for col-major layouts, the leading dimension is the stride length between columns. That is confusing: isn't the stride between columns 1 for column-major by definition?

What am I missing?

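The leading dimension in column-major storage is the spacing, in elements, between the first element of column j and the first element of column j+1. It is the stride between consecutive rows within a column that is 1. The leading dimension must be at least the number of rows, so unless the matrix has only a single row it is greater than 1.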
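To make the indexing concrete, here is a minimal host-side C++ sketch (not taken from the wmma_example code). The helper names col_major_index and make_col_major are hypothetical and only for illustration. It shows that stepping one row moves 1 element while stepping one column moves ld elements, which is why a tightly packed M x K column-major A has lda = M.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: linear offset of element (row, col) in a
// column-major matrix whose leading dimension is ld.
inline std::size_t col_major_index(std::size_t row, std::size_t col, std::size_t ld) {
    return row + col * ld;
}

// Hypothetical helper: build a rows x cols column-major matrix where each
// value encodes its coordinates as 10*row + col. Padding elements
// (present when ld > rows) are set to -1.
inline std::vector<int> make_col_major(std::size_t rows, std::size_t cols, std::size_t ld) {
    assert(ld >= rows);  // a column must fit inside one leading-dimension stride
    std::vector<int> a(ld * cols, -1);
    for (std::size_t col = 0; col < cols; ++col) {
        for (std::size_t row = 0; row < rows; ++row) {
            a[col_major_index(row, col, ld)] = static_cast<int>(10 * row + col);
        }
    }
    return a;
}
```

For a 4 x 3 matrix with ld = 4, element (1, 0) sits at offset 1 and element (0, 1) sits at offset 4.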

Note that in BLAS primitives involving matrices (of which GEMM is an important one), leading dimensions are generally used to allow easy processing of sub-matrices contained within a larger matrix. In that case, for column-major storage, the spacing between columns of the sub-matrix is the column “height” of the larger matrix it is part of, and that height is the leading dimension.
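The sub-matrix case can be sketched in plain C++ as well. ColMajorView and sub_view are hypothetical names invented for this illustration, not part of BLAS or WMMA. The point is that the sub-matrix gets a new base pointer and new dimensions but keeps the parent's leading dimension.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical non-owning view of a column-major matrix or sub-matrix.
struct ColMajorView {
    const float* data;  // pointer to element (0, 0) of this view
    std::size_t rows;
    std::size_t cols;
    std::size_t ld;     // elements between the starts of adjacent columns

    float at(std::size_t r, std::size_t c) const {
        assert(r < rows && c < cols);
        return data[r + c * ld];
    }
};

// Hypothetical helper: view of the rows x cols block of `parent` whose
// top-left corner is (r0, c0). Only the base pointer and the dimensions
// change; the leading dimension is inherited from the parent.
inline ColMajorView sub_view(const ColMajorView& parent,
                             std::size_t r0, std::size_t c0,
                             std::size_t rows, std::size_t cols) {
    assert(r0 + rows <= parent.rows && c0 + cols <= parent.cols);
    return ColMajorView{parent.data + r0 + c0 * parent.ld, rows, cols, parent.ld};
}
```

A 3 x 2 block taken out of a 6 x 5 parent therefore still has ld = 6, not 3.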