I have a question regarding the use of WMMA with the CUDA API: it doesn't come with a built-in swizzle, right? If that's the case, reading from global memory into SMEM and then using wmma::load_matrix_sync would likely cause many bank conflicts when the fragments are read back out of SMEM.
Does this mean we are expected to read directly from global memory into registers instead of staging through SMEM? If so, wouldn't that be quite unfriendly for data reuse in GEMM operations?
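For reference, a common workaround is to pad ("skew") the shared-memory leading dimension so that consecutive rows start in different banks, rather than implementing a true swizzle. Below is a minimal sketch, assuming sm_70+, one warp per block, and 16x16x16 half-precision tiles; the names wmma_tile_demo, SKEW, and LDS are illustrative, not from any library:

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

constexpr int M = 16, N = 16, K = 16;
// Pad the SMEM row stride by 8 halves (16 bytes). The padded stride (24)
// is still a multiple of 8, so it remains a legal ldm for __half fragments.
constexpr int SKEW = 8;
constexpr int LDS = K + SKEW;

// Single-tile demo kernel; launch with one warp (blockDim.x == 32).
__global__ void wmma_tile_demo(const half* A, const half* B, float* C) {
    // load_matrix_sync requires a 256-bit aligned base pointer.
    __shared__ alignas(32) half smemA[M][LDS];
    __shared__ alignas(32) half smemB[K][LDS];

    // Cooperatively stage one 16x16 tile of A and B into padded SMEM.
    for (int i = threadIdx.x; i < M * K; i += blockDim.x) {
        smemA[i / K][i % K] = A[i];
        smemB[i / K][i % K] = B[i];
    }
    __syncthreads();

    wmma::fragment<wmma::matrix_a, M, N, K, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, M, N, K, half, wmma::row_major> bFrag;
    wmma::fragment<wmma::accumulator, M, N, K, float> cFrag;
    wmma::fill_fragment(cFrag, 0.0f);

    // The padded stride rotates each row's starting bank, reducing the
    // conflicts that a dense 16-element stride would cause here.
    wmma::load_matrix_sync(aFrag, &smemA[0][0], LDS);
    wmma::load_matrix_sync(bFrag, &smemB[0][0], LDS);
    wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);

    // Store the 16x16 result tile directly to global memory.
    wmma::store_matrix_sync(C, cFrag, N, wmma::mem_row_major);
}
```

Padding trades a little SMEM capacity for fewer conflicts; a true swizzled layout (as used in libraries like CUTLASS) avoids the wasted bytes but has to be implemented by hand, since WMMA itself provides none.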
load_matrix_sync
Waits until all warp lanes have arrived at load_matrix_sync and then loads the matrix fragment a from memory. mptr must be a 256-bit aligned pointer pointing to the first element of the matrix in memory. ldm describes the stride in elements between consecutive rows (for row major layout) or columns (for column major layout) and must be a multiple of 8 for __half element type or a multiple of 4 for float element type (i.e., a multiple of 16 bytes in both cases). If the fragment is an accumulator, the layout argument must be specified as either mem_row_major or mem_col_major. For matrix_a and matrix_b fragments, the layout is inferred from the fragment's layout parameter. The values of mptr, ldm, layout, and all template parameters for a must be the same for all threads in the warp. This function must be called by all threads in the warp, or the result is undefined.