Bank Conflicts When Using wmma::load_matrix in CUDA without Swizzle?

I have a question about the WMMA CUDA API: it doesn’t come with a built-in swizzle, right? If that’s the case, staging data from global memory into SMEM and then calling wmma::load_matrix_sync would likely cause many bank conflicts when the fragment is read out of SMEM.

Does this mean we are expected to load directly from global memory into registers instead of going through SMEM? If so, wouldn’t that be quite unfriendly for data reuse in GEMM operations?
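For context, the usual workaround without a hardware swizzle is to pad (skew) the shared-memory leading dimension so consecutive rows start in different bank phases. Below is a minimal, hypothetical sketch of that pattern; the kernel name, tile size, and SKEW value are my own assumptions, and the skew must keep ldm a multiple of 8 for __half (16 bytes), per the documentation excerpt further down.

```cuda
#include <mma.h>
using namespace nvcuda;

#define TILE 16
#define SKEW 8  // extra halves per row (assumption; tune per access pattern),
                // keeps ldm = TILE + SKEW a multiple of 8 as required for __half

// Hypothetical sketch: stage a 16x16 half tile through padded SMEM,
// then feed it to WMMA. The padding shifts each row's starting bank,
// which is the common way to reduce bank conflicts without a swizzle.
__global__ void wmma_smem_sketch(const half* A, int lda) {
    __shared__ __align__(32) half smemA[TILE][TILE + SKEW];

    // Cooperative copy global -> shared (single warp, illustrative only).
    for (int i = threadIdx.x; i < TILE * TILE; i += blockDim.x) {
        int r = i / TILE, c = i % TILE;
        smemA[r][c] = A[r * lda + c];
    }
    __syncthreads();

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    // ldm = TILE + SKEW = 24, a multiple of 8, so the call is legal.
    wmma::load_matrix_sync(a, &smemA[0][0], TILE + SKEW);
    // ... mma_sync and store omitted ...
}
```

Reading directly from global memory into fragments is possible, but it forfeits the reuse that SMEM tiling provides, which is why padded SMEM staging is the typical GEMM approach.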

load_matrix_sync waits until all warp lanes have arrived at load_matrix_sync and then loads the matrix fragment a from memory. mptr must be a 256-bit aligned pointer pointing to the first element of the matrix in memory. ldm describes the stride in elements between consecutive rows (for row major layout) or columns (for column major layout) and must be a multiple of 8 for __half element type or multiple of 4 for float element type (i.e., multiple of 16 bytes in both cases). If the fragment is an accumulator, the layout argument must be specified as either mem_row_major or mem_col_major. For matrix_a and matrix_b fragments, the layout is inferred from the fragment’s layout parameter. The values of mptr, ldm, layout, and all template parameters for a must be the same for all threads in the warp. This function must be called by all threads in the warp, or the result is undefined.
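A small sketch of the accumulator rules quoted above: the layout argument must be passed explicitly for accumulator fragments, ldm must be a multiple of 4 for float, and mptr must be 256-bit aligned. The kernel name and the per-element scaling are illustrative assumptions.

```cuda
#include <mma.h>
using namespace nvcuda;

// Illustrative sketch: round-trip an accumulator tile through registers.
// C must be 256-bit aligned and ldc a multiple of 4 (float element type).
__global__ void accum_roundtrip(float* C, int ldc) {
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    // Accumulators have no layout template parameter, so the memory
    // layout is given explicitly at load/store time.
    wmma::load_matrix_sync(acc, C, ldc, wmma::mem_row_major);

    for (int i = 0; i < acc.num_elements; ++i)
        acc.x[i] *= 2.0f;  // arbitrary per-element work (assumption)

    wmma::store_matrix_sync(C, acc, ldc, wmma::mem_row_major);
}
```

All threads in the warp must reach these calls with identical mptr, ldm, and layout arguments, as the excerpt notes.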