I originally posted this on Reddit before realizing it fits better here.

I’m looking at an MLIR CUDA example that loads a 128×64 tile of matrix A and a 64×128 tile of matrix B into shared memory, padding them to 128×72 and 64×136 respectively (so 8 columns are added to each tile). The warp tile size is 16×16, and the input matrices are fp16, so 2 bytes per element.

For that 16×16 warp tile, a warp of 32 threads gets 8 data points from matrix A and 8 data points from matrix B on each trip, so it needs to make 16 trips to shared memory?

Since each bank is 4 bytes wide, will the tiles of matrix A and matrix B get loaded 2 elements per bank? Is this blog post too old to apply: Using Shared Memory in CUDA C/C++ | NVIDIA Technical Blog? My GPU is sm_75 (Turing).

So my questions are: what are all 3 dimensions of my shared memory, and how do the tiles get loaded in? And with the padding scheme above, are they just adding columns of zeros to the very right of each tile, or are those zero columns interleaved strategically?
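To make the bank-conflict part of my question concrete, here is how I’m computing bank indices, assuming 32 banks of 4 bytes each (as on Turing) and a row-major fp16 tile in shared memory. The function names and the column-access pattern are my own illustration, not taken from the MLIR example:

```python
# Assumptions: row-major fp16 tile in shared memory, 32 banks x 4 bytes (Turing).
ELEM_BYTES = 2   # fp16
BANK_BYTES = 4
NUM_BANKS = 32

def bank(row, col, row_elems):
    """Bank index of element (row, col) in a row-major tile with row_elems columns."""
    byte_offset = (row * row_elems + col) * ELEM_BYTES
    return (byte_offset // BANK_BYTES) % NUM_BANKS

def column_conflict(row_elems, col=0):
    """Worst-case conflict degree when 32 threads each read a different
    row of the same column: how many of those rows land in one bank."""
    banks = [bank(r, col, row_elems) for r in range(32)]
    return max(banks.count(b) for b in set(banks))

print(column_conflict(64))  # A tile as loaded, 64 fp16 per row
print(column_conflict(72))  # A tile padded to 72 fp16 per row (+16 bytes)
```

With my numbers this prints 32 for the unpadded 64-wide rows (every row of a column starts at bank 0, a full 32-way conflict) and 4 for the 72-wide padded rows, since each row’s start shifts by 4 banks. That is why I suspect the padding is for bank conflicts rather than zero-fill of the math, but I’d like confirmation.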