No, that's not correct. In each case (for A and for B) we are loading a square tile into a shared memory array. Loading this tile does not determine the memory storage order. The tile is simply a "snapshot" or copy of a particular section of the underlying matrix, in the same order. There is some indication of the expected memory storage order in the actual multiplication loop:

Since the reference into A_s there effectively selects a row of A_s (the column varies with the loop index, but the row is constant across the loop) and the reference into B_s effectively selects a column of B_s, we can expect that this is a normal row-by-column vector dot product. That is consistent with both matrices A and B being stored in row-major order.

The indexing differs between the loading of A_s and B_s because the selection of tiles to be loaded moves horizontally across A and vertically down B. To compute a complete vector dot product of one row of A by one column of B, we need the whole row of A and the whole column of B. Therefore the successive tiles for A_s are chosen horizontally across A (following the row direction) and those for B_s vertically down B (following the column direction).
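To make the tile movement concrete, here is a minimal sketch of a tiled kernel of this general shape. It is not necessarily the code from the question: the names `tiledMatMul`, `TILE`, `row`, `col` and `t` are my own, and it assumes square N x N row-major matrices with N a multiple of the tile width. Error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

#define TILE 16  // hypothetical tile width; must match the block dimensions

// C = A * B, all N x N, row-major. Assumes N is a multiple of TILE.
__global__ void tiledMatMul(const float *A, const float *B, float *C, int N)
{
    __shared__ float A_s[TILE][TILE];
    __shared__ float B_s[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;  // row of C this thread computes
    int col = blockIdx.x * TILE + threadIdx.x;  // column of C this thread computes
    float sum = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // A tile: the row is fixed, t advances the column (moves across A).
        A_s[threadIdx.y][threadIdx.x] = A[row * N + (t * TILE + threadIdx.x)];
        // B tile: the column is fixed, t advances the row (moves down B).
        B_s[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();  // both tiles fully loaded before anyone reads them

        // Row of A_s times column of B_s: an ordinary row-by-column dot product.
        for (int k = 0; k < TILE; ++k)
            sum += A_s[threadIdx.y][k] * B_s[k][threadIdx.x];
        __syncthreads();  // finish reading before the next load overwrites the tiles
    }
    C[row * N + col] = sum;
}

int main()
{
    const int N = 64;
    const size_t bytes = N * N * sizeof(float);
    std::vector<float> hA(N * N), hB(N * N), hC(N * N);
    for (int i = 0; i < N * N; ++i) {
        hA[i] = (float)(i % 7);  // small integers keep float sums exact
        hB[i] = (float)(i % 5);
    }

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid(N / TILE, N / TILE);
    tiledMatMul<<<grid, block>>>(dA, dB, dC, N);
    cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);

    // Compare against a plain row-major CPU reference.
    int mismatches = 0;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            float ref = 0.0f;
            for (int k = 0; k < N; ++k)
                ref += hA[i * N + k] * hB[k * N + j];
            if (fabsf(ref - hC[i * N + j]) > 1e-3f) ++mismatches;
        }
    printf("mismatches: %d\n", mismatches);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return mismatches != 0;
}
```

Note that both global loads use the same row-major formula, `row_index * N + column_index`. The only difference is which of the two indices the tile counter `t` feeds into, and that is what produces the different-looking indexing for A_s and B_s.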