I’m trying to understand the permuted shared memory layout used in CUTLASS as part of the data pipeline to feed mma.sync tensor core operations, as described in e.g.
My confusion comes from not understanding how the layout used to store data loaded from global memory into shared memory can be different from the layout subsequently used to load that data from shared memory into registers with the LDMATRIX instruction.
Specifically, I understand how the store layout shown here
Without looking at the source code: the only thing one must make sure of is that ldmatrix loads the correct data to each thread (the layout and order are not important). Afterwards, before doing any matrix operation on the data, the threads can undo the XOR operation by conditionally (depending on the lane id) doing some swaps between the loaded data elements (swaps between a thread's own registers).
I am just wondering whether, for performance reasons (assuming FP16 data), one would try to choose a layout where the correct two FP16 values already sit together in one 32-bit register. But with the permute instruction, which combines bytes from two operands, it would probably make no performance difference compared to select and similar operations.
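For illustration, here is a minimal sketch of the kind of byte-permute fix-up I have in mind, using CUDA's __byte_perm intrinsic (which compiles to PRMT); the function name and the particular selector are only illustrative and are not taken from CUTLASS:

```cpp
// Hypothetical register fix-up: gather the lower fp16 of two source registers
// into one 32-bit register with a single PRMT.  Which sources and which
// selector to use would depend on the lane id; this only shows the mechanism.
__device__ unsigned int gather_two_halves(unsigned int lo_src, unsigned int hi_src)
{
    // Selector 0x5410: result bytes 0-1 come from bytes 0-1 of lo_src
    // (its lower fp16), result bytes 2-3 come from bytes 0-1 of hi_src.
    return __byte_perm(lo_src, hi_src, 0x5410);
}
```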
I figured out what is happening here. The indexing in the “Loading from Shared Memory to Registers” slide is transposed relative to the indexing in the “Load from Global/Store to Shared” slide. In other words, in the first slide T0, …, T3 load the first row of the tile from global memory (it is drawn transposed), whereas in the second slide T0, …, T15 are loading the first column of the tile.
So, for example, T1 in the second slide refers to the same element as T4 in the first slide, T17 to T5, and so on. This mapping, plus the permutation performed when storing the elements to shared memory, explains the indexing in the load-matrix-from-shared-memory slide.
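To make that mapping concrete, here is a tiny sketch, assuming (as I read the slides) that the first slide numbers threads four per row and the second slide sixteen per column; the helper name is made up for illustration:

```cpp
// Map a thread index from the "Loading from Shared Memory to Registers" slide
// to the equivalent thread index in the "Load from Global / Store to Shared"
// slide, assuming a tile of 128-bit elements that is 4 wide and 16 tall.
int ldsm_to_store_index(int s)
{
    int row = s % 16;      // position down the column in the second slide
    int col = s / 16;      // which column of 128-bit elements
    return row * 4 + col;  // row-major thread numbering of the first slide
}
// e.g. ldsm_to_store_index(1)  == 4   (T1  -> T4)
//      ldsm_to_store_index(17) == 5   (T17 -> T5)
```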
So @Curefab, thanks for your reply, but in fact there is no additional swapping between registers after the LDMATRIX call.
How did you figure it out? The matrix is column-major in global memory and row-major in shared memory.
When loading from global memory, the first column block T0-T7 is flattened into a row, and the second column block T8-T15 is also flattened, but with a permutation of the indices.
But since each column is a bank, even if we don't apply the permutation, the store to shared memory is also bank-conflict free, isn't it?
The unpermuted gmem-smem layout
| Bank 1 | Bank 2 | Bank 3 | Bank 4 | Bank 5 | Bank 6 | Bank 7 | Bank 8 |
|--------|--------|--------|--------|--------|--------|--------|--------|
| T0     | T1     | T2     | T3     | T4     | T5     | T6     | T7     |
| T8     | T9     | T10    | T11    | T12    | T13    | T14    | T15    |
What is the difference between the above and the permuted one below?
Permuted smem layout?
Yes, that's correct: the store from global to shared would be bank-conflict free without the permuted layout. However, the load from shared to registers (the LDMATRIX call) does require the permuted layout to be conflict free.
For example, to load the upper-left matrix, LDMATRIX loads the first eight 128-bit vectors in the "Logical view of thread block tile", labelled T0-T7. These correspond to T0, T4, T8, T12, T16, T20, T24, T28 in the "Load from Global / Store to Shared" diagram. Without the permuted layout, T0, T8, T16, T24 would all be in column/bank 0 and T4, T12, T20, T28 would all be in column 4, so that load would bank-conflict.
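To see why, here is a small standalone sketch (assuming the permutation is the XOR-with-row swizzle from the slides, applied within rows of eight 128-bit elements); it just prints which column each of those eight vectors lands in with and without the permutation:

```cpp
#include <cstdio>

int main()
{
    // Unpermuted (row, column) positions of the eight 128-bit vectors that one
    // LDMATRIX call needs: only columns 0 and 4 are touched -> 4-way conflict.
    const int rows[8] = {0, 0, 1, 1, 2, 2, 3, 3};
    const int cols[8] = {0, 4, 0, 4, 0, 4, 0, 4};
    for (int i = 0; i < 8; ++i) {
        int unpermuted = cols[i];
        int permuted   = cols[i] ^ rows[i];  // XOR swizzle: eight distinct columns
        printf("vector %d: unpermuted column %d, permuted column %d\n",
               i, unpermuted, permuted);
    }
    return 0;
}
```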