Understanding CUTLASS Permuted Shared Memory Layout

I’m trying to understand the permuted shared memory layout used in CUTLASS as part of the data pipeline that feeds mma.sync tensor core operations, as described in, e.g., the CUTLASS presentation slides referenced below.

My confusion comes from not understanding how the layout used to store data loaded from global memory into shared memory can differ from the layout used to subsequently load that data from shared memory into registers with the LDMATRIX instruction.

Specifically, I understand how the store layout shown in the “Load from Global / Store to Shared” slide avoids bank conflicts, and how the bank index is computed using int store_column = (lane_id % 8) ^ (lane_id / 8);
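
For reference, here is a minimal host-side sketch (my own illustration, not CUTLASS code) of that store-side swizzle, treating each 128-bit vector as one 16-byte “column” of a 128-byte shared memory row:

#include <cstdio>

int main() {
    for (int lane_id = 0; lane_id < 32; ++lane_id) {
        int store_row    = lane_id / 8;                    // 4 rows of 8 vectors
        int store_column = (lane_id % 8) ^ (lane_id / 8);  // XOR swizzle from the slide
        int first_bank   = store_column * 4;               // a 16-byte vector spans 4 banks
        printf("T%-2d -> row %d, column %d, banks %2d..%2d\n",
               lane_id, store_row, store_column, first_bank, first_bank + 3);
    }
    // XOR with the fixed value (lane_id / 8) is a bijection on 0..7, so the 8
    // lanes of each row always land in 8 distinct columns (all 32 banks).
    return 0;
}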

But I don’t understand how the load layout shown in the “Loading from Shared Memory to Registers” slide can be different from the store layout without loading the wrong data, or how the bank indices for that layout are computed.

I am missing some fundamental point here, and any help would be appreciated.

Without looking at the source code: the only thing one must make sure of is that ldmatrix loads the correct data to each thread (the layout and ordering are not important). The threads can afterwards, before doing any matrix operation on the data, undo the XOR operation by conditionally (depending on the lane ID) swapping some of the loaded data elements between thread registers.

I am just wondering whether, for performance reasons (assuming FP16 data), one would try to choose a layout where the correct two FP16 values already sit together in one 32-bit register. But with the permute instruction, which combines bytes from two operands, it would probably make no performance difference compared to select and similar operations.
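
A hedged sketch of what I mean (my own illustration, not from CUTLASS): the permute instruction is PRMT, exposed in CUDA as __byte_perm, and it can pull the two FP16 halves out of two different 32-bit registers in a single instruction:

__device__ unsigned int pack_low_halves(unsigned int a, unsigned int b) {
    // Selector 0x5410 picks the result bytes (low to high) as a.byte0, a.byte1,
    // b.byte0, b.byte1, i.e. the low FP16 of a in the low 16 bits and the low
    // FP16 of b in the high 16 bits.
    return __byte_perm(a, b, 0x5410);
}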

I figured out what is happening here. The indexing in the “Loading from Shared Memory to Registers” slide is transposed relative to the indexing in the “Load from Global / Store to Shared” slide. In other words, in the first slide T0, …, T3 load the first row of the tile from global memory (it’s drawn transposed), whereas in the second slide T0, …, T15 load the first column of the tile.

So, for example, T1 in the second slide refers to the same element as T4 in the first slide, T17 to T5, and so on. This mapping, plus the permutation performed when storing the elements to shared memory, explains the indexing in the “Loading from Shared Memory to Registers” slide.
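
To make that mapping concrete, here is a small sketch (my own reconstruction from the two examples above, so the exact formula is an assumption): if the first slide uses 4 threads per row and the second uses 16 threads per column, then index t2 in the second slide corresponds to index 4 * (t2 % 16) + t2 / 16 in the first slide.

#include <cassert>
#include <cstdio>

// Hypothetical mapping between the two slides' thread labels, reconstructed
// from the examples T1 -> T4 and T17 -> T5 quoted above.
int second_to_first(int t2) { return 4 * (t2 % 16) + t2 / 16; }

int main() {
    assert(second_to_first(1) == 4);    // T1 in slide 2 is T4 in slide 1
    assert(second_to_first(17) == 5);   // T17 in slide 2 is T5 in slide 1
    for (int t2 = 0; t2 < 32; ++t2)
        printf("slide-2 T%-2d -> slide-1 T%d\n", t2, second_to_first(t2));
    return 0;
}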

So @Curefab, thanks for your reply, but in fact there is no additional swapping between registers after the LDMATRIX call.


How did you figure it out? The matrix is col-major in global memory and row-major in shared memory.
When loading from global memory, the first column block T0-T7 is flattened into a row, and the second column block T8-T15 is also flattened, but with a permutation of the indices.

But since each column is a bank, even if we don’t apply the permutation, the store to shared memory is still bank-conflict free, isn’t it?

The unpermuted gmem-smem layout

Bank 1 Bank 2 Bank 3 Bank 4 Bank 5 Bank 6 Bank 7 Bank 8
T0 T1 T2 T3 T4 T5 T6 T7
T8 T9 T10 T11 T12 T13 T14 T15

What is the difference between the above and the permuted one below?
Permuted smem layout?

Bank 1 Bank 2 Bank 3 Bank 4 Bank 5 Bank 6 Bank 7 Bank 8
T0 T1 T2 T3 T4 T5 T6 T7
T9 T8 T11 T10 T13 T12 T15 T14

Yes, that’s correct: the store from global to shared would be bank-conflict free even without the permuted layout. However, the load from shared to registers (the LDMATRIX call) does require the permuted layout to be conflict free.

For example, to load the upper-left matrix, LDMATRIX loads the first 8 128-bit vectors in the “Logical view of thread block tile”, labelled T0-T7. These correspond to T0, T4, T8, T12, T16, T20, T24, T28 in the “Load from Global / Store to Shared” diagram. Without the permuted layout, T0, T8, T16, T24 would all be in column/bank 0, and T4, T12, T20, T28 would all be in column 4.
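
Here is a quick host-side check of that example (my own sketch, not CUTLASS code), computing the column each of those eight store-diagram lanes ends up in with and without the XOR permutation:

#include <cstdio>

int main() {
    const int lanes[8] = {0, 4, 8, 12, 16, 20, 24, 28};
    for (int i = 0; i < 8; ++i) {
        int lane = lanes[i];
        int plain_col    = lane % 8;                  // without the permutation
        int swizzled_col = (lane % 8) ^ (lane / 8);   // with the XOR swizzle
        printf("T%-2d: unpermuted column %d, permuted column %d\n",
               lane, plain_col, swizzled_col);
    }
    // Unpermuted: columns {0,4,0,4,0,4,0,4} -> the ldmatrix reads conflict.
    // Permuted:   columns {0,4,1,5,2,6,3,7} -> one vector per column, no conflict.
    return 0;
}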

Thanks for your clarification, I understand now.
