I’m trying to understand the permuted shared memory layout used in CUTLASS as part of the data pipeline to feed mma.sync tensor core operations, as described in e.g.
My confusion comes from not understanding how the layout used to store data loaded from global memory into shared memory can be different from the layout subsequently used to load that data from shared memory into registers with the LDMATRIX instruction.
Specifically, I understand how the store layout shown here
Without looking at the source code: the only thing one must make sure of is that ldmatrix loads the correct data to each thread (the layout and order are not important). Afterwards, before doing any matrix operation on the data, the threads can undo the XOR operation by conditionally (depending on the lane id) doing some swaps between the loaded data elements (swaps between a thread's own registers).
I am just wondering whether, for performance reasons (assuming FP16 data), one would try to choose a layout where the correct two FP16 values already sit together in one 32-bit register. But with the permute instruction, which combines bytes from two operands, it would probably make no performance difference compared to select and similar operations.
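For illustration, here is a minimal sketch of the kind of byte-permute fix-up I have in mind, using CUDA's __byte_perm intrinsic (which compiles to PRMT); the function name and the particular selector are only illustrative and are not taken from CUTLASS:

```cpp
// Hypothetical register fix-up: gather the lower fp16 of two source registers
// into one 32-bit register with a single PRMT.  Which sources and which
// selector to use would depend on the lane id; this only shows the mechanism.
__device__ unsigned int gather_two_halves(unsigned int lo_src, unsigned int hi_src)
{
    // Selector 0x5410: result bytes 0-1 come from bytes 0-1 of lo_src
    // (its lower fp16), result bytes 2-3 come from bytes 0-1 of hi_src.
    return __byte_perm(lo_src, hi_src, 0x5410);
}
```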
I figured out what is happening here. The indexing in the “Loading from Shared Memory to Registers” slide is transposed relative to the indexing in the “Load from Global/Store to Shared” slide. In other words, in the first slide T0, …, T3 load the first row of the tile from global memory (it is drawn transposed), whereas in the second slide T0, …, T15 are loading the first column of the tile.
So, for example, T1 in the second slide refers to the same element as T4 in the first slide, T17 to T5, and so on. This mapping, plus the permutation performed when storing the elements to shared memory, explains the indexing in the load-matrix-from-shared-memory slide.
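To make that mapping concrete, here is a tiny sketch, assuming (as I read the slides) that the first slide numbers threads four per row and the second slide sixteen per column; the helper name is made up for illustration:

```cpp
// Map a thread index from the "Loading from Shared Memory to Registers" slide
// to the equivalent thread index in the "Load from Global / Store to Shared"
// slide, assuming a tile of 128-bit elements that is 4 wide and 16 tall.
int ldsm_to_store_index(int s)
{
    int row = s % 16;      // position down the column in the second slide
    int col = s / 16;      // which column of 128-bit elements
    return row * 4 + col;  // row-major thread numbering of the first slide
}
// e.g. ldsm_to_store_index(1)  == 4   (T1  -> T4)
//      ldsm_to_store_index(17) == 5   (T17 -> T5)
```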
So @Curefab, thanks for your reply, but in fact there is no additional swapping between registers after the LDMATRIX call.
How did you figure it out? The matrix is column-major in global memory and row-major in shared memory.
When loading from global memory, the first column block T0-T7 is flattened into a row, and the second column block T8-T15 is also flattened, but with a permutation of the indices.
But since each column is a bank, even if we don't apply the permutation, the store to shared memory is also bank-conflict free, isn't it?
The unpermuted gmem-smem layout
| Bank 1 | Bank 2 | Bank 3 | Bank 4 | Bank 5 | Bank 6 | Bank 7 | Bank 8 |
|--------|--------|--------|--------|--------|--------|--------|--------|
| T0     | T1     | T2     | T3     | T4     | T5     | T6     | T7     |
| T8     | T9     | T10    | T11    | T12    | T13    | T14    | T15    |
What is the difference between the above and the permuted one below?
Permuted smem layout?
Yes, that's correct: the store from global to shared would be bank-conflict free without the permuted layout. However, the load from shared to registers (the LDMATRIX call) does require the permuted layout to be conflict free.
For example, to load the upper-left matrix, LDMATRIX loads the first eight 128-bit vectors in the "Logical view of thread block tile", labelled T0-T7. These correspond to T0, T4, T8, T12, T16, T20, T24, T28 in the "Load from Global / Store to Shared" diagram. Without the permuted layout, T0, T8, T16, T24 would all be in column/bank 0 and T4, T12, T20, T28 would all be in column 4, so that load would bank-conflict.
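To see why, here is a small standalone sketch (assuming the permutation is the XOR-with-row swizzle from the slides, applied within rows of eight 128-bit elements); it just prints which column each of those eight vectors lands in with and without the permutation:

```cpp
#include <cstdio>

int main()
{
    // Unpermuted (row, column) positions of the eight 128-bit vectors that one
    // LDMATRIX call needs: only columns 0 and 4 are touched -> 4-way conflict.
    const int rows[8] = {0, 0, 1, 1, 2, 2, 3, 3};
    const int cols[8] = {0, 4, 0, 4, 0, 4, 0, 4};
    for (int i = 0; i < 8; ++i) {
        int unpermuted = cols[i];
        int permuted   = cols[i] ^ rows[i];  // XOR swizzle: eight distinct columns
        printf("vector %d: unpermuted column %d, permuted column %d\n",
               i, unpermuted, permuted);
    }
    return 0;
}
```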