How to Map CUTLASS and CuTe Layouts to Linear Indexes (Hierarchical)

Hi All,

I’m a beginner in CUDA and am currently exploring how indexing works in CUTLASS. From what I understand, CUTLASS used a different indexing approach before version 3.0; starting with version 3.0, it uses CuTe layouts (shapes and strides) to perform indexing across different representations. In particular, I am trying to understand how hierarchical (h-D) indexes map to linear indexes. As described below, I could not map an h-D index to its linear index and need help understanding this.

There might be gaps in my understanding, so I’d really appreciate it if you could clarify or correct me wherever needed. I would also appreciate any references or documentation that explain this mapping.

Initial Understanding

I started by working through the basic layouts before reaching the nested (hierarchical) layouts.

Purpose:- Different index representations, such as flat layouts and nested (hierarchical) layouts, abstract away the physical layout while still allowing libraries to exploit the physical organization of the GPU for parallel computation, by mapping the execution hierarchy onto the corresponding memory hierarchy.

Execution Hierarchy:- Grids=> Thread Blocks=> Threads
Memory Hierarchy:- Global Memory=> Shared Memory=> Registers

Broadly, layouts are classified into Row Major (Layout Right) and Column Major (Layout Left). For this example I am considering the Column Major layout.

Example:- A tensor of shape (4,(2,2)). Because we are considering column major, the stride for this would be (1,(4,8)).

Stride calculation for a matrix of shape (M, N, K): the strides are (1, M, MN), and the linear index of (i, j, k) is the dot product (1, M, MN) * (i, j, k) = i*1 + j*M + k*M*N, where (i, j, k) are the indexes.

I am using the formulas below to unflatten and flatten indexes.
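The stride rule above can be sketched in plain Python. This is only an illustration of the arithmetic, not the CuTe API; the helper names `column_major_strides` and `linear_index` are hypothetical, and the shape (4, 2, 2) is the flattened form of the example tensor (4,(2,2)).

```python
def column_major_strides(shape):
    """Compact column-major strides: (1, M, M*N, ...)."""
    strides = []
    running = 1
    for extent in shape:
        strides.append(running)
        running *= extent
    return tuple(strides)


def linear_index(coord, strides):
    """Dot product of a flat coordinate with its strides."""
    return sum(c * s for c, s in zip(coord, strides))


strides = column_major_strides((4, 2, 2))
print(strides)                           # (1, 4, 8)
print(linear_index((3, 1, 1), strides))  # 3*1 + 1*4 + 1*8 = 15
```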

Unflatten Second Index:-


(0,0) => "0" => 0%2 = 0, 0/2 = 0 hence 0 => (0,0) & (0,0) => (0,(0,0))
(0,1) => "1" => 1%2 = 1, 1/2 = 0 hence 1 => (1,0) & (0,1) => (0,(1,0))
(0,2) => "2" => 2%2 = 0, 2/2 = 1 hence 2 => (0,1) & (0,2) => (0,(0,1))
(0,3) => "3" => 3%2 = 1, 3/2 = 1 hence 3 => (1,1) & (0,3) => (0,(1,1))
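The unflatten steps above (modulo for the current mode, integer division to carry on to the next mode) can be written as a small loop. This is a minimal sketch in plain Python, assuming the leftmost mode varies fastest; `unflatten` is a hypothetical helper name, not a CuTe function.

```python
def unflatten(index, shape):
    """Split a linear index into a coordinate, leftmost mode fastest."""
    coord = []
    for extent in shape:
        coord.append(index % extent)  # position within this mode
        index //= extent              # remainder goes to the next mode
    return tuple(coord)


for j in range(4):
    print(j, "=>", unflatten(j, (2, 2)))
# 0 => (0, 0)
# 1 => (1, 0)
# 2 => (0, 1)
# 3 => (1, 1)
```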

Flatten Second Index:-


(0,(0,0)) => (0,0) => 1*0 + 0*2 = 0 => (0,0)
(0,(1,0)) => (1,0) => 1*1 + 0*2 = 1 => (0,1)
(0,(0,1)) => (0,1) => 1*0 + 1*2 = 2 => (0,2)
(0,(1,1)) => (1,1) => 1*1 + 1*2 = 3 => (0,3)
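The flatten direction is the inverse: multiply each sub-index by the running product of the extents before it and add. A minimal sketch in plain Python, where `flatten` is a hypothetical helper name and the inner shape (2,2) comes from the example tensor:

```python
def flatten(coord, shape):
    """Collapse a coordinate into one index, leftmost mode fastest."""
    index, stride = 0, 1
    for c, extent in zip(coord, shape):
        index += c * stride
        stride *= extent
    return index


for inner in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(inner, "=>", flatten(inner, (2, 2)))
# (0, 0) => 0
# (1, 0) => 1
# (0, 1) => 2
# (1, 1) => 3
```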

For the tensor mentioned above, the different index representations map as follows:

1D   2-D     h-D
0    (0,0)   (0,(0,0))
1    (1,0)   (1,(0,0))
2    (2,0)   (2,(0,0))
3    (3,0)   (3,(0,0))
4    (0,1)   (0,(1,0))
5    (1,1)   (1,(1,0))
6    (2,1)   (2,(1,0))
7    (3,1)   (3,(1,0))
8    (0,2)   (0,(0,1))
9    (1,2)   (1,(0,1))
10   (2,2)   (2,(0,1))
11   (3,2)   (3,(0,1))
12   (0,3)   (0,(1,1))
13   (1,3)   (1,(1,1))
14   (2,3)   (2,(1,1))
15   (3,3)   (3,(1,1))
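The 1D / 2-D / h-D correspondence for the (4,(2,2)) tensor can be regenerated and checked with a short script. This is a sketch in plain Python under the assumption that the nested stride is (1,(4,8)); `unflatten` and `table_rows` are hypothetical helper names.

```python
def unflatten(index, shape):
    """Split a linear index into a coordinate, leftmost mode fastest."""
    coord = []
    for extent in shape:
        coord.append(index % extent)
        index //= extent
    return tuple(coord)


def table_rows():
    rows = []
    for linear in range(16):
        i, j = unflatten(linear, (4, 4))   # 2-D coordinate
        j0, j1 = unflatten(j, (2, 2))      # split the second mode
        # Sanity check against the assumed stride (1,(4,8)).
        assert i * 1 + j0 * 4 + j1 * 8 == linear
        rows.append((linear, (i, j), (i, (j0, j1))))
    return rows


for linear, flat2d, hier in table_rows():
    print(linear, flat2d, hier)
```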

I am trying to extend this to the hierarchical mapping for the matrix described below. I am referring to an example from a video and trying to understand how it works for the value 49.

1D => A[37] = 49

2D => Matrix shape (8,8) => stride (1,8). Hence A(5,4) => 5*1 + 4*8 = 37 => A[37]

h-D => ((1,2),(0,2)). I could not trace this back to linear index 37. My understanding is described below. Could you help me understand how to map this to the linear index, and also how to visualize (1,2) for the row part and (0,2) for the column part?

The (8,8) matrix is divided into 2 groups along the rows and 2 groups along the columns, which results in 4 matrices of size (4,4) each. The 1 and 0 from ((1,2),(0,2)) would then correspond to the (1,0) outer tile listed below, which does not contain 49. Hence I could not find a way to map this to the linear index.

What I did try was to map ((1,2),(0,2)) => (5,4). To do this I flattened it using the strides, though I am not sure this is the correct approach.

Taking the first two entries (2,2) of the tuple (2,2,2) and applying the same unflatten formula:

5%2 = 1 & 5/2 = 2 => (1,2)
4%2 = 0 & 4/2 = 2 => (0,2)
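One way to check the round trip is to treat the h-D coordinate as a nested tuple and take its dot product with a nested stride of the same structure. The sketch below is plain Python, not the CuTe API, and rests on an assumption that is not stated in the post: that each mode of the 8x8 column-major matrix is split as shape (2,4), giving the layout ((2,4),(2,4)) with stride ((1,2),(8,16)). The helper name `nested_dot` is hypothetical.

```python
def nested_dot(coord, stride):
    """Dot product of a nested coordinate with a same-structured stride."""
    if isinstance(coord, tuple):
        return sum(nested_dot(c, s) for c, s in zip(coord, stride))
    return coord * stride


# Assumed stride for shape ((2,4),(2,4)) over a column-major 8x8 matrix.
stride = ((1, 2), (8, 16))
hier = ((1, 2), (0, 2))

row = nested_dot(hier[0], (1, 2))  # 1*1 + 2*2 = 5
col = nested_dot(hier[1], (1, 2))  # 0*1 + 2*2 = 4
print(row, col)                    # 5 4
print(nested_dot(hier, stride))    # 1 + 4 + 0 + 32 = 37
```

Under this assumption the first entry of each pair is the fast (inner) coordinate and the second entry selects the group, so (1,2) reads as "offset 1 within group 2", i.e. row 5.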

Outer (Grids):-

(0,0) (0,1)
(1,0) (1,1)

Tile 1 (Top-Left) (0,0)

  • Rows: 0 to 3
  • Columns: 0 to 3

Tile 2 (Top-Right) (0,1)

  • Rows: 0 to 3
  • Columns: 4 to 7

Tile 3 (Bottom-Left)(1,0)

  • Rows: 4 to 7
  • Columns: 0 to 3

Tile 4 (Bottom-Right)(1,1)

  • Rows: 4 to 7
  • Columns: 4 to 7
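To see which of the four outer tiles holds a given element, the 2-D coordinate can be integer-divided by the tile edge. This is a minimal plain-Python sketch assuming the 4x4 tiling listed above; `outer_tile` and `tile_ranges` are hypothetical helper names.

```python
TILE = 4  # edge length of each outer tile in the 8x8 matrix


def outer_tile(row, col, tile=TILE):
    """Return the (tile_row, tile_col) that contains element (row, col)."""
    return (row // tile, col // tile)


def tile_ranges(tile_coord, tile=TILE):
    """Return the inclusive (row range, column range) covered by a tile."""
    tr, tc = tile_coord
    return ((tr * tile, tr * tile + tile - 1),
            (tc * tile, tc * tile + tile - 1))


t = outer_tile(5, 4)
print(t)               # (1, 1)
print(tile_ranges(t))  # ((4, 7), (4, 7))
```

With this tiling, element (5,4) falls in Tile 4 (Bottom-Right) rather than in tile (1,0).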

Inner (Thread Blocks):-

Each outer tile is further broken into 2 groups in rows and 2 groups in columns, resulting in four 2*2 matrices.

Threads

Each inner tile is further broken into 2 groups in rows and 2 groups in columns, resulting in four (1*1) matrices.
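The three levels described above (outer tile, inner tile, single element) can be expressed by splitting each of the row and column indexes with extents (2,2,2). This is a plain-Python sketch under assumptions that are not from the post: the element-within-inner-tile position is taken as the fastest entry, and the column-major strides are (1,2,4) for rows and (8,16,32) for columns. `split_levels` is a hypothetical helper name.

```python
def split_levels(index, extents=(2, 2, 2)):
    """Split an index into (element, inner tile, outer tile) positions."""
    coord = []
    for extent in extents:
        coord.append(index % extent)
        index //= extent
    return tuple(coord)


ROW_STRIDES = (1, 2, 4)    # assumed: rows are contiguous (column major)
COL_STRIDES = (8, 16, 32)  # assumed: each column step skips 8 elements

row_levels = split_levels(5)
col_levels = split_levels(4)
print(row_levels)  # (1, 0, 1)
print(col_levels)  # (0, 0, 1)

linear = (sum(c * s for c, s in zip(row_levels, ROW_STRIDES))
          + sum(c * s for c, s in zip(col_levels, COL_STRIDES)))
print(linear)      # 37
```

The last entry of each triple is the outer-tile position, which again gives outer tile (1,1) for element (5,4).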

Thanks & Regards
Santosh Varada

I suggest moving your question to the CUTLASS GitHub, where engineering is more active.