Confused about (a,b)layout on mma.sync instructions

leclerc1 · March 20, 2024, 1:44pm

Hello,

I was wondering about the .alayout and .blayout modifiers on mma instructions. The layout seems to be very strict and described in sections 9.7.13.4.[1-13] of the PTX ISA specification. Could I get some information on what would changing the layout would do ? I’ve only ever seen .row.col

In addition are there examples of full programs that use the .mma instructions just to double check that I parsed the specification properly ?

Thanks a lot

Robert_Crovella · March 20, 2024, 3:31pm

I don’t think there is a repository of sample codes that use/demonstrate each and every variant of each PTX mma instruction. CUTLASS tends to use them to some degree, and with a bit of searching you can find a few examples here on these forums.

The .row and .col refer to row-major and column-major indexing/storage order.

From here:

The qualifiers .alayout and .blayout must match the layout specified on the wmma.load instructions that produce the contents of operands a and b respectively

Referring to the referenced load instructions:

The .layout qualifier indicates whether the matrix to be loaded is stored in row-major or column-major format.

Changing the .layout on the load instruction would change the expected memory storage order that the instruction would use as it is populating registers. And the .alayout etc. must match what was used in the preceding load instruction.

leclerc1 · March 20, 2024, 3:43pm

Thanks for your answer! I guess what I’m mostly confused is that to me, row-major and column-major make sense to describe how a matrix is stored in a contiguous memory region. For those instructions, the inputs are expected to be in registers spread over the threads in the warp and which register contain what part of the matrix is strictly specified in the sections above.

For it to make sense to me, there should be a duplicated section in the spec for each section 9.7.13.4.X, one that for row and one for column major.

Put it another way: if I follow the exact layout described in 9.7.13.4.8 for example, and use mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32. Where is the layout I should use if I were to call mma.sync.aligned.m16n8k16.col.col.f32.f16.f16.f32 instead ?

Robert_Crovella · March 20, 2024, 3:57pm

For the wmma instructions I think the information makes sense. For the mma instructions I can’t really explain why it would be needed.

The mma instructions have a visible register layout. The wmma instructions do not. The mma instructions have this:

specified explicitly, the wmma instructions do not.

The load instruction I referred to (which the .alayout desicription refers back to), pulls data from memory into an opaque register struction. So memory storage order is clearly important.

As to why the wmma instruction also needs this knowledge, I cannot say exactly, but its entirely possible that the opaque register storage structure changes depending on the specified memory storage order, and thefore the wmma.mma instruction would need to know that.

leclerc1 · March 20, 2024, 4:03pm

I understand that wmma doesn’t have a visible layout and I’m not talking about those here.

I also understand why a load (ldmatrix operation) would need the .alayout and .blayout modifiers to properly place the elements in registers.

What I don’t understand is why the multiplication part (mma operation), which operates from registers that I could have loaded or computed myself, needs the alayout and blayout.

More concretely

.reg .b32 %Ra, %Rb, %Rc<2>, %Rd<2>;

//  do calculations to fill registers allocated above (eg generage them randomly or init from constants)

mma.sync.aligned.m8n8k16.row.col.satfinite.s32.s8.u8.s32
  {%Rd0, %Rd1},
  {%Ra},
  {%Rb},
  {%Rc0, %Rc1};

What should change in this code if I swaped row for col ?
There is no load here and no contiguous memory location so row or column major order doesn’t quite makes sense to me

Robert_Crovella · March 20, 2024, 8:45pm

When I attempt to change either .row to .col or .col to .row in the complete code from here, I get a PTX error, basically saying that is an illegal choice or an illegal instruction.

ptxas /tmp/tmpxft_00040ec9_00000000-6_t153.ptx, line 105; error   : Illegal alayout '.col' for instruction 'mma'`

So the question seems moot to me.

leclerc1 · March 20, 2024, 8:50pm

I’m trying to understand the specification itself, which mentions two modifiers .row and .col which behavior/implication isn’t well defined (and in the current implementation illegal to change as you seem to suggest), so I wouldn’t say the question is moot. If what you are saying is correct I think the document would needs an update.

Moreover, I think it’s ok if the implementation doesn’t support a feature of the language itself (like for wmma, only one order is supported). However, here, I’m not even clear what those would do in an ideal implementation of the specification as I don’t understand the meaning of row/major order for a register layout.

Robert_Crovella · March 20, 2024, 8:59pm

Anyone can file a bug at any time to request document clarifications.

Curefab · March 23, 2024, 10:51pm

Exchanging row and col would access the data and combine matrices in a matrix multiplication in a totally different way. That is not how the architecture of SMs are made.
I for my part do not understand mma in terms of rows and cols, but in terms of dimension which is added over and (two) independent dimensions, which are just calculated in parallel.

Topic		Replies	Views
How to use mma instructions to mul matrix in col major CUDA Programming and Performance	6	279	December 7, 2024
Does wmma::load_matrix_sync() or wmma::mma_sync() performance benefit from row-major or col-major layouts? CUDA Programming and Performance cuda	1	108	November 15, 2025
Understand the mma instruction in PTX CUDA Programming and Performance	5	1189	June 12, 2024
What does the "row" and "col" mean in mma.sync.aligned.m16n8k4.row.col.f32.tf32.tf32.f32 CUDA Programming and Performance	0	268	October 30, 2023
Address out of bounds when using mma instructions CUDA Programming and Performance cuda	8	281	June 11, 2024
Questions about mma instruction with Nvidia ptx CUDA Programming and Performance cuda	1	202	July 15, 2024
Fastest Tiled WMMA for Matrices of Any Size? CUDA Programming and Performance	3	393	October 26, 2024
Clarification on the accumulator layout in an mma instruction CUDA Programming and Performance	2	343	November 30, 2023
Alignment requirement for the `ldmatrix` instruction CUDA Programming and Performance cuda , instruction	3	216	November 1, 2024
The fragment layout of multiplicand A is not clear in mma.sp.sync.aligned.m16n8k32 when data type is fp16/bf16 CUDA Programming and Performance	2	699	February 22, 2022

Confused about (a,b)layout on mma.sync instructions

Related topics