Confused about (a,b)layout on mma.sync instructions


I was wondering about the .alayout and .blayout modifiers on mma instructions. The layout seems to be very strict and described in sections[1-13] of the PTX ISA specification. Could I get some information on what would changing the layout would do ? I’ve only ever seen .row.col

In addition are there examples of full programs that use the .mma instructions just to double check that I parsed the specification properly ?

Thanks a lot

I don’t think there is a repository of sample codes that use/demonstrate each and every variant of each PTX mma instruction. CUTLASS tends to use them to some degree, and with a bit of searching you can find a few examples here on these forums.

The .row and .col refer to row-major and column-major indexing/storage order.

From here:

The qualifiers .alayout and .blayout must match the layout specified on the wmma.load instructions that produce the contents of operands a and b respectively

Referring to the referenced load instructions:

The .layout qualifier indicates whether the matrix to be loaded is stored in row-major or column-major format.

Changing the .layout on the load instruction would change the expected memory storage order that the instruction would use as it is populating registers. And the .alayout etc. must match what was used in the preceding load instruction.

Thanks for your answer! I guess what I’m mostly confused is that to me, row-major and column-major make sense to describe how a matrix is stored in a contiguous memory region. For those instructions, the inputs are expected to be in registers spread over the threads in the warp and which register contain what part of the matrix is strictly specified in the sections above.

For it to make sense to me, there should be a duplicated section in the spec for each section, one that for row and one for column major.

Put it another way: if I follow the exact layout described in for example, and use mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32. Where is the layout I should use if I were to call mma.sync.aligned.m16n8k16.col.col.f32.f16.f16.f32 instead ?

For the wmma instructions I think the information makes sense. For the mma instructions I can’t really explain why it would be needed.

The mma instructions have a visible register layout. The wmma instructions do not. The mma instructions have this:

specified explicitly, the wmma instructions do not.

The load instruction I referred to (which the .alayout desicription refers back to), pulls data from memory into an opaque register struction. So memory storage order is clearly important.

As to why the wmma instruction also needs this knowledge, I cannot say exactly, but its entirely possible that the opaque register storage structure changes depending on the specified memory storage order, and thefore the instruction would need to know that.

I understand that wmma doesn’t have a visible layout and I’m not talking about those here.

I also understand why a load (ldmatrix operation) would need the .alayout and .blayout modifiers to properly place the elements in registers.

What I don’t understand is why the multiplication part (mma operation), which operates from registers that I could have loaded or computed myself, needs the alayout and blayout.

More concretely

.reg .b32 %Ra, %Rb, %Rc<2>, %Rd<2>;

//  do calculations to fill registers allocated above (eg generage them randomly or init from constants)

  {%Rd0, %Rd1},
  {%Rc0, %Rc1};

What should change in this code if I swaped row for col ?
There is no load here and no contiguous memory location so row or column major order doesn’t quite makes sense to me

When I attempt to change either .row to .col or .col to .row in the complete code from here, I get a PTX error, basically saying that is an illegal choice or an illegal instruction.

ptxas /tmp/tmpxft_00040ec9_00000000-6_t153.ptx, line 105; error   : Illegal alayout '.col' for instruction 'mma'`

So the question seems moot to me.

I’m trying to understand the specification itself, which mentions two modifiers .row and .col which behavior/implication isn’t well defined (and in the current implementation illegal to change as you seem to suggest), so I wouldn’t say the question is moot. If what you are saying is correct I think the document would needs an update.

Moreover, I think it’s ok if the implementation doesn’t support a feature of the language itself (like for wmma, only one order is supported). However, here, I’m not even clear what those would do in an ideal implementation of the specification as I don’t understand the meaning of row/major order for a register layout.

Anyone can file a bug at any time to request document clarifications.

Exchanging row and col would access the data and combine matrices in a matrix multiplication in a totally different way. That is not how the architecture of SMs are made.
I for my part do not understand mma in terms of rows and cols, but in terms of dimension which is added over and (two) independent dimensions, which are just calculated in parallel.