Order of registers in MMA calls


We are porting some algorithms to use the Tensor Cores BMMA.

The tensor core mma fragments have a specific layout in registers. Documented for example here.

The documentation states that the A fragments are placed in registers as (citing):

A vector expression containing four .b32 registers

I could not find any details about vector expressions, e.g. what the required properties are.
Question1: Is there any documentation as to what a “vector expression” is?

In order to understand the motivation, here are some details:

As far as I understand, matrix fragments for 16x8x256 mma need to be loaded by hand. That works for us, and we get the expected results. However the documented fragment layout is a bit unfortunate for our later processing. Thus we would like to permute the input and output registers, in order to get rid of a separate distribution of the outputs over the threads. Right now the results are taking a round trip through shared memory.

Question2: Given the example from this post, could we do e.g.

      "mma.sync.aligned.m16n8k256.row.col.s32.b1.b1.s32.xor.popc "
      "{%0,%1,%2,%3}, {%4,%5, %6, %7}, {%8, %9}, {%10,%11,%12,%13};\n"
      : "=r"(D[2]), "=r"(D[3]), "=r"(D[0]), "=r"(D[1])   // 0 1 2 3 -> 2 3 0 1
      : "r"(A[3]), "r"(A[2]), "r"(A[1]), "r"(A[0]),      // 0 1 2 3 -> 3 2 1 0
        "r"(B[0]), "r"(B[1]),
        "r"(C[3]), "r"(C[1]), "r"(C[2]), "r"(C[0])       // 0 1 2 3 -> 3 1 2 0

It seems to work, however I am wary of correctness and performance considerations. Neither seem to be explained in the documentation.

(Bonus Question: Could one go as far as placing the same all-zero register for C four times? Would save 3 registers per thread.)

I’m not aware of any explicit documentation defining “vector expression”, however there are numerous mentions of vectors in the ptx guide.

Probably the most relevant section is here, from which we can surmise that a PTX vector expression is either:

For example, this is evidently a vector expression:

{%4, %5, %6, %7} 

in the context of inline PTX.

You can always file a bug to request doc clarifications.

I’m not aware of any issues there. The ptxas tool should flag illegal usage or combinations.

Again, no issues that I am aware of. The inline PTX manual specifically states that this is acceptable:

You can also repeat a reference, e.g.:

asm("add.s32 %0, %1, %1;" : "=r"(i) : "r"(k));

In all of this, it may be useful to keep in mind that PTX applies to a virtual machine model. The actual machine (i.e. SASS) certainly has a notion of vector registers also, and I believe that in that case, the SASS instruction operands must align to a linear register order for the base registers that “compose” a vector register. But SASS is not documented to this level, so that is just my conjecture based on observations. The key point is that a “seemingly” unordered set of registers in a PTX vector expression is probably meaningless (my guess). When the ptxas tool compiles your PTX code to SASS, it is probably going to use actual machine registers of its own choosing, and it may well do some amount of register “swizzling” (i.e. explicit data exchanges) to make everything work. Again, a purely observational notion of mine.

Thank you for the insights. Good point about the disconnect between PTX and SASS. The generated SASS indeed contains some extra annotations about those registers. It even seems to notice that C is always zero and puts the special zero register there. It also overwrites the input matrix A with the output if I am not mistaken:
BMMA.168256.XOR.POPC R8, R8.ROW, R16.COL, RZ

I guess this is where diving deeper does becomes complicated. I will stop here :)