The HMMA.884 tensor core instruction seems not match with its cuda warp-level mma instruction

I’ve seen How does 4x4 mma at tensor core level translate to 16x16 mma at warp level?, but I still have confusions.

Specifically, two HMMA.884 will indeed calculate 8x8x8 matrix(in pure FP16 precision). However, if you need 16x16x16 matrix, you must perform 8 of 8x8x8 operation instead of 4. For that reason, there seems to should have 2x8=16 HMMA.884 instructions, but it only generates 8 HHMA.884 instructions, why?

Perhaps the code is for Turing after all? Or for mixed precision vs. for pure FP16?

The numbers of steps were reduced from 0…3 to 0…1:

Yes I’ve read the paper, and my guess is that for fp16 mode each HMMA.884 actually calculates 8x8x8 according to the figure in the paper, and the 884 suffix is for mixed precision. But I don’t find any other source to support it.

An HMMA 884 op on sm_70 actually does four m=8,n=8,k=4 multiply ops. As far as I know this is a fairly unique variant of mma ops, and was fairly unique to the sm_70 (volta) architecture. Later architectures maintain binary compatibility for the op, but do not necessarily execute it in precisely the same way, and I believe somewhere in the PTX guide you can find a recommendation about using it on sm_70 but not elsewhere, performance wise:

Note

mma.sync.m8n8k4 is optimized for target architecture sm_70 and may have substantially reduced performance on other target architectures.

in the previously linked forum post, I was trying to explain the decomposition of the C++ op into SASS operations, not necessarily precisely how the results are constructed (and I still don’t intend to explain the precise construction of the result.)

Another way to look at it is by comparing the implied number of FMA operations:

16x16x16 = 4096 FMA ops

each 884 op is 256 FMA ops. Four of them would be 1024 ops. So four part1/part2 SASS sequences must be enough to get the needed 4096 ops.

I think that mma.m8n8k4 instruction is at PTX-level instead of SASS? I am not sure how many HMMA instructions a mma.m8n8k4 instruction is translated into.

You have tools available to you to investigate that, if you choose. Even if you have no machine or compiler installed yourself, you can answer that question with an appropriate amount of work on the godbolt compiler-explorer. Or see here.

I have done this and have concluded that the conversion from PTX to SASS is completely consistent with my previous treatment in this thread, that is, it is what you would predict based on my comparison of the necessary FMA ops between the operation itself and what we see in the SASS.