The HMMA.884 tensor core instruction seems not match with its cuda warp-level mma instruction

out.reacher · August 21, 2024, 6:42am

I’ve seen How does 4x4 mma at tensor core level translate to 16x16 mma at warp level?, but I still have confusions.

Specifically, two HMMA.884 will indeed calculate 8x8x8 matrix(in pure FP16 precision). However, if you need 16x16x16 matrix, you must perform 8 of 8x8x8 operation instead of 4. For that reason, there seems to should have 2x8=16 HMMA.884 instructions, but it only generates 8 HHMA.884 instructions, why?

Curefab · August 21, 2024, 10:46am

Perhaps the code is for Turing after all? Or for mixed precision vs. for pure FP16?

The numbers of steps were reduced from 0…3 to 0…1:

out.reacher · August 21, 2024, 3:56pm

Yes I’ve read the paper, and my guess is that for fp16 mode each HMMA.884 actually calculates 8x8x8 according to the figure in the paper, and the 884 suffix is for mixed precision. But I don’t find any other source to support it.

Robert_Crovella · August 21, 2024, 4:12pm

An HMMA 884 op on sm_70 actually does four m=8,n=8,k=4 multiply ops. As far as I know this is a fairly unique variant of mma ops, and was fairly unique to the sm_70 (volta) architecture. Later architectures maintain binary compatibility for the op, but do not necessarily execute it in precisely the same way, and I believe somewhere in the PTX guide you can find a recommendation about using it on sm_70 but not elsewhere, performance wise:

Note

mma.sync.m8n8k4 is optimized for target architecture sm_70 and may have substantially reduced performance on other target architectures.

in the previously linked forum post, I was trying to explain the decomposition of the C++ op into SASS operations, not necessarily precisely how the results are constructed (and I still don’t intend to explain the precise construction of the result.)

Another way to look at it is by comparing the implied number of FMA operations:

16x16x16 = 4096 FMA ops

each 884 op is 256 FMA ops. Four of them would be 1024 ops. So four part1/part2 SASS sequences must be enough to get the needed 4096 ops.

out.reacher · August 22, 2024, 3:23am

I think that mma.m8n8k4 instruction is at PTX-level instead of SASS? I am not sure how many HMMA instructions a mma.m8n8k4 instruction is translated into.

Robert_Crovella · August 22, 2024, 1:36pm

You have tools available to you to investigate that, if you choose. Even if you have no machine or compiler installed yourself, you can answer that question with an appropriate amount of work on the godbolt compiler-explorer. Or see here.

I have done this and have concluded that the conversion from PTX to SASS is completely consistent with my previous treatment in this thread, that is, it is what you would predict based on my comparison of the necessary FMA ops between the operation itself and what we see in the SASS.

Topic		Replies	Views
How does 4x4 mma at tensor core level translate to 16x16 mma at warp level? CUDA Programming and Performance cuda	2	1013	November 15, 2023
Mma m8n8k4 on A100 CUDA Programming and Performance	10	166	November 14, 2024
How to choose from so many `mma.mnk*` instructions? CUDA Programming and Performance	3	313	June 19, 2024
Why does WMMA and MMA support different matrix tile size? CUDA Programming and Performance	2	2004	October 28, 2023
Questions about mma instruction with Nvidia ptx CUDA Programming and Performance cuda	1	117	July 15, 2024
How does it compute exactly in Tensor Core? CUDA Programming and Performance	10	935	August 22, 2024
Turing 16x16 MMA, SM usage, 1 or 2? CUDA Programming and Performance	2	1024	December 8, 2018
Direct access to Volta HMMA instruction CUDA Programming and Performance	9	5248	December 19, 2017
About tensor core's flops/clk and wmma shape? CUDA Programming and Performance	1	968	October 22, 2023
Question about wgmma instruction in Hopper CUDA Programming and Performance	3	810	October 25, 2024

The HMMA.884 tensor core instruction seems not match with its cuda warp-level mma instruction

Related topics