Question about wgmma instruction in Hopper

Hello, I have several questions about the wgmma instruction.

The first question is: when a wgmma instruction runs on a warp group, are the 4 warps executed in parallel on 4 tensor cores, or serially on 1 tensor core? How do the warps in the warp group map to the tensor cores? I guess it’s the former?

The second question is: how many cycles does a wgmma instruction take? Take m64n8k16.f16.f16.f16 as an example: the total MAC count is m x n x k = 8192, each tensor core can perform m8n4k16 = 512 FP16 MACs per cycle, and there are 4 tensor cores in an SM, so m64n8k16 takes 8192 / (512 x 4) = 4 cycles. Is that right?

The third question is about shared memory bandwidth. I found that the shared memory bandwidth is 128 bytes per cycle in Hopper. If the wgmma (let’s say m64n8k16.f16.f16.f16) is running on 4 tensor cores in parallel, and the input matrices A and B both come from shared memory, then the size of A + B is (64x16 + 16x8) x sizeof(fp16) = 2304 bytes, so it needs 2304 / 128 = 18 cycles to read A and B from shared memory into the tensor cores, and shared memory becomes the bottleneck. Could someone explain where I am wrong?
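(To make the arithmetic in the second and third questions explicit, here is a minimal host-side sketch; the figures of 512 FP16 MACs per tensor core per cycle, 4 tensor cores per SM, and 128 bytes per cycle of shared memory bandwidth are the assumptions stated above, not documented numbers.)

#include <cstdio>

int main() {
    // Shape m64n8k16 with fp16 operands, per the questions above.
    const long M = 64, N = 8, K = 16;
    const long macs_per_tc_per_cycle = 512;  // assumed m8n4k16 per tensor core per cycle
    const long tensor_cores_per_sm   = 4;    // assumed
    const long smem_bytes_per_cycle  = 128;  // assumed Hopper shared memory bandwidth

    long total_macs     = M * N * K;                                                  // 8192
    long compute_cycles = total_macs / (macs_per_tc_per_cycle * tensor_cores_per_sm); // 4

    long ab_bytes    = (M * K + K * N) * 2;             // sizeof(fp16) = 2 -> 2304 bytes
    long smem_cycles = ab_bytes / smem_bytes_per_cycle; // 18

    printf("compute cycles: %ld, shared memory cycles: %ld\n", compute_cycles, smem_cycles);
    return 0;
}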

The last question is: in the wgmma example from the PTX doc, the shape of matrix D should be MxN, which is 64x8. How can the 4 32-bit registers per thread ({f32d0, f32d1, f32d2, f32d3}) represent a 64x8 matrix? And how can {f16a0, f16a1, f16a2, f16a3} represent the A matrix?

wgmma.mma_async.sync.aligned.m64n8k16.f32.f16.f16
  {f32d0, f32d1, f32d2, f32d3},
  {f16a0, f16a1, f16a2, f16a3},
  descB,
  1, -1, -1, 1;

I think the safe assumption is that there is no particular order of assignment of warps to SMSPs. AFAIK it is unspecified by NVIDIA. Yes, there are some who will claim that consecutive warps are distributed to consecutive SMSPs. I’ve never seen that documented, and I can think of cases where it might not make sense. Finally, I know of no way to verify it conclusively. In trivial cases it is probably safe to assume that consecutive warps get assigned to consecutive SMSPs in a round-robin fashion (otherwise the machine would be foolishly crippling itself). Since that is probably commonly the case, and I have no way to disprove it, it may well be always the case, but AFAIK it is undocumented. I don’t know of any documented statement suggesting that the pattern is disturbed depending on whether the warps in question have wgmma ops somewhere in their instruction stream.

AFAIK, NVIDIA does not document the elapsed cycles for any (PTX or) SASS instruction.

I don’t know that you are wrong. The MMA instructions (WMMA, MMA, WGMMA) all ultimately use a patch of registers in each thread for operands. To get maximum throughput from Tensorcore (TC) ops, it seems self-evident that some reuse of operands is necessary; neither the shared memory system nor the global memory system has enough bandwidth to deliver completely fresh operands for each tile of a matrix multiply, considered in the general case for large matrices. So looking at a single MMA op and discovering that there is not enough bandwidth, either from shared or global memory, to feed it at the full advertised TC rate is not a surprising conclusion. Fortunately, large matrix-multiply ops, decomposed into tiles, offer substantial opportunity for operand reuse.
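To illustrate the reuse point with numbers (a sketch using hypothetical CTA tile sizes, not taken from any particular kernel): for an fp16 tile of BM x BN outputs accumulated over a K-slice, the shared memory traffic grows like (BM + BN) x K bytes while the MAC count grows like BM x BN x K, so the MACs-per-byte ratio improves as the tile gets larger, and the single-wgmma case above is the worst case.

#include <cstdio>

// MACs per byte of shared memory traffic for an fp16 tile of BM x BN outputs
// accumulated over KS; the tile sizes below are hypothetical, for illustration only.
double macs_per_byte(long BM, long BN, long KS) {
    long macs  = BM * BN * KS;
    long bytes = (BM * KS + KS * BN) * 2;  // A tile + B tile, sizeof(fp16) = 2
    return (double)macs / (double)bytes;
}

int main() {
    printf("64x8 tile    : %.1f MACs/byte\n", macs_per_byte(64, 8, 16));    // the single wgmma above
    printf("64x64 tile   : %.1f MACs/byte\n", macs_per_byte(64, 64, 16));
    printf("128x128 tile : %.1f MACs/byte\n", macs_per_byte(128, 128, 16));
    return 0;
}

With the larger tiles, each operand element read from shared memory participates in many more MACs; that reuse is what allows the tensor cores to be fed at a rate the raw shared memory bandwidth could not sustain for a single isolated op.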

A 64x8 matrix (for that op, which specifies 32-bit result elements) has 512 32-bit elements. The four 32-bit registers per thread, taken over the four warps of the warp group (128 threads), also constitute 512 32-bit elements: 64x8 = 4x32x4.
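The element counting for both D and A can be written out explicitly (a small sketch that only checks the counts; the actual per-thread fragment layout is given by the figures in the PTX ISA documentation):

#include <cassert>

int main() {
    const int warps = 4, lanes = 32;      // one warp group = 4 warps = 128 threads

    // D: 4 x 32-bit (f32) registers per thread, across the whole warp group.
    int d_elems = 4 * lanes * warps;      // 512 f32 values
    assert(d_elems == 64 * 8);            // exactly the 64x8 D tile

    // A: 4 x 32-bit registers per thread, each holding two packed f16 values.
    int a_elems = 4 * 2 * lanes * warps;  // 1024 f16 values
    assert(a_elems == 64 * 16);           // exactly the 64x16 A tile

    return 0;
}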


There is some information in Table VIII here that may be of interest.

Thanks for your great reply!