Question about wgmma instruction in Hopper

Hello, I have several questions about the wgmma instruction.

The first question is: when a wgmma instruction runs on a warp group, are the 4 warps executed in parallel on 4 tensor cores, or serially on 1 tensor core? How do the warps in the warp group map to the tensor cores? I guess it’s the former?

The second question is: how many cycles does a wgmma instruction take? Take m64n8k16.f16.f16.f16 as an example: the total MAC count is m x n x k = 8192, each tensor core can perform m8n4k16 = 512 FP16 MACs per cycle, and there are 4 tensor cores in an SM, so m64n8k16 takes 8192 / (512 x 4) = 4 cycles. Is that right?

The third question is about shared memory bandwidth. I found that the shared memory bandwidth is 128 bytes per cycle in Hopper. If the wgmma (let’s say m64n8k16.f16.f16.f16) is running on 4 tensor cores in parallel, and the input matrices A and B both come from shared memory, then the size of A + B is (64x16 + 16x8) x sizeof(fp16) = 2304 bytes, so it needs 2304 / 128 = 18 cycles to read A and B from shared memory into the tensor cores, and shared memory becomes the bottleneck. Could someone explain where I am wrong?
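(To make the arithmetic in the second and third questions explicit, here is a minimal host-side sketch; the figures of 512 FP16 MACs per tensor core per cycle, 4 tensor cores per SM, and 128 bytes per cycle of shared memory bandwidth are the assumptions stated above, not documented numbers.)

#include <cstdio>

int main() {
    // Shape m64n8k16 with fp16 operands, per the questions above.
    const long M = 64, N = 8, K = 16;
    const long macs_per_tc_per_cycle = 512;  // assumed m8n4k16 per tensor core per cycle
    const long tensor_cores_per_sm   = 4;    // assumed
    const long smem_bytes_per_cycle  = 128;  // assumed Hopper shared memory bandwidth

    long total_macs     = M * N * K;                                                  // 8192
    long compute_cycles = total_macs / (macs_per_tc_per_cycle * tensor_cores_per_sm); // 4

    long ab_bytes    = (M * K + K * N) * 2;             // sizeof(fp16) = 2 -> 2304 bytes
    long smem_cycles = ab_bytes / smem_bytes_per_cycle; // 18

    printf("compute cycles: %ld, shared memory cycles: %ld\n", compute_cycles, smem_cycles);
    return 0;
}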

The last question is: in the wgmma example from the PTX doc, the shape of matrix D should be MxN, which is 64x8. How can the 4 32-bit registers per thread ({f32d0, f32d1, f32d2, f32d3}) represent a 64x8 matrix? And how can {f16a0, f16a1, f16a2, f16a3} represent the A matrix?

wgmma.mma_async.sync.aligned.m64n8k16.f32.f16.f16
  {f32d0, f32d1, f32d2, f32d3},
  {f16a0, f16a1, f16a2, f16a3},
  descB,
  1, -1, -1, 1;

I think the safe assumption is that there is no particular order of assignment of warps to SMSPs. AFAIK it is unspecified by NVIDIA. Yes, there are some who will claim that consecutive warps are distributed to consecutive SMSPs. I’ve never seen that documented, and I can think of cases where it might not make sense. Finally, I know of no way to verify it conclusively. In trivial cases it is probably safe to assume that consecutive warps get assigned to consecutive SMSPs in a round-robin fashion (otherwise the machine would be foolishly crippling itself). Since that is probably commonly the case, and I have no way to disprove it, it may well be always the case, but AFAIK it is undocumented. I don’t know of any documented statement suggesting that the pattern is disturbed depending on whether the warps in question have wgmma ops somewhere in their instruction stream.

AFAIK, NVIDIA does not document the elapsed cycles for any (PTX or) SASS instruction.

I don’t know that you are wrong. The MMA instructions (WMMA, MMA, WGMMA) all ultimately use a patch of registers in each thread for operands. To get maximum throughput from Tensorcore (TC) ops, it seems self-evident that some reuse of operands is necessary; neither the shared memory system nor the global memory system has enough bandwidth to deliver completely fresh operands for each tile of a matrix multiply, considered in the general case for large matrices. So looking at a single MMA op and discovering that there is not enough bandwidth, either from shared or global memory, to feed it at the full advertised TC rate is not a surprising conclusion. Fortunately, large matrix-multiply ops, decomposed into tiles, offer substantial opportunity for operand reuse.
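To illustrate the reuse point with numbers (a sketch using hypothetical CTA tile sizes, not taken from any particular kernel): for an fp16 tile of BM x BN outputs accumulated over a K-slice, the shared memory traffic grows like (BM + BN) x K bytes while the MAC count grows like BM x BN x K, so the MACs-per-byte ratio improves as the tile gets larger, and the single-wgmma case above is the worst case.

#include <cstdio>

// MACs per byte of shared memory traffic for an fp16 tile of BM x BN outputs
// accumulated over KS; the tile sizes below are hypothetical, for illustration only.
double macs_per_byte(long BM, long BN, long KS) {
    long macs  = BM * BN * KS;
    long bytes = (BM * KS + KS * BN) * 2;  // A tile + B tile, sizeof(fp16) = 2
    return (double)macs / (double)bytes;
}

int main() {
    printf("64x8 tile    : %.1f MACs/byte\n", macs_per_byte(64, 8, 16));    // the single wgmma above
    printf("64x64 tile   : %.1f MACs/byte\n", macs_per_byte(64, 64, 16));
    printf("128x128 tile : %.1f MACs/byte\n", macs_per_byte(128, 128, 16));
    return 0;
}

With the larger tiles, each operand element read from shared memory participates in many more MACs; that reuse is what allows the tensor cores to be fed at a rate the raw shared memory bandwidth could not sustain for a single isolated op.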

A 64x8 matrix (for that op, which specifies 32-bit result elements) has 512 32-bit elements. The four 32-bit registers per thread, taken over the four warps of the warp group (128 threads), also constitute 512 32-bit elements: 64x8 = 4x32x4.
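The element counting for both D and A can be written out explicitly (a small sketch that only checks the counts; the actual per-thread fragment layout is given by the figures in the PTX ISA documentation):

#include <cassert>

int main() {
    const int warps = 4, lanes = 32;      // one warp group = 4 warps = 128 threads

    // D: 4 x 32-bit (f32) registers per thread, across the whole warp group.
    int d_elems = 4 * lanes * warps;      // 512 f32 values
    assert(d_elems == 64 * 8);            // exactly the 64x8 D tile

    // A: 4 x 32-bit registers per thread, each holding two packed f16 values.
    int a_elems = 4 * 2 * lanes * warps;  // 1024 f16 values
    assert(a_elems == 64 * 16);           // exactly the 64x16 A tile

    return 0;
}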


There is some information in Table VIII here that may be of interest.

Thanks for your great reply!