Hello, I have several questions about the wgmma instruction.
The first question is: when a wgmma instruction runs on a warpgroup, are the 4 warps executed in parallel on 4 tensor cores, or serially on 1 tensor core? How do the warps in the warpgroup map to the tensor cores? I guess it's the former?
The second question is: how many cycles does a wgmma instruction take? Take m64n8k16.f16.f16.f16 as an example: the total MAC count is M x N x K = 64 x 8 x 16 = 8192, each tensor core can perform m8n4k16 = 512 FP16 MACs per cycle, and there are 4 tensor cores in an SM, so m64n8k16 should take 8192 / (512 x 4) = 4 cycles. Is that right?
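Here is my back-of-the-envelope arithmetic spelled out as a tiny C++ program, just so it's clear what I'm computing; the 512-MACs-per-cycle-per-tensor-core and 4-tensor-cores-per-SM figures are my own assumptions, not numbers I found in the docs:

#include <cstdio>

int main() {
    const long long macs_total  = 64LL * 8 * 16;  // M*N*K for m64n8k16 = 8192 MACs
    const long long macs_per_tc = 8LL * 4 * 16;   // assumed m8n4k16 = 512 FP16 MACs/cycle
    const int       tc_per_sm   = 4;              // assumed tensor cores per SM
    // ideal cycles if all 4 tensor cores work on the tile in parallel
    printf("ideal cycles = %lld\n", macs_total / (macs_per_tc * tc_per_sm));  // prints 4
    return 0;
}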
The third question is about shared memory bandwidth. I found that the shared memory bandwidth on Hopper is 128 bytes per cycle. If the wgmma (say m64n8k16.f16.f16.f16) runs on 4 tensor cores in parallel and both input matrices A and B come from shared memory, then the size of A + B is (64x16 + 16x8) x sizeof(fp16) = 2304 bytes, so it needs 2304 / 128 = 18 cycles to read A and B from shared memory into the tensor cores, and shared memory would become the bottleneck. Could someone explain where I am wrong?
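The same estimate as a sketch, so the assumptions are explicit (128 bytes/cycle is the figure I found, and the idea that A and B are both read in full from shared memory for every single wgmma is my assumption):

#include <cstdio>

int main() {
    const int bytes_a = 64 * 16 * 2;   // A: 64x16 fp16 = 2048 bytes
    const int bytes_b = 16 * 8 * 2;    // B: 16x8 fp16 = 256 bytes
    const int smem_bw = 128;           // assumed shared memory bytes per cycle
    // cycles just to stream A and B out of shared memory once
    printf("read cycles = %d\n", (bytes_a + bytes_b) / smem_bw);  // prints 18
    return 0;
}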
The last question is about the wgmma example in the PTX doc (quoted below). The shape of matrix D should be M x N, i.e. 64x8, so how can the 4 32-bit registers ({f32d0, f32d1, f32d2, f32d3}) represent a 64x8 matrix? And how can {f16a0, f16a1, f16a2, f16a3} represent the A matrix? My rough per-thread arithmetic is after the snippet.
wgmma.mma_async.sync.aligned.m64n8k16.f32.f16.f16
{f32d0, f32d1, f32d2, f32d3},
{f16a0, f16a1, f16a2, f16a3},
descB,
1, -1, -1, 1;
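My guess (please correct me if this is wrong) is that these register operands are per-thread fragments, with the full D and A matrices distributed across the 4 x 32 = 128 threads of the warpgroup; at least the element counts work out:

#include <cstdio>

int main() {
    const int threads = 4 * 32;   // 4 warps per warpgroup, 32 threads per warp
    const int d_elems = 64 * 8;   // D is 64x8 f32 = 512 elements
    const int a_elems = 64 * 16;  // A is 64x16 f16 = 1024 elements
    // 512 f32 / 128 threads = 4 f32 registers per thread
    printf("f32 regs per thread for D: %d\n", d_elems / threads);
    // 1024 f16 / 128 threads = 8 f16 per thread = 4 packed 32-bit registers
    printf("f16 elems per thread for A: %d (= %d 32-bit regs)\n",
           a_elems / threads, a_elems / threads / 2);
    return 0;
}

Is that the right way to read the {f32d0..f32d3} and {f16a0..f16a3} operands?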