Hi, I am working on a simple microbenchmark to measure the latency and throughput of int4 mma.sync
instruction in RTX3090 GPU. Here’s my initial code, which I referenced from this paper: GitHub - sunlex0717/DissectingTensorCores.
// setup A, B, C, and D matrix fragments in registers
__syncthreads();
asm volatile("mov.u64 %0, %%clock64;" : "=l"(start)::"memory");
for (int j = 0; j < 1000; ++j) {
asm volatile(
"mma.sync.aligned.m16n8k32.row.col.s32.s4.s4.s32 "
"{%0,%1,%2,%3}, {%4,%5}, {%6}, {%7,%8,%9,%10};\n"
: "=r"(D[0]), "=r"(D[1]) , "=r"(D[2]), "=r"(D[3])
: "r"(A[0]), "r"(A[1]),
"r"(B[0]),
"r"(C[0]), "r"(C[1]) ,"r"(C[2]), "r"(C[3])
);
__syncwarp();
}
__syncthreads();
asm volatile("mov.u64 %0, %%clock64;" : "=l"(stop)::"memory");
I found that the latency of the MMA instruction is about 17 clock cycles. By increasing the number of warp, I was able to confirm that the above code achieves a throughput of 568 TOPS, which aligns with the peak INT4 tensor TOPS reported in the whitepaper.
I then modified the line "r"(C[0]), "r"(C[1]) ,"r"(C[2]), "r"(C[3])
to "r"(0), "r"(0), "r"(0), "r"(0)
to explicitly set the accumulator to be initialized to zero. I verified using cuobjdump that a register with a zero value is indeed being used as an input. However, I’ve found that the latency of MMA and add instruction is now around 12 clock cycles, and the throughput is more than 3 times higher than the peak INT4 TOPS. Here’s is my revised code:
// setup A, B, C, and D matrix fragments in registers
__syncthreads();
asm volatile("mov.u64 %0, %%clock64;" : "=l"(start)::"memory");
for (int j = 0; j < 1000; ++j) {
asm volatile(
"mma.sync.aligned.m16n8k32.row.col.s32.s4.s4.s32 "
"{%0,%1,%2,%3}, {%4,%5}, {%6}, {%7,%8,%9,%10};\n"
: "=r"(D[0]), "=r"(D[1]) , "=r"(D[2]), "=r"(D[3])
: "r"(A[0]), "r"(A[1]),
"r"(B[0]),
"r"(0), "r"(0) ,"r"(0), "r"(0)
);
__syncwarp();
D[0] += C[0];
D[1] += C[1];
D[2] += C[2];
D[3] += C[3];
__syncwarp();
}
__syncthreads();
asm volatile("mov.u64 %0, %%clock64;" : "=l"(stop)::"memory");
From my understanding, the two codes should produce the same output, but the second code shows much higher throughput and shorter latency. Could you help me understand what I might be missing?