Throughput and latency of mma.sync instruction

Hi, I am working on a simple microbenchmark to measure the latency and throughput of the INT4 mma.sync instruction on an RTX 3090 GPU. Here is my initial code, which I based on the benchmark from this paper: GitHub - sunlex0717/DissectingTensorCores.

// setup A, B, C, and D matrix fragments in registers
__syncthreads();
asm volatile("mov.u64 %0, %%clock64;" : "=l"(start)::"memory");
for (int j = 0; j < 1000; ++j) {
    asm volatile(
        "mma.sync.aligned.m16n8k32.row.col.s32.s4.s4.s32 "
        "{%0,%1,%2,%3}, {%4,%5}, {%6}, {%7,%8,%9,%10};\n"
        : "=r"(D[0]), "=r"(D[1]), "=r"(D[2]), "=r"(D[3])
        : "r"(A[0]), "r"(A[1]),
          "r"(B[0]),
          "r"(C[0]), "r"(C[1]), "r"(C[2]), "r"(C[3])
    );
    __syncwarp();
}
__syncthreads();
asm volatile("mov.u64 %0, %%clock64;" : "=l"(stop)::"memory");

I found that the latency of the MMA instruction is about 17 clock cycles. By increasing the number of warps, I was able to confirm that the above code achieves a throughput of 568 TOPS, which matches the peak INT4 tensor TOPS reported in the whitepaper.

I then modified the line "r"(C[0]), "r"(C[1]), "r"(C[2]), "r"(C[3]) to "r"(0), "r"(0), "r"(0), "r"(0) so that the accumulator is explicitly initialized to zero. I verified with cuobjdump that a register holding zero is indeed being used as an input. However, I found that the combined latency of the MMA and add instructions is now around 12 clock cycles, and the throughput is more than three times the peak INT4 TOPS. Here is my revised code:

// setup A, B, C, and D matrix fragments in registers
__syncthreads();
asm volatile("mov.u64 %0, %%clock64;" : "=l"(start)::"memory");
for (int j = 0; j < 1000; ++j) {
    asm volatile(
        "mma.sync.aligned.m16n8k32.row.col.s32.s4.s4.s32 "
        "{%0,%1,%2,%3}, {%4,%5}, {%6}, {%7,%8,%9,%10};\n"
        : "=r"(D[0]), "=r"(D[1]), "=r"(D[2]), "=r"(D[3])
        : "r"(A[0]), "r"(A[1]),
          "r"(B[0]),
          "r"(0), "r"(0), "r"(0), "r"(0)
    );
    __syncwarp();
    D[0] += C[0];
    D[1] += C[1];
    D[2] += C[2];
    D[3] += C[3];
    __syncwarp();
}
__syncthreads();
asm volatile("mov.u64 %0, %%clock64;" : "=l"(stop)::"memory");

From my understanding, the two versions should produce the same output, yet the second one shows much higher throughput and lower latency. Could you help me understand what I might be missing?

It could be that the dependencies are no longer tracked correctly.

In the original code, C and D point to the same memory address (aliasing) via a reinterpret_cast, which is often undefined behavior. The mma instruction also carries dependencies between the threads of a warp on data not listed in the asm arguments (effectively the registers of the other threads).

So the dependencies were not very clear to begin with, and it is a wonder it works at all.
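One common way to make the accumulator dependency explicit to the compiler is to use read-write ("+r") constraints on D, so that each iteration's mma visibly consumes the previous result. A sketch under that assumption (accumulating in place across iterations, not the poster's exact code):

```cuda
// D acts as both accumulator input and output; "+r" tells the
// compiler that each mma reads the D produced by the previous
// iteration, so the loop-carried dependency cannot be dropped.
asm volatile(
    "mma.sync.aligned.m16n8k32.row.col.s32.s4.s4.s32 "
    "{%0,%1,%2,%3}, {%4,%5}, {%6}, {%0,%1,%2,%3};\n"
    : "+r"(D[0]), "+r"(D[1]), "+r"(D[2]), "+r"(D[3])
    : "r"(A[0]), "r"(A[1]), "r"(B[0])
);
```

With this form the aliasing question disappears entirely, since D itself carries the running sum from one iteration to the next.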

Is the SASS code perfectly identical except for RZ? Even the control bits? You could also patch it directly, replacing the C registers with RZ. (Or is it not RZ, but a register loaded with 0?)


Sorry for the late response. As @Curefab pointed out, I found that the dependencies were indeed no longer being tracked correctly. When I used temporary registers to hold the intermediate output, I got the expected result (the expected latency and throughput). Thanks!


Hi Jun98,
thank you for the update!
Could you post a very short code sample showing what exactly you mean by using temporary registers, so that others reading this post can do the same?

Sure, below is a simple code snippet that I used. I verified that it produces results identical to the original MMA instruction on simple test cases.
I am not sure why, but the code below shows slightly lower latency and higher throughput (about 1.1-1.2x). It would be great if you could share any thoughts or guesses about this.

// setup A, B, C, and D matrix fragments in registers
int32_t psum[4];  // temporary registers for the partial sum
__syncthreads();
asm volatile("mov.u64 %0, %%clock64;" : "=l"(start)::"memory");
for (int j = 0; j < 1000; ++j) {
    asm volatile(
        "mma.sync.aligned.m16n8k32.row.col.s32.s4.s4.s32 "
        "{%0,%1,%2,%3}, {%4,%5}, {%6}, {%7,%8,%9,%10};\n"
        : "=r"(psum[0]), "=r"(psum[1]), "=r"(psum[2]), "=r"(psum[3])
        : "r"(A[0]), "r"(A[1]),
          "r"(B[0]),
          "r"(0), "r"(0), "r"(0), "r"(0)
    );
    __syncwarp();
    D[0] = psum[0] + C[0];
    D[1] = psum[1] + C[1];
    D[2] = psum[2] + C[2];
    D[3] = psum[3] + C[3];
}
__syncthreads();
asm volatile("mov.u64 %0, %%clock64;" : "=l"(stop)::"memory");
