Throughput and latency of mma.sync instruction

Jun98 · August 11, 2024, 6:51pm

Hi, I am working on a simple microbenchmark to measure the latency and throughput of the int4 mma.sync instruction on an RTX 3090 GPU. Here's my initial code, which I based on the code from this paper: GitHub - sunlex0717/DissectingTensorCores.

// setup A, B, C, and D matrix fragments in registers
__syncthreads();
asm volatile("mov.u64 %0, %%clock64;" : "=l"(start)::"memory");
for (int j = 0; j < 1000; ++j) {
    asm volatile(
        "mma.sync.aligned.m16n8k32.row.col.s32.s4.s4.s32 "
        "{%0,%1,%2,%3}, {%4,%5}, {%6}, {%7,%8,%9,%10};\n"
        : "=r"(D[0]), "=r"(D[1]) , "=r"(D[2]), "=r"(D[3])
        : "r"(A[0]), "r"(A[1]),
          "r"(B[0]),
          "r"(C[0]), "r"(C[1]) ,"r"(C[2]), "r"(C[3])
    );
    __syncwarp();
}
__syncthreads();
asm volatile("mov.u64 %0, %%clock64;" : "=l"(stop)::"memory");

I found that the latency of the MMA instruction is about 17 clock cycles. By increasing the number of warps, I confirmed that the above code reaches a throughput of 568 TOPS, which matches the peak INT4 Tensor TOPS reported in the whitepaper.
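As a rough cross-check of the figures above, cycle counts can be converted to TOPS as sketched below. The helper names (`mma_ops`, `estimate_tops`) are hypothetical, and the 82 SMs and 1.695 GHz boost clock are assumed RTX 3090 values rather than numbers taken from the benchmark. The sketch also assumes that every SM runs the same loop and that each multiply-accumulate counts as two operations.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Operations in one warp-wide mma.sync of shape m x n x k,
// counting each multiply-accumulate as two operations.
constexpr std::uint64_t mma_ops(std::uint64_t m, std::uint64_t n, std::uint64_t k) {
    return 2 * m * n * k;
}

// Device-wide throughput estimate in TOPS (hypothetical helper).
//   mmas_per_warp      : loop trip count (1000 in the snippet above)
//   warps_per_sm       : warps resident on one SM during the measurement
//   elapsed_cycles     : stop - start as read from %clock64
//   num_sms, clock_ghz : device properties (assumed values, not measured)
inline double estimate_tops(std::uint64_t ops_per_mma, std::uint64_t mmas_per_warp,
                            std::uint64_t warps_per_sm, std::uint64_t elapsed_cycles,
                            int num_sms, double clock_ghz) {
    const double ops_per_cycle_per_sm =
        static_cast<double>(ops_per_mma) * mmas_per_warp * warps_per_sm / elapsed_cycles;
    // ops/cycle/SM * SMs * (GHz * 1e9 cycles/s) / 1e12 ops per TOPS
    return ops_per_cycle_per_sm * num_sms * clock_ghz * 1e9 / 1e12;
}
```

With these assumed device values, m16n8k32 gives 8192 operations per instruction. An illustrative (not measured) run in which 4 warps per SM each complete 1000 iterations in 8000 cycles works out to roughly 569 TOPS, close to the whitepaper figure.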

I then changed the line "r"(C[0]), "r"(C[1]) ,"r"(C[2]), "r"(C[3]) to "r"(0), "r"(0), "r"(0), "r"(0) so that the accumulator input is explicitly zero. I verified using cuobjdump that a register with a zero value is indeed being used as an input. However, I found that the combined latency of the MMA and add instructions is now around 12 clock cycles, and the throughput is more than 3 times the peak INT4 TOPS. Here is my revised code:

// setup A, B, C, and D matrix fragments in registers
__syncthreads();
asm volatile("mov.u64 %0, %%clock64;" : "=l"(start)::"memory");
for (int j = 0; j < 1000; ++j) {
    asm volatile(
        "mma.sync.aligned.m16n8k32.row.col.s32.s4.s4.s32 "
        "{%0,%1,%2,%3}, {%4,%5}, {%6}, {%7,%8,%9,%10};\n"
        : "=r"(D[0]), "=r"(D[1]) , "=r"(D[2]), "=r"(D[3])
        : "r"(A[0]), "r"(A[1]),
          "r"(B[0]),
          "r"(0), "r"(0) ,"r"(0), "r"(0)
    );
    __syncwarp();
    D[0] += C[0];
    D[1] += C[1];
    D[2] += C[2];
    D[3] += C[3];
    __syncwarp();
}
__syncthreads();
asm volatile("mov.u64 %0, %%clock64;" : "=l"(stop)::"memory");

From my understanding, the two versions should produce the same output, but the second one shows much higher throughput and lower latency. Could you help me understand what I might be missing?

Curefab · August 11, 2024, 9:08pm

It could be that the dependencies are no longer tracked correctly.

In the original code, C and D point to the same memory address (aliasing) through a reinterpret_cast (often UB). The mma instruction also depends on data held by the other threads of the warp, which is not listed in the asm arguments (it lives in local memory, or rather in the registers of those other threads).

So the dependencies were not very clear to begin with, and it is a wonder it works at all.

Is the SASS code perfectly identical except for RZ? Even the control bits? You could also patch it directly, replacing the C registers with RZ. (Or is it not RZ, but a register loaded with 0?)

Jun98 · August 19, 2024, 2:16am

Sorry for the late response. As @Curefab pointed out, I found that dependencies are not correctly considered anymore. When I used temporary registers to hold the intermediate output, I got the expected result (same expected latency and throughput). Thanks!
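One way to see why lost dependencies inflate the numbers is a simplified timing model: a chain of dependent instructions costs one full latency each, while independent instructions can overlap in the pipeline and cost only the issue interval each. This is a sketch, not a model of the actual hardware scheduler. The function names and the 8-cycle issue interval are hypothetical; only the 17-cycle latency comes from the measurement in this thread.

```cpp
#include <cassert>
#include <cstdint>

// Cycles for n instructions where each one consumes the previous result:
// nothing overlaps, so every instruction pays the full latency.
constexpr std::uint64_t dependent_chain_cycles(std::uint64_t n, std::uint64_t latency) {
    return n * latency;
}

// Cycles for n independent instructions: a new one can start every
// issue_interval cycles, and the last one still needs `latency` cycles to finish.
constexpr std::uint64_t independent_cycles(std::uint64_t n, std::uint64_t latency,
                                           std::uint64_t issue_interval) {
    return n == 0 ? 0 : (n - 1) * issue_interval + latency;
}
```

With 1000 iterations and a 17-cycle latency, the dependent chain takes 17000 cycles, whereas the independent case with the assumed 8-cycle issue interval takes 8009. Dividing the elapsed cycles by the iteration count would then report about 8 cycles per instruction instead of 17, even though the instruction itself is no faster.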

Curefab · August 19, 2024, 7:16am

Hi Jun98,
thank you for the update!
Could you post a very short code sample showing exactly what you mean by using temporary registers, so that others reading the post can do the same?

Jun98 · August 19, 2024, 2:48pm

Sure, below is a simple code snippet that I used. I verified that it produces an identical result to the original MMA instruction using simple test cases.
I am not sure why, but the code below shows slightly lower latency and higher throughput (about 1.1 - 1.2x). It would be great if you could share any thoughts or guesses about this.

// setup A, B, C, and D matrix fragments in registers
int32_t psum[4];  // tmp register for partial sum
__syncthreads();
asm volatile("mov.u64 %0, %%clock64;" : "=l"(start)::"memory");
for (int j = 0; j < 1000; ++j) {
    asm volatile(
        "mma.sync.aligned.m16n8k32.row.col.s32.s4.s4.s32 "
        "{%0,%1,%2,%3}, {%4,%5}, {%6}, {%7,%8,%9,%10};\n"
        : "=r"(psum[0]), "=r"(psum[1]) , "=r"(psum[2]), "=r"(psum[3])
        : "r"(A[0]), "r"(A[1]),
          "r"(B[0]),
          "r"(0), "r"(0) ,"r"(0), "r"(0)
    );
    __syncwarp();
    D[0] = psum[0] + C[0];
    D[1] = psum[1] + C[1];
    D[2] = psum[2] + C[2];
    D[3] = psum[3] + C[3];
}
__syncthreads();
asm volatile("mov.u64 %0, %%clock64;" : "=l"(stop)::"memory");

