Why Does m8n8k16 Show Half the Throughput of m16n8k16, and Why Does ldmatrix Latency Rise with More Warps?

Hi there!

I recently came across a paper on Tensor Core microbenchmarks titled "Dissecting Tensor Cores via Microbenchmarks: Latency, Throughput and Numeric Behaviors".

The paper provides some benchmark code and data, and I have a few questions regarding the PTX instruction results discussed in it.

Here’s the link to the paper: [Dissecting Tensor Cores via Microbenchmarks: Latency, Throughput and Numeric Behaviors]

Question 1:

In the paper, I noticed that the following two instructions show a consistent difference in throughput:

mma.sync.aligned.m16n8k16.row.col.satfinite.s32.s8.u8.s32

mma.sync.aligned.m8n8k16.row.col.satfinite.s32.s8.u8.s32

Specifically, the m8n8k16 instruction achieves only about half the FMA/clk/SM of m16n8k16, regardless of warp count or ILP level.

Could someone explain what causes this difference?
My understanding is that with sufficient warp-level parallelism and ILP, their peak throughput should converge under the same Tensor Core scheduling. One thing I do notice is that the two shapes perform different amounts of work per instruction: m16n8k16 computes 16×8×16 = 2048 MACs, while m8n8k16 computes only 8×8×16 = 1024, so an equal per-instruction issue cost would already account for a 2× gap. I'd like to confirm whether that is the whole story.

For reference, here is a simplified version of the microbenchmark code I used to test the throughput:

asm volatile("bar.sync 0;");
uint64_t start = 0, stop = 0;
asm volatile("mov.u64 %0, %%clock64;" : "=l"(start)::"memory");

for (int j = 0; j < ITERS; ++j) {
    #pragma unroll
    for (int i = 0; i < ILPconfig; ++i) {
        asm volatile(
            "mma.sync.aligned.m16n8k16.row.col.satfinite.s32.s8.u8.s32 "
            "{%0, %1, %2, %3}, {%4, %5}, {%6}, {%7, %8, %9, %10};\n"
            : "=r"(D[i * 4 + 0]), "=r"(D[i * 4 + 1]), "=r"(D[i * 4 + 2]), "=r"(D[i * 4 + 3])
            : "r"(A_u32[i * 2 + 0]), "r"(A_u32[i * 2 + 1]), "r"(B_u32[i]),
              "r"(C[i * 4 + 0]), "r"(C[i * 4 + 1]), "r"(C[i * 4 + 2]), "r"(C[i * 4 + 3])
        );
    }
    __syncwarp();
}

asm volatile("mov.u64 %0, %%clock64;" : "=l"(stop)::"memory");

Question 2:

I also observed interesting behavior with the ldmatrix instruction.
Regardless of the ILP configuration, the per-instruction latency of ldmatrix is low when only a single warp is active, but it rises significantly as soon as multiple warps are active simultaneously.

What could be causing this behavior?
In some other papers, I've seen that CUDA's native shared-memory loads and stores (ld/st) do not exhibit this issue, so I'm curious whether this is related to Tensor Core data-path contention, shared-memory bank conflicts, or internal scheduling of the ldmatrix pipeline. (A sketch of the loop I benchmark is below.)
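
For reference, here is a minimal sketch of the ldmatrix loop I benchmark, mirroring the mma loop above. The shared-memory buffer and its layout are placeholders I chose for illustration:

__shared__ __align__(16) uint16_t smem[32 * 8];  // placeholder staging buffer:
                                                 // one 16-byte row per lane

uint32_t frag[4];
// For ldmatrix .x4, every lane supplies the shared-memory address
// of one 8-element b16 row (16 bytes, 16-byte aligned).
uint32_t addr = (uint32_t)__cvta_generic_to_shared(smem + (threadIdx.x % 32) * 8);

uint64_t t0 = 0, t1 = 0;
asm volatile("bar.sync 0;");
asm volatile("mov.u64 %0, %%clock64;" : "=l"(t0)::"memory");

for (int j = 0; j < ITERS; ++j) {
    #pragma unroll
    for (int i = 0; i < ILPconfig; ++i) {
        asm volatile(
            "ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%0, %1, %2, %3}, [%4];\n"
            : "=r"(frag[0]), "=r"(frag[1]), "=r"(frag[2]), "=r"(frag[3])
            : "r"(addr)
        );
    }
    __syncwarp();
}

asm volatile("mov.u64 %0, %%clock64;" : "=l"(t1)::"memory");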

I’d really appreciate it if someone could help clarify a few points. Thank you in advance!

Hi,

The table you shared is for the A100, a discrete data-center GPU.
Please file a topic on the corresponding board instead.

Thanks.

Yes, you're right: the results in that paper are mostly based on the A100 and RTX 30-series GPUs.
Unfortunately, I don't have access to those platforms, so I've been running the tests on a Jetson AGX Orin instead.

My understanding is that, since these are PTX-level microbenchmarks, the behavior should be largely similar across devices of the same architecture (the Orin's GPU is also Ampere-based).
I just want to understand why these two instructions exhibit such different characteristics under the same conditions.

Any insights or official documentation references would be really helpful. Thanks again for your time!

Hi,

You can try it with our CUTLASS library below:

Thanks.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.