Why Does m8n8k16 Show Half the Throughput of m16n8k16, and Why Does ldmatrix Latency Rise with More Warps?

Hi there!

I recently came across a paper on Tensor Core microbenchmarks titled "Dissecting Tensor Cores via Microbenchmarks: Latency, Throughput and Numeric Behaviors".

The paper provides some benchmark code and data, and I have a few questions regarding the PTX instruction results discussed in it.

Here’s the link to the paper: [Dissecting Tensor Cores via Microbenchmarks: Latency, Throughput and Numeric Behaviors]

Question 1:

In the paper, I noticed that the following two instructions show a consistent difference in throughput:

mma.sync.aligned.m16n8k16.row.col.satfinite.s32.s8.u8.s32

mma.sync.aligned.m8n8k16.row.col.satfinite.s32.s8.u8.s32

Specifically, the m8n8k16 instruction achieves only about half the FMA/clk/SM of m16n8k16, regardless of warp count or ILP level.

Could someone explain what causes this difference?
My understanding is that with sufficient warp-level parallelism and ILP, their peak throughput should converge under the same Tensor Core scheduling. One thing I do notice is that the two shapes perform different amounts of work per instruction: m16n8k16 computes 16×8×16 = 2048 MACs, while m8n8k16 computes only 8×8×16 = 1024, so an equal per-instruction issue cost would already account for a 2× gap. I'd like to confirm whether that is the whole story.

For reference, here is a simplified version of the microbenchmark code I used to test the throughput:

asm volatile("bar.sync 0;");
uint64_t start = 0, stop = 0;
asm volatile("mov.u64 %0, %%clock64;" : "=l"(start)::"memory");

for (int j = 0; j < ITERS; ++j) {
    #pragma unroll
    for (int i = 0; i < ILPconfig; ++i) {
        asm volatile(
            "mma.sync.aligned.m16n8k16.row.col.satfinite.s32.s8.u8.s32 "
            "{%0, %1, %2, %3}, {%4, %5}, {%6}, {%7, %8, %9, %10};\n"
            : "=r"(D[i * 4 + 0]), "=r"(D[i * 4 + 1]), "=r"(D[i * 4 + 2]), "=r"(D[i * 4 + 3])
            : "r"(A_u32[i * 2 + 0]), "r"(A_u32[i * 2 + 1]), "r"(B_u32[i]),
              "r"(C[i * 4 + 0]), "r"(C[i * 4 + 1]), "r"(C[i * 4 + 2]), "r"(C[i * 4 + 3])
        );
    }
    __syncwarp();
}

asm volatile("mov.u64 %0, %%clock64;" : "=l"(stop)::"memory");

Question 2:

I also observed interesting behavior with the ldmatrix instruction.
Regardless of the ILP configuration, the per-instruction latency of ldmatrix is low when only a single warp is active, but it rises significantly as soon as multiple warps are active simultaneously.

What could be causing this behavior?
In some other papers, I've seen that CUDA's native shared-memory loads and stores (ld/st) do not exhibit this issue, so I'm curious whether this is related to Tensor Core data-path contention, shared-memory bank conflicts, or internal scheduling of the ldmatrix pipeline. (A sketch of the loop I benchmark is below.)
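
For reference, here is a minimal sketch of the ldmatrix loop I benchmark, mirroring the mma loop above. The shared-memory buffer and its layout are placeholders I chose for illustration:

__shared__ __align__(16) uint16_t smem[32 * 8];  // placeholder staging buffer:
                                                 // one 16-byte row per lane

uint32_t frag[4];
// For ldmatrix .x4, every lane supplies the shared-memory address
// of one 8-element b16 row (16 bytes, 16-byte aligned).
uint32_t addr = (uint32_t)__cvta_generic_to_shared(smem + (threadIdx.x % 32) * 8);

uint64_t t0 = 0, t1 = 0;
asm volatile("bar.sync 0;");
asm volatile("mov.u64 %0, %%clock64;" : "=l"(t0)::"memory");

for (int j = 0; j < ITERS; ++j) {
    #pragma unroll
    for (int i = 0; i < ILPconfig; ++i) {
        asm volatile(
            "ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%0, %1, %2, %3}, [%4];\n"
            : "=r"(frag[0]), "=r"(frag[1]), "=r"(frag[2]), "=r"(frag[3])
            : "r"(addr)
        );
    }
    __syncwarp();
}

asm volatile("mov.u64 %0, %%clock64;" : "=l"(t1)::"memory");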

I’d really appreciate it if someone could help clarify a few points. Thank you in advance!

Hi,

The table you shared is for the A100, a discrete data-center GPU.
Please file a topic on the corresponding board instead.

Thanks.

Yes, you're right: the results in that paper are mostly based on the A100 and RTX 30-series GPUs.
Unfortunately, I don't have access to those platforms, so I've been running the tests on a Jetson AGX Orin instead.

My understanding is that, since these are PTX-level microbenchmarks, the behavior should be largely similar across devices of the same architecture (the Orin's GPU is also Ampere-based).
I just want to understand why these two instructions exhibit such different characteristics under the same conditions.

Any insights or official documentation references would be really helpful. Thanks again for your time!

Hi,

You can try it with our CUTLASS library below:

Thanks.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.