Hi there!
I recently came across a paper on Tensor Core microbenchmarking titled "Dissecting Tensor Cores via Microbenchmarks: Latency, Throughput and Numeric Behaviors".
The paper provides some benchmark code and data, and I have a few questions regarding the PTX instruction results discussed in it.
Here’s the link to the paper: [Dissecting Tensor Cores via Microbenchmarks: Latency, Throughput and Numeric Behaviors]
Question 1:
In the paper, I noticed that the following two instructions show a consistent difference in throughput:
mma.sync.aligned.m16n8k16.row.col.satfinite.s32.s8.u8.s32
mma.sync.aligned.m8n8k16.row.col.satfinite.s32.s8.u8.s32
Specifically, the m8n8k16 instruction achieves only about half the FMA/clk/SM of m16n8k16, regardless of warp count or ILP level.
Could someone explain what causes this difference?
My understanding is that with sufficient warp-level and ILP parallelism, their peak throughput should be similar under the same Tensor Core scheduling.
For reference, here is a simplified version of the microbenchmark code I used to test the throughput:
asm volatile("bar.sync 0;");  // align all warps before starting the timer
uint64_t start = 0, stop = 0;
asm volatile("mov.u64 %0, %%clock64;" : "=l"(start) :: "memory");
for (int j = 0; j < ITERS; ++j) {
#pragma unroll
    for (int i = 0; i < ILPconfig; ++i) {  // ILPconfig independent mma chains
        asm volatile(
            "mma.sync.aligned.m16n8k16.row.col.satfinite.s32.s8.u8.s32 "
            "{%0, %1, %2, %3}, {%4, %5}, {%6}, {%7, %8, %9, %10};\n"
            : "=r"(D[i * 4 + 0]), "=r"(D[i * 4 + 1]), "=r"(D[i * 4 + 2]), "=r"(D[i * 4 + 3])
            : "r"(A_u32[i * 2 + 0]), "r"(A_u32[i * 2 + 1]), "r"(B_u32[i]),
              "r"(C[i * 4 + 0]), "r"(C[i * 4 + 1]), "r"(C[i * 4 + 2]), "r"(C[i * 4 + 3]));
    }
    __syncwarp();
}
asm volatile("mov.u64 %0, %%clock64;" : "=l"(stop) :: "memory");
Question 2:
I also observed an interesting behavior with the ldmatrix instruction.
Regardless of the ILP configuration, when only a single warp is active, the latency per ldmatrix instruction is low;
however, when multiple warps are active simultaneously, the latency per instruction increases significantly.
What could be causing this behavior?
In some other papers, I've seen that CUDA's regular load/store instructions (ld/st) do not exhibit this issue, so I'm curious whether it is related to Tensor Core data-path contention, shared-memory bank conflicts, or internal scheduling of the ldmatrix pipeline.
I'd really appreciate any insight into these points. Thank you in advance!
