Understand the mma instruction in PTX

cwwu · May 29, 2024, 1:08pm

Hello, I’m studying the mma instruction in PTX, and I found the code below, which runs correctly. However, I cannot understand why.

asm volatile(
    "mma.sync.aligned.m16n8k8.row.col.f32.tf32.tf32.f32 "
    "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
    : "=f"(C[cd[0]]),                                     // D[0]   32bit
      "=f"(C[cd[1]]),                                     // D[1]
      "=f"(C[cd[2]]),                                     // D[2]
      "=f"(C[cd[3]])                                      // D[3]
    : "r"(*reinterpret_cast<uint32_t const *>(&A[a[0]])), // A[0]   32bit
      "r"(*reinterpret_cast<uint32_t const *>(&A[a[1]])), // A[1]
      "r"(*reinterpret_cast<uint32_t const *>(&A[a[2]])), // A[2]
      "r"(*reinterpret_cast<uint32_t const *>(&A[a[3]])), // A[3]
      "r"(*reinterpret_cast<uint32_t const *>(&B[b[0]])), // B[0]
      "r"(*reinterpret_cast<uint32_t const *>(&B[b[1]])), // B[1]
      "f"(C[cd[0]]),                                      // C[0]
      "f"(C[cd[1]]),                                      // C[1]
      "f"(C[cd[2]]),                                      // C[2]
      "f"(C[cd[3]])                                       // C[3]
);

// a, b, and cd hold the current indices into matrices A, B, and C/D

It uses the mma.sync.aligned.m16n8k8.row.col.f32.tf32.tf32.f32 instruction. I understand that C and D use "f" in the operand list because they are f32. But why do A and B use "r"? Shouldn’t they be of type .b32 because of tf32? The PTX document says: "A register variable containing tf32 data must be declared with .b32 type."
How should I understand *reinterpret_cast<uint32_t const *>(&A[a[0]])? I do not understand why it uses & and reinterpret_cast.
I cannot find the "f" and "r" syntax in the PTX document. Where can I find a reference for it?

striker159 · May 29, 2024, 1:36pm

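*reinterpret_cast<uint32_t const *>(&A[a[0]]) is standard C++: “take the address of the element A[a[0]], treat it as a pointer to uint32_t, and load through it”. The 32 bits are passed unchanged; no numeric conversion happens.

The "f" and "r" letters are inline-assembly operand constraints, not PTX syntax, which is why they are not in the PTX document. They are described in NVIDIA’s “Inline PTX Assembly in CUDA” guide. "r" binds a 32-bit register, which is what a .b32 operand needs: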

“h” = .u16 reg
“r” = .u32 reg
“l” = .u64 reg
“f” = .f32 reg
“d” = .f64 reg

cwwu · May 29, 2024, 1:52pm

wow~ thank you for your excellent answer!

Curefab · May 29, 2024, 4:48pm

wwa or mma?

cwwu · May 29, 2024, 4:59pm

Sorry for the typo… I have already corrected it. Thanks~

