Help needed to explain tcgen05.mma_cta_group instructions

smartvoice · February 3, 2025, 4:31pm

Hi All,

What do the 1.kind and 2.kind mean in the following tcgen05 PTX instructions?

"tcgen05.mma.cta_group::1.kind::f16 [%0], [%1], %2, %3, {%5, %6, %7, %8}, p; \n\t"
"tcgen05.mma.cta_group::2.kind::f16 [%0], [%1], %2, %3, {%5, %6, %7, %8, %9, %10, %11, %12}, p; \n\t"

Thanks,

Curefab · February 3, 2025, 5:04pm

Hi smartvoice,
it is not 1.kind and 2.kind, but cta_group::1 and cta_group::2.
With cta_group::1 the MMA is executed on one SM; with cta_group::2 it is executed on a pair of SMs, so up to 8 tensor core units (4 per SM) can work together.
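
To make the numbers concrete, here is a minimal host-side sketch that only encodes what is visible in this thread: the 4-tensor-core-units-per-SM figure above, and the fact that the brace-enclosed register vector in the two quoted instructions has 4 entries for cta_group::1 and 8 for cta_group::2 (I believe the PTX docs call this operand disable-output-lane, but check there). The names CtaGroup, sms_per_mma, tensor_core_units and mask_registers are made up for illustration and are not part of CUDA or CUTLASS.

```cpp
#include <cstdint>

// Hypothetical enum: the value mirrors the N in cta_group::N.
enum class CtaGroup : std::uint32_t { One = 1, Two = 2 };

// Number of SMs cooperating on a single tcgen05.mma:
// one SM for cta_group::1, a pair of SMs for cta_group::2.
constexpr std::uint32_t sms_per_mma(CtaGroup g) {
    return static_cast<std::uint32_t>(g);
}

// Tensor core units that can work together, using the
// "4 per SM" figure quoted in this thread (an assumption taken from the post).
constexpr std::uint32_t tensor_core_units(CtaGroup g) {
    return 4u * sms_per_mma(g);
}

// Length of the {...} register vector in the quoted instructions:
// {%5..%8} for cta_group::1, {%5..%12} for cta_group::2.
constexpr std::uint32_t mask_registers(CtaGroup g) {
    return 4u * sms_per_mma(g);
}

// Compile-time checks against the two instructions in the original question.
static_assert(tensor_core_units(CtaGroup::One) == 4, "one SM");
static_assert(tensor_core_units(CtaGroup::Two) == 8, "SM pair");
static_assert(mask_registers(CtaGroup::One) == 4, "{%5, %6, %7, %8}");
static_assert(mask_registers(CtaGroup::Two) == 8, "{%5, ..., %12}");
```

This is only bookkeeping for the operand counts; it does not issue any tcgen05 instruction.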

striker159 · February 3, 2025, 5:06pm

The identifiers are cta_group::1 and cta_group::2, not 1.kind and 2.kind.
Their explanation is given in the PTX docs.

smartvoice · February 6, 2025, 6:34pm

Thanks for pointing it out.

For the scaling factors, it seems that the tcgen05 instructions read them from shared memory. Is that correct?

Are all the scaling factors precomputed and then copied with TMA from global memory to shared memory before being fed to the tensor core?

    if (cute::elect_one_sync()) {
      asm volatile(
        "{\n\t"
        ".reg .pred p;\n\t"
        "setp.ne.b32 p, %4, 0;\n\t"
        "tcgen05.mma.cta_group::2.kind::mxf8f6f4.block_scale [%0], %1, %2, %3, [%5], [%6], p; \n\t"
        "}\n"
        :
        : "r"(tmem_c), "l"(desc_a), "l"(desc_b), "r"(uint32_t(idescE>>32)), "r"(scaleC),
          "r"(tsfa_addr), "r"(tsfb_addr));
    }
