Help needed to execute tcgen05.mma_cta_group::2 instructions

pkumail · December 7, 2025, 3:13pm

Hi everyone,

I’m encountering an “unspecified launch failure” error when executing the tcgen05.mma.cta_group::2instruction.

My kernel configuration is as follows:

config.gridDim = dim3(2, 1, 1);
config.blockDim = dim3(128, 1, 1);
attr.val.clusterDim = {2, 1, 1};

M256,N128,K16

Inside the kernel:

int cta_in_group = blockIdx.x % 2;
int tid = threadIdx.x;
int warp_id = tid / 32;
int lane_id = tid % 32;

TMEM allocation is done as:

if (warp_id == 0) { // single warp issue
asm volatile(
“tcgen05.alloc.cta_group::2.sync.aligned.shared::cta.b32 [%0], %1;”
:: “r”(tmem_addr), “r”(nCols)
);
}

The MMA instruction is:

if (tid == 0 && cta_in_group == 0) {
uint32_t disable_out_lane[8] = {0, 0, 0, 0, 0, 0, 0, 0};
asm volatile(“tcgen05.mma.cta_group::2.kind::f16 [%0], %1, %2, %3,{%4, %5, %6, %7, %8, %9, %10, %11}, 1, 0; \n\t”
:: “r”(taddr), “l”(mat_a_desc), “l”(mat_b_desc), “r”(uint32_t(idesc >> 32)),
“r”(disable_out_lane[0]), “r”(disable_out_lane[1]), “r”(disable_out_lane[2]), “r”(disable_out_lane[3]),
“r”(disable_out_lane[4]), “r”(disable_out_lane[5]), “r”(disable_out_lane[6]), “r”(disable_out_lane[7])
);
asm volatile(
“tcgen05.commit.cta_group::2.mbarrier::arrive::one.shared::cluster.b64 [%0];”
::“r”(mbar_ptr)
: “memory”
);
}

And the dealloc is :

if (warp_id == 0) {

    asm volatile (

        "tcgen05.dealloc.cta_group::2.sync.aligned.b32 %0, %1;"

        :: "r"(taddr), "r"(nCols)

        :"memory");    

}

What is even stranger is that while the program can be debugged step-by-step normally using cuda-gdb, running the program directly triggers the aforementioned error.

I’d appreciate any help or suggestions on what might be causing the illegal instruction error. Thanks!

Topic		Replies	Views
[H100 \| SM90a] Illegal Instruction in TMA Load (tma_copy_3d) with CUDA 12.9 CUDA Programming and Performance cuda , kernel	0	162	May 29, 2025
Illegal instruction when using copy() in a simple TMA demo CUDA Programming and Performance	0	211	April 23, 2025
Address out of bounds when using mma instructions CUDA Programming and Performance cuda	7	333	June 11, 2024
Unspecified launch failure when launch my kernel CUDA Programming and Performance	2	3485	June 15, 2010
unspecified launch failure: This error is in cudaMemcpy CUDA Programming and Performance	9	23358	February 16, 2010
"an illegal instruction was encountered" CUDA Programming and Performance	9	13347	August 2, 2016
unspecified launch failure CUDA Programming and Performance	2	9099	January 21, 2009
Kernel Launch Failure Simple Kernel with 2D array fails init CUDA Programming and Performance	7	9442	November 19, 2007
CUDA Bug: "CUDA error: unspecified launch failure" CUDA Programming and Performance	7	11940	March 11, 2011
Unspecified launch failure error CUDA Programming and Performance	10	19972	January 6, 2018

Help needed to execute tcgen05.mma_cta_group::2 instructions

Related topics