Help needed to execute tcgen05.mma_cta_group::2 instructions

Hi everyone,

I’m encountering an “unspecified launch failure” error when executing the tcgen05.mma.cta_group::2instruction.

My kernel configuration is as follows:

config.gridDim = dim3(2, 1, 1);
config.blockDim = dim3(128, 1, 1);
attr.val.clusterDim = {2, 1, 1};

M256,N128,K16

Inside the kernel:

int cta_in_group = blockIdx.x % 2;
int tid = threadIdx.x;
int warp_id = tid / 32;
int lane_id = tid % 32;

TMEM allocation is done as:

if (warp_id == 0) { // single warp issue
asm volatile(
“tcgen05.alloc.cta_group::2.sync.aligned.shared::cta.b32 [%0], %1;”
:: “r”(tmem_addr), “r”(nCols)
);
}

The MMA instruction is:

if (tid == 0 && cta_in_group == 0) {
uint32_t disable_out_lane[8] = {0, 0, 0, 0, 0, 0, 0, 0};
asm volatile(“tcgen05.mma.cta_group::2.kind::f16 [%0], %1, %2, %3,{%4, %5, %6, %7, %8, %9, %10, %11}, 1, 0; \n\t”
:: “r”(taddr), “l”(mat_a_desc), “l”(mat_b_desc), “r”(uint32_t(idesc >> 32)),
“r”(disable_out_lane[0]), “r”(disable_out_lane[1]), “r”(disable_out_lane[2]), “r”(disable_out_lane[3]),
“r”(disable_out_lane[4]), “r”(disable_out_lane[5]), “r”(disable_out_lane[6]), “r”(disable_out_lane[7])
);
asm volatile(
“tcgen05.commit.cta_group::2.mbarrier::arrive::one.shared::cluster.b64 [%0];”
::“r”(mbar_ptr)
: “memory”
);
}

And the dealloc is :

if (warp_id == 0) {

    asm volatile (

        "tcgen05.dealloc.cta_group::2.sync.aligned.b32 %0, %1;"

        :: "r"(taddr), "r"(nCols)

        :"memory");    

}

What is even stranger is that while the program can be debugged step-by-step normally using cuda-gdb, running the program directly triggers the aforementioned error.

I’d appreciate any help or suggestions on what might be causing the illegal instruction error. Thanks!