Hi everyone,
I’m encountering an “unspecified launch failure” error when executing the tcgen05.mma.cta_group::2instruction.
My kernel configuration is as follows:
config.gridDim = dim3(2, 1, 1);
config.blockDim = dim3(128, 1, 1);
attr.val.clusterDim = {2, 1, 1};M256,N128,K16
Inside the kernel:
int cta_in_group = blockIdx.x % 2;
int tid = threadIdx.x;
int warp_id = tid / 32;
int lane_id = tid % 32;
TMEM allocation is done as:
if (warp_id == 0) { // single warp issue
asm volatile(
“tcgen05.alloc.cta_group::2.sync.aligned.shared::cta.b32 [%0], %1;”
:: “r”(tmem_addr), “r”(nCols)
);
}
The MMA instruction is:
if (tid == 0 && cta_in_group == 0) {
uint32_t disable_out_lane[8] = {0, 0, 0, 0, 0, 0, 0, 0};
asm volatile(“tcgen05.mma.cta_group::2.kind::f16 [%0], %1, %2, %3,{%4, %5, %6, %7, %8, %9, %10, %11}, 1, 0; \n\t”
:: “r”(taddr), “l”(mat_a_desc), “l”(mat_b_desc), “r”(uint32_t(idesc >> 32)),
“r”(disable_out_lane[0]), “r”(disable_out_lane[1]), “r”(disable_out_lane[2]), “r”(disable_out_lane[3]),
“r”(disable_out_lane[4]), “r”(disable_out_lane[5]), “r”(disable_out_lane[6]), “r”(disable_out_lane[7])
);
asm volatile(
“tcgen05.commit.cta_group::2.mbarrier::arrive::one.shared::cluster.b64 [%0];”
::“r”(mbar_ptr)
: “memory”
);
}
And the dealloc is :
if (warp_id == 0) { asm volatile ( "tcgen05.dealloc.cta_group::2.sync.aligned.b32 %0, %1;" :: "r"(taddr), "r"(nCols) :"memory"); }
What is even stranger is that while the program can be debugged step-by-step normally using cuda-gdb, running the program directly triggers the aforementioned error.
I’d appreciate any help or suggestions on what might be causing the illegal instruction error. Thanks!