setmaxnreg and printf coexisting in a kernel cause an illegal instruction runtime error

I ran the simple kernel below on an H100 GPU and got a runtime error from cudaGetLastError() and cudaDeviceSynchronize().
The CUDA error says: “an illegal instruction was encountered”.
If I remove the printf() calls, the kernel runs successfully. Is there any constraint between setmaxnreg and the printf function call?

template<uint32_t RegCount>
__device__ __inline__ void warpgroup_reg_alloc(){
  asm volatile( "setmaxnreg.inc.sync.aligned.u32 %0;\n" : : "n"(RegCount) );
}

template<uint32_t RegCount>
__device__ __inline__ void warpgroup_reg_dealloc(){
  asm volatile( "setmaxnreg.dec.sync.aligned.u32 %0;\n" : : "n"(RegCount) );
}

__global__
__launch_bounds__(768, 1)
void divergentKernelError(int *input, int *output) {
  int thread_id = blockIdx.x*blockDim.x + threadIdx.x;
  int seed = *(input + thread_id);
  int warpgroup_id = thread_id/128;
  int scratch_data[BATCH_SIZE];

  if((warpgroup_id&0x01) == 0){
    // even warp groups do no compute: they only write pattern data and give up their registers
    warpgroup_reg_dealloc<24>();
    printf("decrease register \n");
  }else{
    // odd warp groups get enough registers and do the compute
    warpgroup_reg_alloc<128>();           // 80+80-24 = 136
    printf("increase register \n");
  }
}

When posting code on these forums, please format it correctly. One possible method: Edit your post using the pencil icon below it. Select the code. Press the </> button at the top of the edit pane. Save your changes.

Please do that now, thanks.


Sorry. There is no edit button in my view.

You can make an ordinary reply with the properly formatted code if you wish.

I fixed the formatting.

There is a minimum number of registers required for printf, but I don’t recall whether it is 32 or 64 registers/thread. I would avoid any system calls when using setmaxnreg. I would recommend filing a bug to get an official response from the compiler and libraries teams.
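One way to follow the advice about avoiding system calls might look like the sketch below: instead of calling printf in the low-register path, one thread per warp group writes a status code to global memory and the host prints it after the kernel finishes. This is an untested sketch, not an official recommendation. The names divergentKernelNoPrintf, RegStatus and expected_status are hypothetical and do not appear in the original post, and whether a plain global store fits in 24 registers on a given compiler version is an assumption.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kThreads = 768;        // same block size as the original kernel
constexpr int kWarpgroupSize = 128;  // a warp group is 4 warps
constexpr int kWarpgroups = kThreads / kWarpgroupSize;

// Hypothetical status codes used in place of the printf() messages.
enum RegStatus : int { kUntouched = 0, kDecreased = 1, kIncreased = 2 };

// Status each warp group is expected to report: even -> decreased, odd -> increased.
__host__ __device__ constexpr int expected_status(int warpgroup_id) {
  return (warpgroup_id & 1) == 0 ? kDecreased : kIncreased;
}

__global__ __launch_bounds__(kThreads, 1)
void divergentKernelNoPrintf(int* status) {
  int warpgroup_id = threadIdx.x / kWarpgroupSize;
  bool leader = (threadIdx.x % kWarpgroupSize) == 0;

  if ((warpgroup_id & 1) == 0) {
    // Low-register path: no printf(), only a store by one thread.
    asm volatile("setmaxnreg.dec.sync.aligned.u32 %0;\n" : : "n"(24));
    if (leader) status[warpgroup_id] = kDecreased;
  } else {
    asm volatile("setmaxnreg.inc.sync.aligned.u32 %0;\n" : : "n"(128));
    if (leader) status[warpgroup_id] = kIncreased;
  }
}

int main() {
  int* d_status = nullptr;
  int h_status[kWarpgroups] = {};
  if (cudaMalloc(&d_status, sizeof(h_status)) != cudaSuccess) return 1;
  cudaMemset(d_status, 0, sizeof(h_status));

  divergentKernelNoPrintf<<<1, kThreads>>>(d_status);
  cudaError_t err = cudaDeviceSynchronize();
  if (err != cudaSuccess) {
    fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
    return 1;
  }

  cudaMemcpy(h_status, d_status, sizeof(h_status), cudaMemcpyDeviceToHost);
  for (int wg = 0; wg < kWarpgroups; ++wg) {
    // Printing happens on the host, so the device never calls printf().
    printf("warp group %d: %s register (%s)\n", wg,
           h_status[wg] == kDecreased ? "decrease" : "increase",
           h_status[wg] == expected_status(wg) ? "as expected" : "unexpected");
  }
  cudaFree(d_status);
  return 0;
}
```

setmaxnreg needs an sm_90a target, so this would be built with something like nvcc -arch=sm_90a.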

Thanks all for the help.
The illegal instruction runtime error is indeed caused by the implicit register usage of printf(). When I increase the register quota slightly, the error goes away.
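For anyone hitting the same error, here is a rough sketch of what raising the quota could look like. The thread does not say which value actually worked, so kLowRegs = 40 and kHighRegs = 120 are placeholders to tune, not confirmed numbers. The budget follows the same arithmetic as the original comment (80 + 80 - 40 = 120). The helper valid_setmaxnreg_count and the kernel name divergentKernelFixed are hypothetical. The helper only encodes the PTX rule that the setmaxnreg immediate must be a multiple of 8 in the range 24 to 256.

```cuda
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// PTX requires the setmaxnreg immediate to be in [24, 256] and a multiple of 8.
__host__ __device__ constexpr bool valid_setmaxnreg_count(uint32_t n) {
  return n >= 24 && n <= 256 && n % 8 == 0;
}

template <uint32_t RegCount>
__device__ __forceinline__ void warpgroup_reg_alloc() {
  static_assert(valid_setmaxnreg_count(RegCount), "invalid setmaxnreg count");
  asm volatile("setmaxnreg.inc.sync.aligned.u32 %0;\n" : : "n"(RegCount));
}

template <uint32_t RegCount>
__device__ __forceinline__ void warpgroup_reg_dealloc() {
  static_assert(valid_setmaxnreg_count(RegCount), "invalid setmaxnreg count");
  asm volatile("setmaxnreg.dec.sync.aligned.u32 %0;\n" : : "n"(RegCount));
}

// Placeholder values (assumptions): leave the printf() path more than 24
// registers, and keep the pair within the budget 80 + 80 - 40 = 120.
constexpr uint32_t kLowRegs = 40;
constexpr uint32_t kHighRegs = 120;

__global__ __launch_bounds__(768, 1)
void divergentKernelFixed() {
  int warpgroup_id = threadIdx.x / 128;
  bool leader = (threadIdx.x % 128) == 0;  // one message per warp group

  if ((warpgroup_id & 0x01) == 0) {
    warpgroup_reg_dealloc<kLowRegs>();
    if (leader) printf("decrease register \n");
  } else {
    warpgroup_reg_alloc<kHighRegs>();
    if (leader) printf("increase register \n");
  }
}

int main() {
  divergentKernelFixed<<<1, 768>>>();
  cudaError_t launch_err = cudaGetLastError();
  cudaError_t sync_err = cudaDeviceSynchronize();
  if (launch_err != cudaSuccess || sync_err != cudaSuccess) {
    fprintf(stderr, "launch: %s, sync: %s\n",
            cudaGetErrorString(launch_err), cudaGetErrorString(sync_err));
    return 1;
  }
  return 0;
}
```

If the illegal instruction error still shows up with these values, the next thing to try would be a larger kLowRegs (with kHighRegs lowered to match), since the exact printf requirement was not confirmed in this thread.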