Cuda API error detected: cudaLaunchKernel returned (0x2bd)

Hey, I was trying to copy data in global memory. Every block transfers 32*block_size*sizeof(double) bytes of data, but an error occurred. My code is as follows:

#include<cuda_runtime.h>
#include<iostream>

__global__ void copyReg_v2(double * in,double * out)
{
    int idx=threadIdx.x+blockDim.x*blockIdx.x*32;   // each block copies a contiguous chunk of 32*blockDim.x doubles
    
    double tmp0 = in[idx];
    double tmp1 = in[idx+blockDim.x];
    double tmp2 = in[idx+blockDim.x*2];
    double tmp3 = in[idx+blockDim.x*3];
    double tmp4 = in[idx+blockDim.x*4];
    double tmp5 = in[idx+blockDim.x*5];
    double tmp6 = in[idx+blockDim.x*6];
    double tmp7 = in[idx+blockDim.x*7];
    double tmp8 = in[idx+blockDim.x*8];
    double tmp9 = in[idx+blockDim.x*9];
    double tmp10 = in[idx+blockDim.x*10];
    double tmp11 = in[idx+blockDim.x*11];
    double tmp12 = in[idx+blockDim.x*12];
    double tmp13 = in[idx+blockDim.x*13];
    double tmp14 = in[idx+blockDim.x*14];
    double tmp15 = in[idx+blockDim.x*15];
    double tmp16 = in[idx+blockDim.x*16];
    double tmp17 = in[idx+blockDim.x*17];
    double tmp18 = in[idx+blockDim.x*18];
    double tmp19 = in[idx+blockDim.x*19];
    double tmp20 = in[idx+blockDim.x*20];
    double tmp21 = in[idx+blockDim.x*21];
    double tmp22 = in[idx+blockDim.x*22];
    double tmp23 = in[idx+blockDim.x*23];
    double tmp24 = in[idx+blockDim.x*24];
    double tmp25 = in[idx+blockDim.x*25];
    double tmp26 = in[idx+blockDim.x*26];
    double tmp27 = in[idx+blockDim.x*27];
    double tmp28 = in[idx+blockDim.x*28];
    double tmp29 = in[idx+blockDim.x*29];
    double tmp30 = in[idx+blockDim.x*30];
    double tmp31 = in[idx+blockDim.x*31];

    out[idx] = tmp0 ;
    out[idx+blockDim.x] = tmp1 ;
    out[idx+blockDim.x*2] = tmp2 ;
    out[idx+blockDim.x*3] = tmp3 ;
    out[idx+blockDim.x*4] = tmp4 ;
    out[idx+blockDim.x*5] = tmp5 ;
    out[idx+blockDim.x*6] = tmp6 ;
    out[idx+blockDim.x*7] = tmp7 ;
    out[idx+blockDim.x*8] = tmp8 ;
    out[idx+blockDim.x*9] = tmp9 ;
    out[idx+blockDim.x*10] = tmp10;
    out[idx+blockDim.x*11] = tmp11;
    out[idx+blockDim.x*12] = tmp12;
    out[idx+blockDim.x*13] = tmp13;
    out[idx+blockDim.x*14] = tmp14;
    out[idx+blockDim.x*15] = tmp15;
    out[idx+blockDim.x*16] = tmp16;
    out[idx+blockDim.x*17] = tmp17;
    out[idx+blockDim.x*18] = tmp18;
    out[idx+blockDim.x*19] = tmp19;
    out[idx+blockDim.x*20] = tmp20;
    out[idx+blockDim.x*21] = tmp21;
    out[idx+blockDim.x*22] = tmp22;
    out[idx+blockDim.x*23] = tmp23;
    out[idx+blockDim.x*24] = tmp24;
    out[idx+blockDim.x*25] = tmp25;
    out[idx+blockDim.x*26] = tmp26;
    out[idx+blockDim.x*27] = tmp27;
    out[idx+blockDim.x*28] = tmp28;
    out[idx+blockDim.x*29] = tmp29;
    out[idx+blockDim.x*30] = tmp30;
    out[idx+blockDim.x*31] = tmp31;
    
}

using mt = double;
int main(){
  size_t sz = 4096;
  size_t msz = sz*sz;

  mt *d_in, *d_out;

  cudaMalloc(&d_in, sizeof(double)*msz);
  cudaMalloc(&d_out, sizeof(double)*msz);

  size_t block_size = 1024;
  dim3 grid_2 = dim3(msz/block_size/32);   // 4096*4096/1024/32 = 512 blocks, each copying 32*1024 doubles
  dim3 block = dim3(block_size);
   
  copyReg_v2<<<grid_2, block>>>(d_in, d_out);
  cudaDeviceSynchronize();

  cudaFree(d_in);
  cudaFree(d_out);
}

I want to measure the kernel execution time, so I compiled the code and used nsys to profile it.

nvcc  -arch=sm_80 test.cu ; nsys profile  --stats=true  a.out

However, the nsys output shows that no kernel ran:

[5/8] Executing 'cuda_api_sum' stats report

 Time (%)  Total Time (ns)  Num Calls   Avg (ns)    Med (ns)   Min (ns)  Max (ns)   StdDev (ns)           Name
 --------  ---------------  ---------  ----------  ----------  --------  ---------  -----------  ----------------------
     99.7        126034561          2  63017280.5  63017280.5    104557  125930004   88972026.8  cudaMalloc
      0.2           245705          2    122852.5    122852.5     99468     146237      33070.7  cudaFree
      0.1            98184          1     98184.0     98184.0     98184      98184          0.0  cudaLaunchKernel
      0.0             3827          1      3827.0      3827.0      3827       3827          0.0  cudaDeviceSynchronize
      0.0             1110          1      1110.0      1110.0      1110       1110          0.0  cuModuleGetLoadingMode

[6/8] Executing 'cuda_gpu_kern_sum' stats report
SKIPPED: /home/xxxx/problem-to-be-solved/data_copy/report3.sqlite does not contain CUDA kernel data.

I tried to debug with cuda-gdb; when I stepped into the kernel, cuda-gdb output:

warning: Cuda API error detected: cudaLaunchKernel returned (0x2bd)

So what’s wrong with my kernel? My device is an A800 80GB, and the CUDA version is 12.1.

It is good practice to:

  1. Use proper CUDA error checking. I suggest doing that before asking others for help. With good CUDA error checking, you will get a text description of the error rather than a numerical code (a minimal sketch follows this list).

  2. Run CUDA codes with compute-sanitizer before attempting to use the profilers. What does that tool say when you run your code under it?
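
For reference, here is a minimal error-checking sketch. The CHECK_CUDA macro name is my own, not part of any CUDA API; wrapping the launch in cudaGetLastError() plus cudaDeviceSynchronize() would have reported error 701 with its text description directly:

#include <cstdlib>

#define CHECK_CUDA(call)                                                   \
  do {                                                                     \
    cudaError_t err_ = (call);                                             \
    if (err_ != cudaSuccess) {                                             \
      std::cerr << "CUDA error: " << cudaGetErrorString(err_)              \
                << " at " << __FILE__ << ":" << __LINE__ << std::endl;     \
      std::exit(1);                                                        \
    }                                                                      \
  } while (0)

  // ...
  copyReg_v2<<<grid_2, block>>>(d_in, d_out);
  CHECK_CUDA(cudaGetLastError());        // reports launch errors such as "too many resources requested for launch"
  CHECK_CUDA(cudaDeviceSynchronize());   // reports errors raised during kernel execution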

0x2bd is 701, which is cudaErrorLaunchOutOfResources: “too many resources requested for launch”.

A good starting point for that error is to recompile your code with -Xptxas=-v, which will cause the compiler to print the resource requirements for each kernel. In my case (CUDA 12.2) I see that your kernel requires 128 registers per thread. Multiplied by 1024 threads per block, that works out to 128K registers needed for one threadblock. But no current CUDA GPU has that many registers per SM (your A800 has a 64K-register file per SM), so that is the source of the error.
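
For reference, the recompile step looks like this (the exact numbers in the "ptxas info : Used N registers" line will vary with CUDA version and architecture):

nvcc -arch=sm_80 -Xptxas=-v test.cu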

You could (some possibilities):

  1. Rewrite the code to reduce register usage. To me, how to do this seems self-evident: you are creating a lot of local variables; instead, do your copying in chunks. (It might also be interesting to investigate the cause of the register usage; see below.)

  2. Use either the -maxrregcount compiler switch (documented in the NVCC manual) or the __launch_bounds__ directive (documented in the programming guide) to force the compiler to use fewer registers (a sketch follows this list).

  3. Reduce the block size to some number smaller than (or perhaps equal to) 65536/128 = 512.
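
As a sketch of option 2 (the numbers here are illustrative): annotating the kernel with __launch_bounds__(1024) tells the compiler it will be launched with up to 1024 threads per block, so it must hold register usage to at most 65536/1024 = 64 per thread, possibly by spilling. The -maxrregcount switch does the same thing for the whole file from the command line.

// Per-kernel bound on the maximum block size, which caps register usage accordingly
__global__ void __launch_bounds__(1024) copyReg_v2(double * in, double * out)
{
    // ... kernel body unchanged ...
}

// Or a whole-file register cap at compile time:
// nvcc -arch=sm_80 -maxrregcount=64 -Xptxas=-v test.cu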

Yes, I agree that it does not seem like that many registers should be needed, but that isn’t the crux of your question as far as I can see; even if you used only the obvious 64 registers plus a few more for housekeeping, you still would not be able to launch a block of 1024 threads.

The CUDA documentation is at docs.nvidia.com/cuda. Look along the left-hand side to find links to the NVCC manual and the programming guide; you can then do a text search in the NVCC manual for maxrregcount and in the programming guide for __launch_bounds__.

I was able to coax the compiler to use fewer registers with the following kernel refactor:

__global__ void copyReg_v2(double * in,double * out)
{
    int cidx=threadIdx.x+blockDim.x*blockIdx.x*32;
    int idx = cidx;
    double tmp[32];
    #pragma unroll 32
    for (int i = 0; i < 32; i++){
      tmp[i] = in[idx];
      idx += blockDim.x;}
    idx=cidx;
    #pragma unroll 32
    for (int i = 0; i < 32; i++){
      out[idx] = tmp[i] ;
      idx += blockDim.x;}
}

That uses 72 registers and I think should be as fast as your original realization. You still can’t launch 1024 threads per block with 72 registers per thread, however. If you limit it to 64 registers per thread, I think you will find local/register spilling happening, and my guess is that it might be slower.
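For completeness, if you keep the 72-register version, dropping the block size in main() lets the launch fit the register budget (512 * 72 = 36864 registers per block, well under the 65536 limit, whereas 1024 * 72 is not); a sketch, assuming the rest of main() is unchanged:

  size_t block_size = 512;                 // 512 threads * 72 registers per thread fits; 1024 * 72 does not
  dim3 grid_2 = dim3(msz/block_size/32);   // grid scales up so the total data copied is unchanged
  dim3 block = dim3(block_size);

  copyReg_v2<<<grid_2, block>>>(d_in, d_out);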