Cuda API error detected: cudaLaunchKernel returned (0x2bd)

It is good practice to:

  1. Use proper CUDA error checking. I suggest doing that before asking others for help. With good CUDA error checking, you will get a text description of the error rather than a numeric code.

  2. Run your CUDA code under compute-sanitizer before attempting to use the profilers. What does that tool say when you run your code under it?

0x2bd is 701 in decimal, which is cudaErrorLaunchOutOfResources: “too many resources requested for launch”.

A good starting point for that error is to recompile your code with -Xptxas=-v, which causes the compiler to print the resource requirements of the kernel. In my case (CUDA 12.2) I see that your kernel requires 128 registers per thread. Multiplied by 1024 threads per block, that works out to 128K registers for a single block. No current CUDA GPU has that many registers per SM (the limit is 64K, i.e. 65536), so that is the source of the error.
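To make the arithmetic concrete, here is a minimal host-side sketch of the check. The function and constant names are hypothetical (they are not part of any CUDA API), and the 65536-register figure is assumed from the numbers in this thread; on a real system you would query the device properties instead. It also ignores register allocation granularity.

```cpp
#include <cstdint>

// Assumption: 65536 32-bit registers per SM, the figure used in this thread.
constexpr std::uint32_t kRegistersPerSM = 65536;

// Registers needed by one thread block (allocation granularity ignored).
constexpr std::uint32_t registersPerBlock(std::uint32_t regsPerThread,
                                          std::uint32_t threadsPerBlock) {
    return regsPerThread * threadsPerBlock;
}

// True if the registers of a single block fit in one SM's register file.
constexpr bool blockFits(std::uint32_t regsPerThread,
                         std::uint32_t threadsPerBlock) {
    return registersPerBlock(regsPerThread, threadsPerBlock) <= kRegistersPerSM;
}

// The case from this thread: 128 registers/thread, 1024 threads/block.
static_assert(registersPerBlock(128, 1024) == 131072, "128 * 1024 = 128K");
static_assert(!blockFits(128, 1024), "128K registers do not fit in 64K");
```

Plug in the register count that -Xptxas=-v reports for your own kernel to see whether a given block size can launch at all.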

You could (some possibilities):

  1. Rewrite the code to reduce register usage. To me, how to do this seems self-evident: you are creating a lot of local variables, so do your copying in chunks instead. (It might also be interesting to investigate the cause of the register usage, see below.)

  2. Use either the -maxrregcount compiler switch (documented in the NVCC manual) or the __launch_bounds__ qualifier (documented in the programming guide) to force the compiler to use fewer registers.

  3. Reduce the block size to some number smaller than (or perhaps equal to) 65536/128 = 512.
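The limits behind options 2 and 3 can be sketched as simple host-side arithmetic. The names below are hypothetical helpers, not CUDA API calls, and the constants (65536 registers, warp size 32, at most 1024 threads per block) are assumptions that match the numbers in this thread and current GPUs. Treat the results as upper bounds, since the hardware allocates registers with some granularity.

```cpp
#include <cstdint>

// Assumptions: register file of 65536 registers, 32 threads per warp,
// and a hardware cap of 1024 threads per block.
constexpr std::uint32_t kRegisterFileSize = 65536;
constexpr std::uint32_t kWarpSize = 32;
constexpr std::uint32_t kHardwareMaxThreads = 1024;

// Largest block size (rounded down to a whole number of warps, capped at
// the hardware maximum) whose registers fit. regsPerThread must be nonzero.
constexpr std::uint32_t maxThreadsPerBlock(std::uint32_t regsPerThread) {
    return ((kRegisterFileSize / regsPerThread) / kWarpSize * kWarpSize)
                   > kHardwareMaxThreads
               ? kHardwareMaxThreads
               : (kRegisterFileSize / regsPerThread) / kWarpSize * kWarpSize;
}

// Largest per-thread register budget that still allows the given block size.
// threadsPerBlock must be nonzero.
constexpr std::uint32_t maxRegsPerThread(std::uint32_t threadsPerBlock) {
    return kRegisterFileSize / threadsPerBlock;
}

// Option 3: at 128 registers per thread, the block can have at most 512 threads.
static_assert(maxThreadsPerBlock(128) == 512, "65536 / 128 = 512");
// Option 2: to keep 1024 threads, each thread can use at most 64 registers.
static_assert(maxRegsPerThread(1024) == 64, "65536 / 1024 = 64");
```

So keeping 1024 threads per block means capping the kernel at 64 registers per thread, for example with -maxrregcount=64 or __launch_bounds__(1024). Forcing the count down may make the compiler spill registers to local memory, which can cost performance.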

Yes, I agree, it does not seem like that many registers should be needed, but as far as I can see that isn’t the crux of your question. Even if the kernel used only the obvious 64 registers plus a few more for housekeeping, you would still not be able to launch a block with 1024 threads.

CUDA documentation is here. Look along the left-hand side to find links to the NVCC manual and the programming guide. You can then do a text search in the NVCC manual for maxrregcount and a text search in the programming guide for __launch_bounds__.