Launch bounds with 2D Threads dimension (x,y)


I read about the __launch_bounds__ qualifier in the CUDA Programming Guide, and the example there seems to apply only to a single dimension. As I am using a 2D kernel, do you know how to apply launch bounds in my application?


__launch_bounds__() doesn’t care about the layout, only the total number of threads. So you can safely use blockDim.x * blockDim.y * blockDim.z (where blockDim.z == 1 in your case) for the number of threads per block.
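As a rough illustration of the point above, here is a minimal sketch of a 2D kernel whose launch bound is the product of the block dimensions. The kernel name scale2D, the launcher launchScale2D, and the 32 x 16 block shape are hypothetical and chosen only for this example.

```cuda
#include <cuda_runtime.h>

// Hypothetical block shape for illustration: 32 x 16 x 1 = 512 threads.
constexpr int kBlockX = 32;
constexpr int kBlockY = 16;
// The bound is the total thread count: blockDim.x * blockDim.y * blockDim.z (z == 1 here).
constexpr int kMaxThreadsPerBlock = kBlockX * kBlockY;

// Only the total matters to __launch_bounds__, not how it is split across x and y.
__global__ void __launch_bounds__(kMaxThreadsPerBlock)
scale2D(float* data, int width, int height, float factor)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        data[y * width + x] *= factor;
    }
}

// Host-side launcher: the block shape must satisfy x * y * z <= kMaxThreadsPerBlock.
cudaError_t launchScale2D(float* d_data, int width, int height, float factor)
{
    dim3 block(kBlockX, kBlockY, 1);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y,
              1);
    scale2D<<<grid, block>>>(d_data, width, height, factor);
    return cudaGetLastError();
}
```

Launching with a block that has more threads in total than the declared bound (for example 32 x 32) would be expected to fail at launch time, so the bound should match the largest block you ever intend to use.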

Ok. So why am I receiving this compilation warning: CUDACOMPILE : ptxas warning : Value of threads per SM for entry _Z30MyKernelPfS_S_S_PKfS_S_fffbiiS_b is out of range. .minnctapersm will be ignored?

  • maxThreadsPerBlock specifies the maximum number of threads per block with which the application will ever launch MyKernel(); it compiles to the .maxntid PTX directive
  • minBlocksPerMultiprocessor is optional and specifies the desired minimum number of resident blocks per multiprocessor; it compiles to the .minnctapersm PTX directive

I was able to reproduce the error doing this…

__launch_bounds__( 512, 8 ) __global__ void doMath() {...}

You get that warning because you are asking for a minimum of 8 blocks per SM with a maximum of 512 threads per block, so you are asking for 8 × 512 = 4096 resident threads per SM. Most architectures allow a maximum of 2048 threads per SM (compute capability 7.5 allows only 1024). Therefore, the compiler is ignoring your minBlocksPerMultiprocessor parameter.

If you were to do the following, the warning should go away on an architecture with the 2048-thread limit.

__launch_bounds__( 512, 4 ) __global__ void doMath() {...}
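The arithmetic behind that choice can be sketched as follows. The helper names maxResidentBlocks and queryMaxThreadsPerSM are hypothetical; the sketch assumes the per-SM limit is read from the maxThreadsPerMultiProcessor field of cudaDeviceProp.

```cuda
#include <cuda_runtime.h>

// Largest minBlocksPerMultiprocessor value that still fits within the
// SM's resident-thread limit for a given maxThreadsPerBlock.
constexpr int maxResidentBlocks(int maxThreadsPerSM, int maxThreadsPerBlock)
{
    return maxThreadsPerSM / maxThreadsPerBlock;
}

// Reads the resident-thread limit per SM for a device; returns -1 on failure.
int queryMaxThreadsPerSM(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return -1;
    }
    return prop.maxThreadsPerMultiProcessor;
}
```

With 512 threads per block, a 2048-thread limit gives at most 4 blocks per SM, while a 1024-thread limit (compute capability 7.5) gives at most 2, so the second argument would need to be chosen per target architecture.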

Thanks !!!

Indeed, I have an RTX 2060 (7.5). So I get this error when exceeding the maximum of 1024.

  1. Do you know why the limit was lowered for the Turing architecture?
  2. In my case, is it still worth using launch bounds?

It really depends on the application. __launch_bounds__ is just an additional way to hand-tune for performance. I would suggest using Nsight Compute to determine the number of registers you are using and then go from there.
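As a quick complement to profiling (not a replacement for Nsight Compute), the runtime API can also report a kernel's register count and the number of resident blocks it predicts per SM. This is only a sketch: the doMath body, its parameters, the (512, 2) bounds, and the helper reportKernelResources are placeholders invented for illustration.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Placeholder kernel standing in for the real one; bounds chosen to fit a 1024-thread SM.
__global__ void __launch_bounds__(512, 2)
doMath(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] = data[i] * 2.0f + 1.0f;
    }
}

// Prints registers per thread and the predicted resident blocks per SM
// for the given block size (no dynamic shared memory assumed).
cudaError_t reportKernelResources(int blockSize)
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, doMath);
    if (err != cudaSuccess) {
        return err;
    }

    int blocksPerSM = 0;
    err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, doMath, blockSize, 0);
    if (err != cudaSuccess) {
        return err;
    }

    std::printf("registers per thread: %d\n", attr.numRegs);
    std::printf("max threads per block for this kernel: %d\n", attr.maxThreadsPerBlock);
    std::printf("resident blocks per SM at block size %d: %d\n", blockSize, blocksPerSM);
    return cudaSuccess;
}
```

Comparing these numbers with and without the __launch_bounds__ qualifier is one way to see whether the bounds actually change register allocation for a given kernel.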

Ok thanks