Launch bounds with 2D Threads dimension (x,y)

Abdopensky · December 19, 2019, 3:30pm

Hello,

I read about the __launch_bounds__ qualifier in the CUDA Programming Guide, and the example there seems to apply only to a single dimension. Since I am using a 2D kernel, do you know how to apply launch bounds in my application?

Thanks

tera · December 19, 2019, 3:44pm

__launch_bounds__() doesn’t care about the layout, only the total number of threads. So you can safely use blockDim.x * blockDim.y * blockDim.z (where blockDim.z == 1 in your case) for the number of threads per block.
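To make that concrete, here is a minimal sketch of a 2D kernel whose bound is the product of the block dimensions. The kernel name scale2D, the 16 x 16 tile size, and the array dimensions are hypothetical choices for illustration, not taken from the original poster's code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical tile size, chosen only for illustration.
constexpr int BLOCK_X = 16;
constexpr int BLOCK_Y = 16;

// The bound is the total thread count of the block: 16 * 16 * 1 = 256.
__global__ void __launch_bounds__(BLOCK_X * BLOCK_Y)
scale2D(float *data, int width, int height, float factor)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height)
        data[row * width + col] *= factor;
}

int main()
{
    const int width = 1000, height = 600;
    float *d_data = nullptr;
    if (cudaMalloc(&d_data, width * height * sizeof(float)) != cudaSuccess) {
        printf("allocation failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, width * height * sizeof(float));

    dim3 block(BLOCK_X, BLOCK_Y);  // 2D block, 256 threads in total
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    scale2D<<<grid, block>>>(d_data, width, height, 2.0f);

    // A launch with more threads than the declared bound would fail here.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();
    printf("kernel status: %s\n", cudaGetErrorString(err));

    cudaFree(d_data);
    return err == cudaSuccess ? 0 : 1;
}
```

Keeping the tile size in compile-time constants lets the same values feed both the qualifier and the dim3 used at launch, so the two cannot drift apart.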

Abdopensky · December 19, 2019, 3:50pm

OK. So why am I receiving this compilation warning: CUDACOMPILE : ptxas warning : Value of threads per SM for entry _Z30MyKernelPfS_S_S_PKfS_S_fffbiiS_b is out of range. .minnctapersm will be ignored?

mnicely · December 22, 2019, 11:07pm

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html

maxThreadsPerBlock specifies the maximum number of threads per block with which the application will ever launch MyKernel(); it compiles to the .maxntid PTX directive.
minBlocksPerMultiprocessor is optional and specifies the desired minimum number of resident blocks per multiprocessor; it compiles to the .minnctapersm PTX directive.

I was able to reproduce the error doing this…

__launch_bounds__( 512, 8 ) __global__ void doMath() {...}

You get that warning because you are asking for a minimum of 8 blocks per SM with a maximum of 512 threads per block, so you are asking for 4096 threads per SM. Most architectures allow a maximum of 2048 resident threads per SM (compute capability 7.5 allows 1024). Therefore, the compiler is ignoring your minBlocksPerMultiprocessor parameter.

If you were to do the following, which asks for 2048 threads per SM, the warning should go away on architectures with a 2048-thread limit.

__launch_bounds__( 512, 4 ) __global__ void doMath() {...}
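The arithmetic above can be checked against the device you actually have. This is a rough sketch, assuming the ptxas check amounts to comparing maxThreadsPerBlock * minBlocksPerMultiprocessor with the per-SM thread limit as described in this post; boundsFit is a hypothetical helper, while cudaGetDeviceProperties and the maxThreadsPerMultiProcessor field come from the CUDA runtime API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: true when the requested bounds fit within the
// resident-thread limit of one multiprocessor.
constexpr bool boundsFit(int maxThreadsPerBlock, int minBlocksPerSM,
                         int maxThreadsPerSM)
{
    return maxThreadsPerBlock * minBlocksPerSM <= maxThreadsPerSM;
}

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("no usable CUDA device\n");
        return 1;
    }

    // Per-SM limit reported by the runtime for device 0.
    const int limit = prop.maxThreadsPerMultiProcessor;
    printf("compute capability %d.%d allows %d threads per SM\n",
           prop.major, prop.minor, limit);

    // Try a few candidate values for minBlocksPerMultiprocessor.
    const int blocks[] = {8, 4, 2};
    for (int b : blocks)
        printf("__launch_bounds__(512, %d): %s\n", b,
               boundsFit(512, b, limit) ? "fits" : "exceeds the limit");
    return 0;
}
```

On a device with a 1024-thread limit, only the (512, 2) combination passes this check, which is why the (512, 4) suggestion can still trigger the warning there.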

Abdopensky · December 23, 2019, 1:19am

Thanks !!!

Indeed, I have an RTX 2060 (7.5). So I get this warning when exceeding the maximum of 1024.

Do you know why the limit was lowered for the Turing architecture?
In my case, is it still worth using launch bounds?

mnicely · December 23, 2019, 2:26am

It really depends on the application. __launch_bounds__ is just an additional way to hand-tune for performance. I would suggest using Nsight Compute to help determine the number of registers you are using and then go from there.
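For a quick first look before profiling, the runtime API can also report the register count and an occupancy estimate for a compiled kernel. This is only a sketch and not a substitute for Nsight Compute: the kernel body and the (256, 2) bounds are hypothetical, while cudaFuncGetAttributes and cudaOccupancyMaxActiveBlocksPerMultiprocessor are existing runtime calls.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel used only for illustration.
__global__ void __launch_bounds__(256, 2)
doMath(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = sqrtf(static_cast<float>(i));
}

int main()
{
    // Registers per thread and the effective block-size limit of the kernel.
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, doMath) != cudaSuccess) {
        printf("could not query kernel attributes\n");
        return 1;
    }

    // Resident blocks per SM for 256-thread blocks, no dynamic shared memory.
    int blocksPerSM = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, doMath, 256, 0) != cudaSuccess) {
        printf("could not query occupancy\n");
        return 1;
    }

    printf("registers per thread: %d\n", attr.numRegs);
    printf("max threads per block for this kernel: %d\n",
           attr.maxThreadsPerBlock);
    printf("resident blocks per SM at 256 threads: %d\n", blocksPerSM);
    return 0;
}
```

Comparing these numbers with and without the qualifier gives a rough idea of whether the bounds changed register allocation at all.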

Abdopensky · December 23, 2019, 2:41am

Ok thanks