I read about the __launch_bounds__ qualifier in the CUDA Programming Guide, and the example there seems to cover only a one-dimensional block. As I am using a 2D kernel, do you know how to apply launch bounds in my application?
__launch_bounds__() doesn't care about the layout, only the total number of threads. So you can safely use blockDim.x * blockDim.y * blockDim.z (where blockDim.z == 1 in your case) as the number of threads per block.
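A minimal sketch of how this might look for a 2D kernel. The 16 x 16 block shape, the kernel name scale2D, and the launcher launchScale2D are hypothetical and not taken from the question; the only point is that the bound is the product of the block dimensions, written as a compile-time constant.

```cuda
#include <cuda_runtime.h>

// Hypothetical block shape; use whatever your kernel is actually launched with.
constexpr int BLOCK_X = 16;
constexpr int BLOCK_Y = 16;
// Total threads per block: 16 * 16 * 1 = 256 (blockDim.z == 1).
constexpr int MAX_THREADS_PER_BLOCK = BLOCK_X * BLOCK_Y;

// __launch_bounds__ takes the total thread count, not a per-dimension shape.
__global__ void
__launch_bounds__(MAX_THREADS_PER_BLOCK)
scale2D(float *data, int width, int height, float factor)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height) {
        data[row * width + col] *= factor;
    }
}

// Host-side launcher (illustrative). The block must not contain more threads
// than the declared bound, otherwise the launch fails.
cudaError_t launchScale2D(float *d_data, int width, int height, float factor)
{
    dim3 block(BLOCK_X, BLOCK_Y);
    dim3 grid((width + BLOCK_X - 1) / BLOCK_X,
              (height + BLOCK_Y - 1) / BLOCK_Y);
    scale2D<<<grid, block>>>(d_data, width, height, factor);
    return cudaGetLastError();
}
```

Any block shape with the same product (for example 32 x 8) would be covered by the same bound.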
OK. So why am I receiving this compilation warning: CUDACOMPILE : ptxas warning : Value of threads per SM for entry _Z30MyKernelPfS_S_S_PKfS_S_fffbiiS_b is out of range. .minnctapersm will be ignored?
maxThreadsPerBlock specifies the maximum number of threads per block with which the application will ever launch MyKernel(); it compiles to the .maxntid PTX directive.
minBlocksPerMultiprocessor is optional and specifies the desired minimum number of resident blocks per multiprocessor; it compiles to the .minnctapersm PTX directive.
You get that warning because you are asking for a minimum of 8 blocks per SM with a maximum of 512 threads per block, so you're asking for 8 * 512 = 4096 resident threads per SM. Most architectures allow a maximum of 2048 resident threads per SM (compute capability 7.5 allows only 1024, and some newer ones such as 8.6 and 8.9 allow 1536). The request cannot be satisfied, so the compiler ignores your minBlocksPerMultiprocessor parameter.
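The arithmetic can be checked in code. The sketch below is only an illustration: boundsFit and queryMaxThreadsPerSM are made-up helper names, and the 2048 figure is an assumed per-SM limit that should be replaced by the value your device actually reports.

```cuda
#include <cuda_runtime.h>

// Pure arithmetic: do the requested launch bounds fit in the per-SM thread limit?
constexpr bool boundsFit(int maxThreadsPerBlock, int minBlocksPerSM,
                         int maxThreadsPerSM)
{
    return maxThreadsPerBlock * minBlocksPerSM <= maxThreadsPerSM;
}

// The combination from the question: 512 * 8 = 4096 > 2048, so it cannot fit.
static_assert(!boundsFit(512, 8, 2048), "512 threads x 8 blocks exceeds 2048");

// Ask the runtime for the real limit of a given device (-1 on failure).
int queryMaxThreadsPerSM(int device)
{
    cudaDeviceProp prop{};
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return -1;
    }
    return prop.maxThreadsPerMultiProcessor;
}
```

Calling boundsFit(512, 8, queryMaxThreadsPerSM(0)) on the target machine would show whether a given pair of bounds can be honoured there.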
If you were to do the following, the warning should go away.
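As a sketch of one combination that stays within the limit, assuming a device with 2048 threads per SM: keeping 512 threads per block and asking for 4 blocks per SM gives 512 * 4 = 2048. The constant names, parameter list, and kernel body below are placeholders, not the code from the question.

```cuda
#include <cuda_runtime.h>

constexpr int FIXED_MAX_THREADS = 512;  // maximum threads per block, as in the question
constexpr int FIXED_MIN_BLOCKS  = 4;    // assumption: 512 * 4 = 2048 threads per SM

// Assumed per-SM limit of 2048; a compute capability 7.5 device would need 2 here.
static_assert(FIXED_MAX_THREADS * FIXED_MIN_BLOCKS <= 2048,
              "requested bounds exceed the assumed per-SM thread limit");

__global__ void
__launch_bounds__(FIXED_MAX_THREADS, FIXED_MIN_BLOCKS)
MyKernel(float *out, const float *in, int n)  // placeholder signature and body
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i];
    }
}
```

Lowering maxThreadsPerBlock instead (for example 256 threads with 8 blocks) satisfies the same product, provided the kernel is never launched with larger blocks.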
It really depends on the application. __launch_bounds__ is just an additional way to hand-tune for performance. I would suggest using Nsight Compute to determine how many registers your kernel is using and then go from there.