https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
- maxThreadsPerBlock specifies the maximum number of threads per block with which the application will ever launch MyKernel(); it compiles to the .maxntid PTX directive
- minBlocksPerMultiprocessor is optional and specifies the desired minimum number of resident blocks per multiprocessor; it compiles to the .minnctapersm PTX directive
I was able to reproduce the error doing this…
__launch_bounds__( 512, 8 ) __global__ void doMath() {...}
You get that error because you are asking for a minimum of 8 blocks per SM with a maximum of 512 threads per block, which works out to 8 × 512 = 4096 threads per SM. Most architectures allow at most 2048 resident threads per SM, not per block (compute capability 7.5 allows only 1024, and some newer ones such as 8.6 allow 1536). The request cannot be satisfied, so the compiler ignores your minBlocksPerMultiprocessor parameter.
If you were to do the following, the error should go away on a device with a 2048-thread limit per SM, because 4 × 512 = 2048 fits:
__launch_bounds__( 512, 4 ) __global__ void doMath() {...}
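The arithmetic behind both cases can be checked with a small host-side sketch. The helper names `requestedThreadsPerSM` and `launchBoundsFit` are hypothetical, not part of CUDA, and the per-SM thread limit is passed in as an assumed input rather than queried from the device. This only mirrors the threads-per-SM reasoning above; the real compiler also weighs register and shared-memory usage.

```cpp
// Host-side arithmetic only; compiles as plain C++ or inside a .cu file.
// Both helper names are hypothetical and used purely for illustration.

// Threads per SM implied by __launch_bounds__(maxThreadsPerBlock, minBlocksPerSM).
constexpr int requestedThreadsPerSM(int maxThreadsPerBlock, int minBlocksPerSM) {
    return maxThreadsPerBlock * minBlocksPerSM;
}

// True when the request fits within the SM's resident-thread limit.
// smThreadLimit is an assumed input (2048 on many architectures, 1024 on 7.5).
constexpr bool launchBoundsFit(int maxThreadsPerBlock, int minBlocksPerSM,
                               int smThreadLimit) {
    return requestedThreadsPerSM(maxThreadsPerBlock, minBlocksPerSM) <= smThreadLimit;
}

// The failing case from above: 512 x 8 = 4096 threads exceeds 2048.
static_assert(!launchBoundsFit(512, 8, 2048), "512 x 8 exceeds 2048");
// The suggested fix: 512 x 4 = 2048 threads fits exactly.
static_assert(launchBoundsFit(512, 4, 2048), "512 x 4 fits in 2048");
// On compute capability 7.5 (limit 1024), even 512 x 4 is too much.
static_assert(!launchBoundsFit(512, 4, 1024), "512 x 4 exceeds 1024");
```

Rather than hard-coding the limit, you can read it at runtime with cudaDeviceGetAttribute and cudaDevAttrMaxThreadsPerMultiProcessor.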