Confusion about max blocks per SM for sm_75


The CUDA Occupancy Calculator in the 10.1 SDK says that the maximum number of blocks per multiprocessor for sm_75 is 32.
However, if you pass 32 as minBlocksPerMultiprocessor to __launch_bounds__, ptxas complains. In addition, the profiler shows that the block limit per multiprocessor is 16.
I guess the documentation (i.e. the occupancy .xls) is wrong?


The maximum number of resident blocks per SM on Turing (cc 7.5) is 16.

This is a known bug with the occupancy calculator, but it should have been fixed in the latest 10.1 CUDA release (10.1 U1, i.e. 10.1.168). If not, it should be fixed in the next CUDA release.
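
To see how the spreadsheet's figure changes the predicted occupancy, here is a minimal host-side sketch of the block-count arithmetic. The helper name `residentBlocksPerSm` is hypothetical (it is not a CUDA API), and the 1024 resident threads per SM for cc 7.5 is taken from the compute capability table as an assumption of the sketch. It only considers the block limit and the thread limit; registers and shared memory can lower the real number further.

```cpp
// Hypothetical helper, not part of the CUDA toolkit.
// Upper bound on resident blocks per SM, considering only:
//   - the hardware limit on blocks per SM
//   - the hardware limit on threads per SM
// Register and shared-memory limits are deliberately ignored.
constexpr int residentBlocksPerSm(int maxBlocksPerSm,
                                  int maxThreadsPerSm,
                                  int threadsPerBlock) {
    // Thread-limited block count (integer division rounds down).
    return (maxThreadsPerSm / threadsPerBlock) < maxBlocksPerSm
               ? (maxThreadsPerSm / threadsPerBlock)
               : maxBlocksPerSm;
}

// cc 7.5 with the correct block limit of 16 and 32-thread blocks:
// the block limit is the binding constraint.
static_assert(residentBlocksPerSm(16, 1024, 32) == 16,
              "block limit binds");

// Same launch with the spreadsheet's erroneous limit of 32:
// the predicted block count doubles.
static_assert(residentBlocksPerSm(32, 1024, 32) == 32,
              "erroneous limit overestimates");

// With 256-thread blocks the thread limit binds, so the wrong
// block limit makes no difference.
static_assert(residentBlocksPerSm(16, 1024, 256) == 4,
              "thread limit binds");
```

So the discrepancy only matters for small blocks (fewer than 64 threads under these assumptions); for larger blocks the thread limit is reached first and both figures give the same result.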

Thank you!