The CUDA Occupancy Calculator in the 10.1 SDK says that for sm_75 the maximum number of blocks per multiprocessor is 32.
However, if you pass 32 as minBlocksPerMultiprocessor to __launch_bounds__, ptxas complains. In addition, the profiler reports a block limit of 16 per multiprocessor.
I guess the documentation (i.e. the occupancy calculator .xls) is wrong?
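
For anyone who wants to check this on their own card, here is a minimal sketch that asks the runtime how many blocks of a given kernel can be resident per multiprocessor. The kernel name dummyKernel and the block size of 32 threads are assumptions made up for illustration; only __launch_bounds__ and cudaOccupancyMaxActiveBlocksPerMultiprocessor are real CUDA names. It is meant to be built with something like nvcc -arch=sm_75.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, used only for the occupancy query below.
// 32 threads per block, requesting at least 16 resident blocks per multiprocessor.
// Changing the second argument to 32 is the case that makes ptxas complain on sm_75.
__global__ void __launch_bounds__(32, 16) dummyKernel(float *out)
{
    out[blockIdx.x * blockDim.x + threadIdx.x] = 1.0f;
}

int main()
{
    const int blockSize = 32;   // assumed block size, matches the launch bounds above
    int blocksPerSM = 0;

    // Ask the runtime how many blocks of this kernel fit on one multiprocessor
    // with the given block size and no dynamic shared memory.
    cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, dummyKernel, blockSize, 0);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "occupancy query failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    std::printf("max active blocks per multiprocessor: %d\n", blocksPerSM);
    return 0;
}
```

The kernel is never launched; the query alone should show whether the runtime agrees with the profiler or with the spreadsheet. With such a small kernel, registers and shared memory are unlikely to be the limiting factor, so the reported number should reflect the hardware block limit.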