How to limit the maximum number of active blocks per SM

Hi,
I have several versions of a fairly complex kernel with heavy arithmetic and memory access.
After a lot of experimentation I found the optimal number of threads per block, which is actually a compromise between
decent occupancy and maximum cache and texture hit rate.

I obtained these results:
version 1 uses 34 registers per thread, with 7 active blocks per SM and 0.58 occupancy, and takes 20 ms
version 2 uses 40 registers per thread, with 6 active blocks per SM and 0.5 occupancy, and takes 18 ms, even though it does the same thing as version 1 plus some other math work

In my situation, raising occupancy above a certain limit (0.5 in this particular case) or increasing the number of active threads per block
only degrades performance, because of a lower cache and texture hit rate.

I can control the minimum number of active blocks per SM with __launch_bounds__, but what about the maximum number?
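
For readers who have not used the qualifier, here is a minimal sketch of __launch_bounds__. The kernel name scaleKernel, the block size of 128 and the minimum of 6 blocks are assumptions made up for illustration, not values from the real kernel. The second argument only asks the compiler to keep register usage low enough that at least that many blocks can be resident on an SM; it places no upper bound on them.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the real one.
// 128 threads per block and a minimum of 6 resident blocks are assumed values.
__global__ void
__launch_bounds__(128, 6)  // (maxThreadsPerBlock, minBlocksPerMultiprocessor)
scaleKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    const int threadsPerBlock = 128;  // must not exceed the first launch bound

    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scaleKernel<<<blocks, threadsPerBlock>>>(d, n);

    // Report whether the launch and execution succeeded.
    cudaError_t err = cudaDeviceSynchronize();
    printf("%s\n", cudaGetErrorString(err));

    cudaFree(d);
    return err == cudaSuccess ? 0 : 1;
}
```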

So my question is: how can I cap the maximum number of active blocks per SM without changing the number of active threads per block?

Alexis.

I am a little confused. I thought that only one block could be active on an SM at a time.

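Allocate enough shared memory per block that more blocks won’t fit on an SM. This can be conveniently done by specifying a nonzero third parameter of the launch configuration (the dynamic shared memory size in bytes), even if the kernel never uses that memory.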
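
A rough sketch of that suggestion follows. The kernel myKernel, the block size of 128 and the cap of 6 blocks per SM are placeholders chosen for illustration. Rather than hard-coding a byte count, the sketch asks the runtime (cudaOccupancyMaxActiveBlocksPerMultiprocessor) how many blocks would be resident for a given dynamic shared memory request, and grows the request in 256-byte steps (an arbitrary step) until the cap is met.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel; it never touches the padding.
__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 2.0f * data[i] + 1.0f;
}

int main()
{
    const int threadsPerBlock = 128;  // assumed block size
    const int maxBlocksWanted = 6;    // assumed cap on resident blocks per SM
    const int n = 1 << 20;
    const size_t step = 256;          // arbitrary search granularity in bytes

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Grow the dynamic shared memory request until the runtime reports
    // that no more than maxBlocksWanted blocks fit on one SM.
    size_t padBytes = 0;
    int blocksPerSM = 0;
    for (padBytes = 0; padBytes <= prop.sharedMemPerBlock; padBytes += step) {
        cudaError_t q = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, myKernel, threadsPerBlock, padBytes);
        if (q != cudaSuccess) {
            printf("occupancy query failed: %s\n", cudaGetErrorString(q));
            return 1;
        }
        if (blocksPerSM <= maxBlocksWanted) break;
    }

    // Zero resident blocks means the kernel could not launch at all;
    // more than the cap means the default per-block limit was too small.
    if (blocksPerSM < 1 || blocksPerSM > maxBlocksWanted) {
        printf("could not reach the requested cap with shared memory padding\n");
        return 1;
    }
    printf("padding %zu bytes -> %d blocks per SM\n", padBytes, blocksPerSM);

    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    // The third launch parameter reserves padBytes of dynamic shared memory
    // per block, which is what limits how many blocks are co-resident.
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    myKernel<<<blocks, threadsPerBlock, padBytes>>>(d, n);

    cudaError_t err = cudaDeviceSynchronize();
    printf("%s\n", cudaGetErrorString(err));

    cudaFree(d);
    return err == cudaSuccess ? 0 : 1;
}
```

If the kernel already uses shared memory, the query takes its static usage into account, so the padding found this way should come out smaller. A very low cap (for example one block per SM) may need more than the default per-block shared memory limit, in which case this sketch simply reports failure.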

Thanks, solved, with a 10% performance improvement.