I have several versions of a fairly complex kernel with heavy arithmetic and memory access.
After many trials, I found the optimal number of threads per block, which is a compromise between
decent occupancy and maximum cache and texture hit rate.
I obtained these results:
Version 1 uses 34 registers, with 7 active blocks per SM and 0.58 occupancy, and takes 20 ms.
Version 2 uses 40 registers, with 6 active blocks per SM and 0.5 occupancy, and takes 18 ms, while doing the same thing as version 1 plus some extra math work.
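As a sanity check on these figures: if both versions are launched with the same block size, and occupancy is taken as resident warps divided by the maximum resident warps per SM, then the occupancy ratio should match the ratio of active blocks. The symbols W (warps per block) and W_max (maximum resident warps per SM) are introduced here only for illustration.

```latex
\frac{\text{occupancy}_1}{\text{occupancy}_2}
  = \frac{7 \, W / W_{\max}}{6 \, W / W_{\max}}
  = \frac{7}{6} \approx 1.17,
\qquad
\frac{0.58}{0.50} = 1.16
```

The two ratios agree to within rounding, so the drop in occupancy is fully explained by losing one resident block per SM.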
In my situation, raising occupancy above a certain limit (0.5 in this particular case) or increasing the number of active threads per block
only degrades performance, due to a lower cache and texture hit rate.
I can control the minimum number of active blocks per SM with __launch_bounds__, but what about the maximum number?
So my question is: how can I cap the maximum number of active blocks per SM without changing the number of active threads per block?
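For context, here is a minimal sketch of the minimum-blocks hint mentioned above, together with a runtime query of how many blocks per SM can actually be resident. The kernel body, the names `myKernel`, `kThreadsPerBlock` and `kMinBlocksPerSM`, and the values 128 and 6 are placeholders chosen for illustration, not the real kernel or its tuned configuration. The sketch assumes a toolkit that provides `cudaOccupancyMaxActiveBlocksPerMultiprocessor`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical values; the real block size is whatever was tuned for the kernel.
constexpr int kThreadsPerBlock = 128;
constexpr int kMinBlocksPerSM  = 6;

// The second argument of __launch_bounds__ is a *minimum*: it asks the compiler
// to limit register usage so that at least this many blocks can be resident
// per SM. It does not impose any upper limit on resident blocks.
__global__ void __launch_bounds__(kThreadsPerBlock, kMinBlocksPerSM)
myKernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] = 2.0f * data[i];  // placeholder for the real arithmetic
    }
}

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::printf("no usable CUDA device\n");
        return 1;
    }

    // Ask the runtime how many blocks of this kernel fit on one SM for the
    // given block size and 0 bytes of dynamic shared memory.
    int activeBlocks = 0;
    cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &activeBlocks, myKernel, kThreadsPerBlock, 0);
    if (err != cudaSuccess) {
        std::printf("occupancy query failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Occupancy as resident threads over the SM's maximum resident threads.
    float occupancy = static_cast<float>(activeBlocks * kThreadsPerBlock)
                    / static_cast<float>(prop.maxThreadsPerMultiProcessor);

    std::printf("active blocks per SM: %d, occupancy: %.2f\n",
                activeBlocks, occupancy);
    return 0;
}
```

The query reports the limit the hardware would allow, so it is useful for checking whether a change to the kernel or its launch configuration actually moved the number of resident blocks.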