Kudos for remembering to add “in most cases”!
On most compute capabilities, you can force blocks to be spread evenly by having each block claim all of an SM's shared memory: allocate the maximum permissible amount of shared memory per block, minus any statically allocated shared memory, as dynamically allocated shared memory via the third argument of the <<<>>> launch configuration operator.
This scheme may still fail on CCs 3.7, 5.2, 6.1 and 6.2, where the maximum permissible shared memory per block is half (or less) of an SM's total shared memory, so a second block can still become resident on the same SM.
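A minimal sketch of the shared-memory trick, assuming a hypothetical kernel `worker` with no static shared memory of its own (the block count and thread count are placeholders):

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel; assumed to use no static shared memory,
// so the full per-block limit can be claimed dynamically.
__global__ void worker() { /* ... real work ... */ }

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Claim the maximum shared memory a block may use, so that at most
    // one block fits per SM. This only forces an even spread where the
    // per-block limit exceeds half of the per-SM total (see the CC
    // caveat above); subtract any static shared memory the kernel uses.
    size_t dynSmem = prop.sharedMemPerBlock;

    // One block per SM, each occupying all of that SM's shared memory.
    worker<<<prop.multiProcessorCount, 256, dynSmem>>>();
    cudaDeviceSynchronize();
    return 0;
}
```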
Another option is to launch more blocks than there are SMs, discover at runtime how the active blocks are distributed across SMs, and exit all but one block per SM.
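One way to sketch this oversubscribe-and-exit approach is to read the SM id from the PTX special register `%smid` and let the first block on each SM claim it via an atomic flag; the flag array, kernel name, and launch parameters below are illustrative assumptions, not a fixed recipe:

```cuda
#include <cuda_runtime.h>

// Read the id of the SM this block is resident on.
__device__ unsigned smid() {
    unsigned id;
    asm("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

// claimed[] holds one zero-initialized flag per SM. The first block
// to CAS its SM's flag from 0 to 1 keeps running; every later block
// resident on the same SM exits immediately.
__global__ void one_block_per_sm(unsigned *claimed) {
    __shared__ int keep;
    if (threadIdx.x == 0)
        keep = (atomicCAS(&claimed[smid()], 0u, 1u) == 0u);
    __syncthreads();
    if (!keep)
        return;  // surplus block on an already-claimed SM

    // ... real work for the single surviving block on this SM ...
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    unsigned *claimed;
    cudaMalloc(&claimed, prop.multiProcessorCount * sizeof(unsigned));
    cudaMemset(claimed, 0, prop.multiProcessorCount * sizeof(unsigned));

    // Oversubscribe (here: 4 blocks per SM) and keep one block each.
    one_block_per_sm<<<4 * prop.multiProcessorCount, 256>>>(claimed);
    cudaDeviceSynchronize();
    cudaFree(claimed);
    return 0;
}
```

Note that `%smid` is documented as volatile across the lifetime of a block, so it should be read once and only used for this kind of best-effort placement, not for correctness-critical indexing.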
I notice this is getting complicated. But I have successfully used these techniques back in the Compute Capability 1.x days, when it was still possible to outperform the hardware block scheduler with a custom implementation.