You can force blocks to be spread evenly across SMs by having each block use all of the shared memory: allocate the maximum permissible amount of shared memory per block, minus any statically allocated shared memory, as dynamic shared memory via the third argument of the <<<>>> launch-configuration operator. This works for most compute capabilities.
This scheme may still fail on CCs 3.7, 5.2, 6.1 and 6.2, where the maximum permissible shared memory per block is half (or less) of an SM's total shared memory, so two or more blocks can still end up resident on the same SM.
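A minimal host-side sketch of that idea (assuming the kernel uses no static shared memory; the kernel body and launch parameters here are placeholders):

```cuda
#include <cstdio>

__global__ void kernel() {
    extern __shared__ char smem[];  // one big slab of dynamic shared memory
    // ... kernel body ...
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Prefer the opt-in limit where available; on newer CCs exceeding the
    // default 48 KiB per block requires opting in via cudaFuncSetAttribute.
    size_t smemPerBlock = prop.sharedMemPerBlockOptin
                              ? prop.sharedMemPerBlockOptin
                              : prop.sharedMemPerBlock;
    cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize,
                         (int)smemPerBlock);

    // One block per SM, each claiming the SM's full shared-memory capacity,
    // so no two blocks can share an SM.
    kernel<<<prop.multiProcessorCount, 256, smemPerBlock>>>();
    cudaDeviceSynchronize();
    return 0;
}
```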
Another option is to launch more blocks than there are SMs, discover at runtime how the active blocks are distributed across SMs, and exit all but one block per SM.
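A sketch of that "oversubscribe and prune" idea, assuming each block identifies its SM by reading the %smid special register via inline PTX (the flag-array size and kernel body are placeholders):

```cuda
// Read this block's SM id from the %smid special register.
__device__ unsigned smId() {
    unsigned id;
    asm("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

// One claim flag per SM, zero-initialized; size it for the actual device.
__device__ int g_claimed[128];

__global__ void oneBlockPerSm() {
    __shared__ int keep;
    if (threadIdx.x == 0)
        keep = (atomicExch(&g_claimed[smId()], 1) == 0);  // first claimant wins
    __syncthreads();
    if (!keep) return;  // redundant block on an already-claimed SM: exit
    // ... the one surviving block per SM does the real work ...
}
```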
I notice this is getting complicated. But I have successfully used these techniques in the days of Compute Capability 1.x, when it was still possible to outperform the block scheduler with a custom implementation.
It feels really niche, or even misguided, when people start worrying about block scheduling (not saying that it can’t give performance improvements).
I have vague memories of working on the GT200 in 2009, where I managed to store previous block IDs in shared memory so they could be picked up by the next block scheduled there, to determine what it should execute on. Not very robust, to say the least :-)
I never tried passing data in shared memory between blocks. I didn’t have to, because with the custom block scheduler the block would only exit once all work was done.
IIRC the CC 1.x block scheduler was strictly round-robin, so anything that took actual load balance into account would beat it, even though it had to use some global memory atomics.
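The kind of load-balancing scheme described above can be sketched as a persistent-blocks work queue driven by a single global atomic counter; this is an illustrative reconstruction, not the original GT200 code, and the work-item processing is a placeholder:

```cuda
__device__ unsigned g_next = 0;  // index of the next unprocessed work item

__global__ void persistentWorker(float *data, unsigned numItems) {
    __shared__ unsigned item;
    for (;;) {
        if (threadIdx.x == 0)
            item = atomicAdd(&g_next, 1u);  // grab one item for the whole block
        __syncthreads();                    // broadcast the claimed index
        if (item >= numItems) return;       // queue drained: block exits
        // ... all threads of the block cooperatively process data[item] ...
        __syncthreads();                    // finish before claiming the next item
    }
}
```

Because each block pulls work only when it is free, short and long items balance out automatically, which a strictly round-robin static assignment cannot do.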
And I have no problem being described as niche - that’s where I’ve been almost all my life.