All the developer needs to take care of is this: spawn just enough blocks to keep all the MPs busy. That number varies from device to device, so the spawning logic must account for it.
That static scheduling technique might help too, but since it doesn’t dynamically assign work to idle SMs you’d still have a lot of inefficiency if one set of blocks happened to be a lot slower than others.
It also requires you to figure out at runtime exactly how many blocks can run simultaneously on your device, which is nontrivial (though certainly possible).
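For what it's worth, here is a rough sketch of how that runtime query could look. It assumes a newer CUDA toolkit that provides the occupancy API (`cudaOccupancyMaxActiveBlocksPerMultiprocessor`), which was not around for the earliest Fermi-era toolkits. The names `scaleKernel`, `maxResidentBlocks` and `launchScale` are made up for illustration, and the kernel is just a placeholder grid-stride loop, not anyone's actual workload.

```cuda
#include <cuda_runtime.h>

// Hypothetical placeholder kernel: a grid-stride loop, so a fixed number of
// blocks can cover any problem size n.
__global__ void scaleKernel(float *data, int n, float factor)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        data[i] *= factor;
}

// Returns how many blocks of scaleKernel can be resident on the whole device
// at once for the given block size, or -1 if any CUDA call fails.
int maxResidentBlocks(int device, int threadsPerBlock)
{
    cudaDeviceProp prop;
    if (cudaSetDevice(device) != cudaSuccess) return -1;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;

    // Resident blocks per multiprocessor for this kernel, given its register
    // usage and 0 bytes of dynamic shared memory.
    int blocksPerSM = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, scaleKernel, threadsPerBlock, 0) != cudaSuccess)
        return -1;

    return blocksPerSM * prop.multiProcessorCount;
}

// Launches scaleKernel on device 0 with just enough blocks to fill it.
// d_data must be a device pointer holding at least n floats.
// Returns the number of blocks launched, or -1 on failure.
int launchScale(float *d_data, int n, float factor)
{
    const int threadsPerBlock = 256;  // assumed block size
    int blocks = maxResidentBlocks(0, threadsPerBlock);
    if (blocks <= 0) return -1;

    scaleKernel<<<blocks, threadsPerBlock>>>(d_data, n, factor);
    if (cudaDeviceSynchronize() != cudaSuccess) return -1;
    return blocks;
}
```

Note this only sizes the grid. Work is still split statically across the blocks, so it does nothing about the load imbalance problem mentioned above.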
I’m surprised you state it does not work. I could understand (and might even expect) that with a little help from the hardware the new scheduling in Fermi outperforms the software implementation. But it’s hard to see why it should not work.