The effect of the hardware MP count is mostly hidden from you… you don’t really have to analyze how the device maps your blocks to multiprocessors. You CAN if you want to, by querying the device properties (cudaGetDeviceProperties reports multiProcessorCount), but you don’t have much control over the mapping, and in fact you shouldn’t try to play too many games with it.
The general way to allow scaling is to make sure you have lots of blocks in your grid. The overhead of a block launch is small (though admittedly non-negligible), but more blocks give the device finer granularity for running them in parallel.
So you may have 16 multiprocessors, and think that using 16 blocks is best… but maybe not! If you used 32 blocks, perhaps (depending on your kernel’s resource use) two blocks could run on each multiprocessor simultaneously, giving a speed boost. And if you do use 16 blocks, and 10 of them finish while 6 are still cooking, 10 multiprocessors will sit idle, waiting for the remaining 6.
If you have 400 blocks, this is all solved for you. You’ll get finer-grained block scheduling… the blocks just get queued up, and every MP stays busy until very nearly the end.
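If you want to play with the "10 finished, 6 still cooking" effect, here is a minimal host-side sketch in plain C++. It is a toy model and NOT how the real block scheduler is implemented: it assumes each MP runs one block at a time and that the next queued block goes to whichever MP frees up first. The names simulate_makespan and utilization are made up for this illustration.

```cpp
#include <algorithm>
#include <functional>
#include <queue>
#include <vector>

// Toy model (an assumption, not the real hardware scheduler): each MP runs
// one block at a time, and queued blocks are handed out in order to
// whichever MP becomes free first. Requires num_mps >= 1.
// Returns the time at which the last block finishes.
double simulate_makespan(const std::vector<double>& block_times, int num_mps) {
    // Min-heap holding the time at which each MP next becomes free.
    std::priority_queue<double, std::vector<double>, std::greater<double>> free_at;
    for (int i = 0; i < num_mps; ++i) free_at.push(0.0);

    double finish = 0.0;
    for (double t : block_times) {
        double start = free_at.top();  // earliest available MP
        free_at.pop();
        double end = start + t;
        free_at.push(end);
        finish = std::max(finish, end);
    }
    return finish;
}

// Fraction of total MP time spent on useful work (1.0 means no MP ever idles).
// Requires a non-empty block list with positive run times.
double utilization(const std::vector<double>& block_times, int num_mps) {
    double work = 0.0;
    for (double t : block_times) work += t;
    return work / (simulate_makespan(block_times, num_mps) * num_mps);
}
```

In this model, 16 blocks on 16 MPs where 10 blocks take 1 time unit and 6 take 2 units finish at time 2 with about 69% utilization. Cut every block into 25 smaller pieces (400 blocks, same total work) and the finish time drops to under 1.5 units, with utilization above 90%.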
Using more blocks is also robust to device changes. If you have 16 blocks, it makes a big difference whether your device has 12 MPs or 16: it could easily run at half speed on the 12 MP device, because 4 blocks have to wait for a second pass. But if you have 400 blocks, your idle MP overhead is negligible and you’ll be using both devices to their full advantage. And even when tomorrow Nvidia releases their new 72 MP monsterboard, your high-block project will be ready to run efficiently.
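The 12 vs 16 MP arithmetic can be written down in a few lines of C++. Again this is only a rough sketch under simplifying assumptions: every block takes the same time and each MP runs one block per "wave". The function names waves and wave_efficiency are hypothetical, chosen just for this example.

```cpp
// Number of passes ("waves") needed to run every block, assuming equal-length
// blocks and one block per MP per wave (a simplification of real hardware).
int waves(int num_blocks, int num_mps) {
    return (num_blocks + num_mps - 1) / num_mps;  // ceiling division
}

// Fraction of MP slots that actually hold a block, summed over all waves.
// 1.0 means no MP sits idle in any wave.
double wave_efficiency(int num_blocks, int num_mps) {
    return static_cast<double>(num_blocks) / (waves(num_blocks, num_mps) * num_mps);
}
```

With 16 blocks, waves(16, 16) is 1 but waves(16, 12) is 2, so the 12 MP device needs twice as many passes and only about 67% of its MP slots are filled. With 400 blocks, the 12 MP device needs 34 waves at about 98% efficiency, the 16 MP device is at 100%, and a hypothetical 72 MP device still comes in at about 93%.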
Now you did ask another question: how can you TEST your program running on different numbers of MPs? The answer is you should buy a variety of hardware. :-) I’m sure the GPU firmware could be modified to use only a subset of your MPs, but that’s not something we have easy access to.