You generally want more blocks, for efficiency, robustness, and scaling.
First, if your blocks each take a variable amount of time, then with few blocks you'll have idle SMs, since your kernel's runtime will be defined by its very slowest block while the SMs that finished early sit and wait.
If you have many more blocks than SMs, then on Fermi your kernel runtime will be defined by the AVERAGE runtime of all blocks (which is ideal).
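To see the effect, here's a toy simulation in plain Python (not CUDA): greedy list scheduling stands in for the GPU's block distributor, the worker slots stand in for SMs, and the block runtimes are made-up numbers. With one block per SM the makespan equals the slowest block; with many blocks per SM the slow blocks get averaged out across the machine.

```python
import heapq

def makespan(block_times, num_sms):
    """Greedy scheduling: each SM grabs the next block as soon as it frees up,
    a rough stand-in for how the GPU's work distributor assigns blocks."""
    sms = [0] * num_sms          # finish time of each SM
    heapq.heapify(sms)
    for t in block_times:
        free_at = heapq.heappop(sms)
        heapq.heappush(sms, free_at + t)
    return max(sms)

# 8 "SMs"; most blocks take 1 time unit, a few slow ones take 10.
few_blocks  = [1] * 7 + [10]            # one block per SM: one wave
many_blocks = [10] * 10 + [1] * 70      # ten blocks per SM, same 1:8 slow/fast mix

print(makespan(few_blocks, 8))    # 10 -> runtime pinned to the slowest block
print(makespan(many_blocks, 8))   # 22 -> close to total work / SMs (170/8 ~ 21.25)
```

With one wave, 7 of 8 SMs idle for 9 of the 10 time units (about 21% utilization); with ten blocks per SM, utilization is about 97%, driven by the average block time rather than the worst one.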
Second, more blocks allow better scaling across various hardware. Perhaps a new GPU has more SMs or can run more blocks per SM… but you hardwired the block count below that limit because you sized it for some other reference GPU. You then lose horsepower by leaving part of the GPU idle.
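A quick sketch of the arithmetic, in Python. The function names and the example counts (16 hardwired blocks, a 30-SM GPU, 256 threads per block) are my own illustrative assumptions; the point is the standard idiom of deriving the grid size from the problem size instead of from any particular GPU.

```python
def data_sized_grid(n, threads_per_block=256):
    """Grid size derived from the problem size (ceil-divide so every
    element gets a thread), independent of any GPU's SM count."""
    return (n + threads_per_block - 1) // threads_per_block

def busy_sms(num_blocks, num_sms):
    """SMs that receive at least one block (ignoring per-SM residency limits)."""
    return min(num_blocks, num_sms)

n = 1_000_000
print(busy_sms(16, 30))                  # 16 -> hardwired grid leaves 14 SMs idle
print(busy_sms(data_sized_grid(n), 30))  # 30 -> data-sized grid fills every SM
```

The data-sized grid also keeps working unchanged when a future GPU doubles its SM count; the hardwired one silently wastes the new hardware.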
Third, don’t be scared of high block counts. The overhead of launching a new block is quite small. While I have not timed it, it’s likely on the order of tens of clock cycles, not millions.
One last, very large (almost hypocritical) caveat: despite everything I just posted, I don’t actually follow my own “use lots of blocks” advice, because of GT200 block-scheduling inefficiencies… instead of letting the GPU do dynamic block assignment, I dynamically schedule my own work inside each block using atomic queues. This isn’t necessary on Fermi.
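The atomic-queue idea above can be sketched like this. This is a host-side Python analogy, not CUDA: the threads stand in for persistent blocks, and the lock-protected counter stands in for an `atomicAdd` on a global work counter. All names and the square-the-index "work" are invented for illustration.

```python
import threading

NUM_ITEMS = 1000
NUM_WORKERS = 8              # stand-ins for a fixed set of persistent blocks
results = [0] * NUM_ITEMS
counter = 0
lock = threading.Lock()      # stands in for atomicAdd on a global counter

def worker():
    """Each persistent 'block' loops, atomically claiming the next work item,
    until the queue is drained -- instead of launching one block per item."""
    global counter
    while True:
        with lock:                       # my_item = atomicAdd(&counter, 1)
            my_item = counter
            counter += 1
        if my_item >= NUM_ITEMS:         # queue empty; this 'block' retires
            return
        results[my_item] = my_item * my_item   # placeholder for real work

threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results[10])   # 100
```

The payoff is the same as in the averaging argument earlier: a fast "block" that drains its item quickly just grabs another, so variable work times balance out without relying on the hardware's block scheduler.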