Sometimes having fewer threads per block/SM/device might work better. There are several possible reasons; for example, with more threads there may be too few registers per thread. Another one I realised is that in some cases the cache might be overloaded.
The L2 cache on Fermi is 768 KB and cache lines are 128 bytes, so that is 6*1024 = 6144 cache lines. That is going to be plenty for most applications, but if you had an application where each thread works on a different area of a global array, then each thread would need its own cache line (or maybe even 2-3 cache lines). With an application that needs 1 cache line per thread, that means performance may be better when running fewer than 6144 threads at a time on the entire device, or a block size of less than 96 threads per block. The same reasoning applies to the L1 cache.
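To make the arithmetic easy to redo for other cache sizes, here is a small host-side sketch. It only uses the figures quoted in this post (768 KB L2, 128-byte lines); the names `cache_lines`, `max_threads_fitting`, `kFermiL2Bytes` and `kLineBytes` are made up for illustration, and the model assumes every thread touches its own distinct cache lines with no sharing, which is a simplification.

```cpp
#include <cstddef>

// Number of cache lines in a cache of `cache_bytes` bytes
// when each line is `line_bytes` bytes long.
constexpr std::size_t cache_lines(std::size_t cache_bytes,
                                  std::size_t line_bytes) {
    return cache_bytes / line_bytes;
}

// Largest number of concurrently running threads whose private working
// sets (`lines_per_thread` cache lines each, no sharing assumed) can all
// sit in the cache at the same time.
constexpr std::size_t max_threads_fitting(std::size_t cache_bytes,
                                          std::size_t line_bytes,
                                          std::size_t lines_per_thread) {
    return cache_lines(cache_bytes, line_bytes) / lines_per_thread;
}

// Figures taken from the post (assumed, not queried from a device).
constexpr std::size_t kFermiL2Bytes = 768 * 1024;
constexpr std::size_t kLineBytes = 128;
```

With these numbers, `max_threads_fitting(kFermiL2Bytes, kLineBytes, 1)` gives 6144, and with 3 lines per thread it drops to 2048.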
This supports what people have been saying: “make it easy to change the blocksize, so you can find what works best for your application”.
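One possible way to keep the block size easy to change is to treat it as a plain parameter and derive the grid size from it, rather than hard-coding both. The sketch below is host-side arithmetic only; `grid_size_for` and `kCandidateBlockSizes` are hypothetical names, and the candidate list is just an example, not a recommendation.

```cpp
#include <cstddef>

// Number of blocks needed so that blocks * block_size >= n
// (ceiling division). The caller must pass block_size > 0.
constexpr std::size_t grid_size_for(std::size_t n, std::size_t block_size) {
    return (n + block_size - 1) / block_size;
}

// Example block sizes to try (assumed values, multiples of the warp size).
constexpr std::size_t kCandidateBlockSizes[] = {32, 64, 96, 128, 192, 256};
```

In a real CUDA program you would loop over the candidates, launch the kernel with `<<<grid_size_for(n, block), block>>>`, and time each run to see which block size suits the application.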
My GPU is too old to microbenchmark this and confirm it.