Observation about performance change with grid size

Hi all,

I am using an 8800GT for an image processing algorithm. Originally, the grid size for my application was 128*128 with a block size of 8*8 (this block size gives the best performance), which covers the whole image (1024*1024). Each block processes one portion of the image in parallel.
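For reference, the launch configuration is roughly this (a sketch only; processImage, d_image, and launchFullGrid are placeholder names, not my exact code):

// Hypothetical sketch of the original configuration: one 8*8 block per 8*8 tile.
__global__ void processImage(float *image, int width);   // placeholder declaration

void launchFullGrid(float *d_image)
{
    dim3 block(8, 8);                // 8*8 = 64 threads per block (best-performing size)
    dim3 grid(1024 / 8, 1024 / 8);   // 128*128 blocks cover the 1024*1024 image
    processImage<<<grid, block>>>(d_image, 1024);
}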

Suppose the kernel run time in this configuration is X.
Now I am running some scalability tests on my algorithm, so I force the whole task to be done by a smaller number of blocks. For example, with just one block in the grid, that single block performs the work of all 128*128 blocks serially, using a for loop in the kernel (see the sketch just below).
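Concretely, the scaled-down kernel does something like this (a sketch of what I described above, not my exact code; processImageScaled and processPixel are made-up names, and the per-pixel work is a stand-in):

// Stand-in for the real per-pixel work done by each thread.
__device__ void processPixel(float *image, int x, int y, int width)
{
    image[y * width + x] *= 2.0f;
}

// Each of numBlocks blocks loops serially over its share of the
// 128*128 = 16384 tiles; each thread handles one pixel of an 8*8 tile.
__global__ void processImageScaled(float *image, int width, int numBlocks)
{
    int tilesPerRow = width / 8;                  // 128 tiles per row for width = 1024
    int totalTiles  = tilesPerRow * tilesPerRow;  // 16384 tiles in all
    for (int t = blockIdx.x; t < totalTiles; t += numBlocks) {
        int x = (t % tilesPerRow) * 8 + threadIdx.x;   // pixel column in the image
        int y = (t / tilesPerRow) * 8 + threadIdx.y;   // pixel row in the image
        processPixel(image, x, y, width);
    }
}

// Launched with the reduced grid, e.g. for numBlocks = 1, 2, 4, ... 64:
// processImageScaled<<<numBlocks, dim3(8, 8)>>>(d_image, 1024, numBlocks);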

As I increase the number of blocks, the run time scales as follows:

No. of blocks    Time taken
 1               25X
 2               13X
 4               6.5X
 8               3.3X
16               1.6X
32               X
64               0.75X

Question (finally :-) !!): Why is there a speed-up when I keep only 64 blocks and make the kernel repeat the work on different parts of the image? In other words, why do fewer blocks (64) outperform 128*128 blocks for the same job? Doesn't this conflict with the general notion that more blocks yield better performance?