Why does the best tile size differ when running on different GPUs?

The figure compares the GTX 260 and the GeForce 8800 GTS using different tile dimensions.
Why does the best performance always occur at a block height of 32 on the GeForce 8800 GTS, but at a block height of 4 on the GTX 260?
Each thread computes one pixel of the final image from 4 neighboring pixels of the source image.