I don’t think this has anything to do with partition camping… The effective pressure on the memory controllers would be the same… Partition camping hits you when you access columns in a row-major ordered structure or vice versa…
According to the NVIDIA Programming Guide, the dimensions of the block do not matter; only its size does. So there should be no difference between 32x8 and 16x16. If the warps in the 16x16 case were only half full, there would already be a performance loss at lower resolutions, which does not occur.
Furthermore, I do think partition camping is a problem in this kernel.
On an 8800GT there are 6 partitions (256 bytes wide) and 12 multiprocessors. The blocks get scheduled in row-major fashion.
With 256 threads per block and a limit of 768 resident threads per multiprocessor, 3 blocks fit on each multiprocessor. So 36 blocks will be scheduled at once.
With 16x16 blocks: the first 24 blocks will read from partitions 1-6. The last 12 will read from partitions 1-3.
With 32x8 blocks: the first 12 blocks will read from partitions 1-6, and so will the next 12 and the 12 after that.
With 16x16 blocks, reads will therefore always be concentrated on partitions 1-3, while with 32x8 blocks they are spread evenly. I think that is what causes the kernel to slow down at larger resolutions.
EDIT: I tested with 16x8 blocks and now I get the same result as in the 32x8 case. Because the number of blocks per multiprocessor doubles, the reads are less concentrated. This supports the hypothesis that partition camping is to blame in this case.
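
To make the counting above easier to check, here is a rough sketch in Python (host-side arithmetic only, not CUDA). It only models which partition each resident block starts reading from. The 6 partitions, 256-byte width and 12 multiprocessors are the figures from this post. The 768-thread limit per multiprocessor and the 4-byte element size are my assumptions, as is the simplification that the whole first wave of blocks lies in the first block row. The function name `partition_load` is made up for this illustration.

```python
from collections import Counter

# Figures from the post:
NUM_PARTITIONS = 6        # memory partitions
PARTITION_WIDTH = 256     # bytes per partition
NUM_SMS = 12              # multiprocessors

# Assumptions of this sketch (not stated in the post):
MAX_THREADS_PER_SM = 768  # limit that gives 3 blocks of 256 threads
ELEMENT_SIZE = 4          # bytes per element, e.g. float


def partition_load(block_w, block_h):
    """Count how many resident blocks start reading in each partition.

    Assumes all resident blocks lie in the first block row (row-major
    scheduling, wide image) and that a block's reads begin at its own
    x offset within the row.
    """
    blocks_per_sm = MAX_THREADS_PER_SM // (block_w * block_h)
    resident = blocks_per_sm * NUM_SMS
    load = Counter()
    for bx in range(resident):
        byte_offset = bx * block_w * ELEMENT_SIZE
        partition = (byte_offset // PARTITION_WIDTH) % NUM_PARTITIONS
        load[partition + 1] += 1  # 1-based numbering, as in the post
    return resident, dict(sorted(load.items()))


if __name__ == "__main__":
    for shape in [(16, 16), (32, 8), (16, 8)]:
        resident, load = partition_load(*shape)
        print(f"{shape[0]}x{shape[1]}: {resident} resident blocks, load per partition {load}")
```

Under these assumptions the model gives 8 blocks each on partitions 1-3 and 4 each on partitions 4-6 for 16x16, an even 6 per partition for 32x8, and an even 12 per partition for 16x8. It ignores other occupancy limits (registers, shared memory, maximum blocks per multiprocessor) and the actual access pattern of the kernel, so treat it as a sanity check of the argument rather than a prediction of performance.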