Grid dimensions affect performance?! How is this possible?

I wrote a CUDA kernel that executes in 28 ms on a GT 240 (12 multiprocessors).
I decided to scale up the performance by running the same kernel on a Tesla C1060 (30 MPs). To my dismay, the GPU time was LONGER: 35 ms.

The launch configuration was as follows:
Block size: 16 x 16
Grid size: 64 x 1024

I changed the grid dimensions from 64 x 1024 to 1024 x 64, and the GPU time dropped from 35 ms to 10 ms. This is the kind of performance I was expecting in the first place. But it raises the question:

How can the grid dimensions affect performance? It can’t be a memory coalescing issue since that is related to access operations WITHIN a warp. Any suggestions as to what’s going on??
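
For reference, here is a minimal, self-contained sketch of the two launch configurations. The kernel body is just a stand-in (my real kernel does more work), and the row-major index math is an assumption for illustration:

// Stand-in kernel: each thread touches one element of a row-major 2D array
// whose width is derived from the grid, so both grid shapes below cover the
// same total number of elements.
__global__ void dummyKernel(float *data)
{
    int x     = blockIdx.x * blockDim.x + threadIdx.x;
    int y     = blockIdx.y * blockDim.y + threadIdx.y;
    int width = gridDim.x * blockDim.x;
    data[y * width + x] += 1.0f;
}

int main()
{
    const size_t numElems = 64 * 1024 * 16 * 16;   // identical for both grids (64 MB)
    float *d_data;
    cudaMalloc((void **)&d_data, numElems * sizeof(float));
    cudaMemset(d_data, 0, numElems * sizeof(float));

    dim3 block(16, 16);
    dim3 gridSlow(64, 1024);   // the 35 ms configuration on the C1060
    dim3 gridFast(1024, 64);   // the 10 ms configuration on the C1060

    dummyKernel<<<gridSlow, block>>>(d_data);
    dummyKernel<<<gridFast, block>>>(d_data);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}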

Thanks!

Aside from an actual bug in the CUDA scheduler, the only other idea I can come up with would involve caching behavior. Do you access textures or constant memory in your kernel?

Did you modify your memory layout? If so, you may be hitting partition camping.

I don’t know how to modify the memory layout, so I doubt this is the problem.

Ah, you may be right! I forgot that I do access a texture in this kernel. I will try to reproduce this behavior using global memory instead of texture memory. If I can reproduce it, I will definitely get back to you guys. If I cannot, I will assume the behavior is related to caching.

Thanks to all of you.

Yeah, my first thought was partition camping as well. I posted a similar issue some months ago ( http://forums.nvidia.com/index.php?showtopic=106924 ) which turned out to be caused by partition camping.

Check the transposeNew project in the SDK for some great examples and good documentation on this issue…

Thanks guys,
Although I initially suspected caching behavior, I can CONFIRM that the problem was partition camping.

Block-column-wise vs. block-row-wise does make a difference! In the block-column-wise implementation, all the multiprocessors end up hitting the same global memory partition. This partition camping problem is worse for GPUs with a large number of multiprocessors (e.g. Tesla).
Partition camping was almost negligible on the GT 240 for this particular kernel.
Moral of the story: arrange your thread blocks in a row-wise manner.
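
For anyone who finds this later: besides simply launching blocks row-wise, the transposeNew example mentioned above sidesteps camping by remapping block indices diagonally. A rough sketch of that idea (the kernel body and names are illustrative, and it assumes a square grid):

// Diagonal block reordering (after the SDK transposeNew example): blocks that
// are scheduled close together in time get spread across different memory
// partitions instead of all starting in the same one.
__global__ void diagonalKernel(float *data, int width)
{
    int blockIdx_y = blockIdx.x;
    int blockIdx_x = (blockIdx.x + blockIdx.y) % gridDim.x;

    int x = blockIdx_x * blockDim.x + threadIdx.x;
    int y = blockIdx_y * blockDim.y + threadIdx.y;
    data[y * width + x] += 1.0f;   // stand-in for the real per-thread work
}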

It has nothing to do with the number of multiprocessors and everything to do with the number of memory banks. I am willing to bet that your column-major-order kernel runs much faster on a GTX 275 or GTX 295 (both 240 cores, just like a Tesla C1060) than on the C1060. The reason is that the GTX 275/295 have 7 x 64-bit memory banks (448-bit interface), while the C1060 has 8 x 64-bit banks (512-bit interface). It is the combination of 8 memory banks and the column-major access of half-warps (16 threads) that causes the camping phenomenon, not the MP count.
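
If I remember the white paper correctly, addresses are dealt out to the partitions in 256-byte chunks, so you can estimate which partition a block's first access lands in with a quick host-side calculation. A back-of-the-envelope sketch (the 256-byte partition width and the 1024-float row pitch are assumptions for illustration):

#include <cstdio>

// Which partition does a global memory byte offset fall into, assuming the
// address space is interleaved across partitions in 256-byte chunks?
int partitionOf(unsigned long byteOffset, int numPartitions)
{
    return (int)((byteOffset / 256) % numPartitions);
}

int main()
{
    const unsigned long rowPitch = 1024 * sizeof(float);   // hypothetical 1024-float rows

    // First address touched by each 16x16 tile when the tiles are stacked
    // column-wise (each tile starts 16 rows below the previous one).
    for (int tile = 0; tile < 6; ++tile) {
        unsigned long addr = tile * 16UL * rowPitch;
        printf("tile %d: C1060 (8 partitions) -> %d, GTX 275 (7 partitions) -> %d\n",
               tile, partitionOf(addr, 8), partitionOf(addr, 7));
    }
    // With 8 partitions every tile starts in partition 0 (camping); with 7
    // partitions the starting addresses spread out across the partitions.
    return 0;
}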

Well, I don’t experience partition camping on the GT 220 and GT 240, both with 128-bit interfaces (2 x 64-bit banks). If 8 memory banks are a problem, then 2 banks should be too, since 8 is a multiple of 2. Therefore, in principle, the GT 220 and GT 240 should also suffer. In practice, they do not suffer significantly. These observations are not explained by your reasoning.

It isn’t my reasoning, it is NVIDIA’s (explained in the transpose white paper, IIRC). But think about it for a second: if you have partition camping on a device with two memory banks, the worst-case maximum bandwidth is 50% of the theoretical peak. If you have an 8-bank interface, the worst-case maximum bandwidth is 12.5% of the theoretical peak. Which case do you think might have a more noticeable effect on kernel performance?
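
A trivial sketch just to put those two fractions side by side (partition counts only; no real bandwidth figures assumed):

#include <cstdio>

// Worst-case fraction of peak bandwidth when every in-flight access camps on
// a single partition: only 1 of N partitions is ever busy.
int main()
{
    const int partitionCounts[] = { 2, 8 };   // e.g. GT 240 vs. Tesla C1060
    for (int i = 0; i < 2; ++i)
        printf("%d partitions: worst case = %.1f%% of peak bandwidth\n",
               partitionCounts[i], 100.0 / partitionCounts[i]);
    return 0;
}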