Grid dimensions affect performance?! How is this possible?

I wrote a CUDA kernel that executes in 28 ms on a GT 240 (12 multiprocessors).
I decided to scale up the performance by running the same kernel on a Tesla C1060 (30 MPs). To my dismay, the GPU time was LONGER: 35 ms.

The launch configuration was as follows:
Block size: 16 x 16
Grid size: 64 x 1024

I changed the grid dimensions from 64 x 1024 to 1024 x 64, and the GPU time dropped from 35 ms to 10 ms. This is the kind of performance I was expecting in the first place. But it raises the question:

How can the grid dimensions affect performance? It can’t be a memory coalescing issue since that is related to access operations WITHIN a warp. Any suggestions as to what’s going on??
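
For reference, here is a minimal, self-contained sketch of the two launch configurations. The kernel body is just a stand-in (my real kernel does more work), and the row-major index math is an assumption for illustration:

// Stand-in kernel: each thread touches one element of a row-major 2D array
// whose width is derived from the grid, so both grid shapes below cover the
// same total number of elements.
__global__ void dummyKernel(float *data)
{
    int x     = blockIdx.x * blockDim.x + threadIdx.x;
    int y     = blockIdx.y * blockDim.y + threadIdx.y;
    int width = gridDim.x * blockDim.x;
    data[y * width + x] += 1.0f;
}

int main()
{
    const size_t numElems = 64 * 1024 * 16 * 16;   // identical for both grids (64 MB)
    float *d_data;
    cudaMalloc((void **)&d_data, numElems * sizeof(float));
    cudaMemset(d_data, 0, numElems * sizeof(float));

    dim3 block(16, 16);
    dim3 gridSlow(64, 1024);   // the 35 ms configuration on the C1060
    dim3 gridFast(1024, 64);   // the 10 ms configuration on the C1060

    dummyKernel<<<gridSlow, block>>>(d_data);
    dummyKernel<<<gridFast, block>>>(d_data);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}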

Thanks!

Aside from an actual bug in the CUDA scheduler, the only other idea I can come up with would involve caching behavior. Do you access textures or constant memory in your kernel?

Did you modify your memory layout? If so, you may be hitting partition camping.

I don’t know how to modify the memory layout, so I doubt this is the problem.

Ah, you may be right! I forgot that I do access a texture in this kernel. I will try to reproduce this behavior using global memory instead of texture memory. If I can reproduce it, I will definitely get back to you guys. If I cannot, I will assume the behavior is related to caching.

Thanks to all of you.

Yeah, my first thought was partition camping as well. I posted a similar issue some months ago ( http://forums.nvidia.com/index.php?showtopic=106924 ) which turned out to be caused by partition camping.

Check the transposeNew project in the SDK for some great examples and good documentation on this issue…

Thanks guys,
Although I initially suspected caching behavior, I can CONFIRM that the problem was partition camping.

Block-column-wise vs. block-row-wise does make a difference! In the block-column-wise implementation, all the multiprocessors end up hitting the same global memory partition. This partition camping problem is worse for GPUs with a large number of multiprocessors (e.g. Tesla).
Partition camping was almost negligible on the GT 240 for this particular kernel.
Moral of the story: arrange your thread blocks in a row-wise manner.
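
For anyone who finds this later: besides simply launching blocks row-wise, the transposeNew example mentioned above sidesteps camping by remapping block indices diagonally. A rough sketch of that idea (the kernel body and names are illustrative, and it assumes a square grid):

// Diagonal block reordering (after the SDK transposeNew example): blocks that
// are scheduled close together in time get spread across different memory
// partitions instead of all starting in the same one.
__global__ void diagonalKernel(float *data, int width)
{
    int blockIdx_y = blockIdx.x;
    int blockIdx_x = (blockIdx.x + blockIdx.y) % gridDim.x;

    int x = blockIdx_x * blockDim.x + threadIdx.x;
    int y = blockIdx_y * blockDim.y + threadIdx.y;
    data[y * width + x] += 1.0f;   // stand-in for the real per-thread work
}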

It has nothing to do with the number of multiprocessors and everything to do with the number of memory banks. I am willing to bet that your column-major-order kernel runs much faster on a GTX 275 or GTX 295 (both 240 cores, just like a Tesla C1060) than on the C1060. The reason is that the GTX 275/295 have 7 x 64-bit memory banks (448-bit interface), while the C1060 has 8 x 64-bit banks (512-bit interface). It is the combination of 8 memory banks and the column-major access of half-warps (16 threads) that causes the camping phenomenon, not the MP count.
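
If I remember the white paper correctly, addresses are dealt out to the partitions in 256-byte chunks, so you can estimate which partition a block's first access lands in with a quick host-side calculation. A back-of-the-envelope sketch (the 256-byte partition width and the 1024-float row pitch are assumptions for illustration):

#include <cstdio>

// Which partition does a global memory byte offset fall into, assuming the
// address space is interleaved across partitions in 256-byte chunks?
int partitionOf(unsigned long byteOffset, int numPartitions)
{
    return (int)((byteOffset / 256) % numPartitions);
}

int main()
{
    const unsigned long rowPitch = 1024 * sizeof(float);   // hypothetical 1024-float rows

    // First address touched by each 16x16 tile when the tiles are stacked
    // column-wise (each tile starts 16 rows below the previous one).
    for (int tile = 0; tile < 6; ++tile) {
        unsigned long addr = tile * 16UL * rowPitch;
        printf("tile %d: C1060 (8 partitions) -> %d, GTX 275 (7 partitions) -> %d\n",
               tile, partitionOf(addr, 8), partitionOf(addr, 7));
    }
    // With 8 partitions every tile starts in partition 0 (camping); with 7
    // partitions the starting addresses spread out across the partitions.
    return 0;
}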

Well, I don’t experience partition camping on the GT 220 and GT 240, both with 128-bit interfaces (2 x 64-bit banks). If 8 memory banks are a problem, then 2 banks should be too, since 8 is a multiple of 2. Therefore, in principle, the GT 220 and GT 240 should also suffer. In practice, they do not suffer significantly. These observations are not explained by your reasoning.

It isn’t my reasoning, it is NVIDIA’s (explained in the transpose white paper, IIRC). But think about it for a second: if you have partition camping on a device with two memory banks, the worst-case maximum bandwidth is 50% of the theoretical peak. If you have an 8-bank interface, the worst-case maximum bandwidth is 12.5% of the theoretical peak. Which case do you think might have a more noticeable effect on kernel performance?
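
A trivial sketch just to put those two fractions side by side (partition counts only; no real bandwidth figures assumed):

#include <cstdio>

// Worst-case fraction of peak bandwidth when every in-flight access camps on
// a single partition: only 1 of N partitions is ever busy.
int main()
{
    const int partitionCounts[] = { 2, 8 };   // e.g. GT 240 vs. Tesla C1060
    for (int i = 0; i < 2; ++i)
        printf("%d partitions: worst case = %.1f%% of peak bandwidth\n",
               partitionCounts[i], 100.0 / partitionCounts[i]);
    return 0;
}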