Global memory access order

I have run into an effect with global memory that I cannot explain. Maybe someone here on the forum can shed some light onto this.

What the attached test case does is copy memory from one array to another. Never mind the efficiency of doing it this way - it is just a small program to demonstrate an odd effect. There are two kernels, both doing identical work. For each kernel, 60 x 16 warps are executed, totally saturating the SMs of my GTX 280 card, yielding 100% occupancy. Every warp loops 2048 times, copying a chunk of memory at a time.

The only difference between the kernels is how warps map to memory locations. In kernel A, the warps copy consecutive memory locations (essentially a row of the array) and then move on to the next row:

[codebox]Processing order (W = warp, T = time slot):

+--------+--------+--------+--------+
| W0, T0 | W1, T0 | W2, T0 | W3, T0 |
+--------+--------+--------+--------+  ||
| W0, T1 | W1, T1 | W2, T1 | W3, T1 |  ||
+--------+--------+--------+--------+ Time
| W0, T2 | W1, T2 | W2, T2 | W3, T2 |  ||
+--------+--------+--------+--------+  ||
| W0, T3 | W1, T3 | W2, T3 | W3, T3 |  \/
+--------+--------+--------+--------+[/codebox]

In kernel B, the warps copy a column of the array and then move on to the next column:

[codebox]Processing order (W = warp, T = time slot):

+--------+--------+--------+--------+
| W0, T0 | W0, T1 | W0, T2 | W0, T3 |
+--------+--------+--------+--------+
| W1, T0 | W1, T1 | W1, T2 | W1, T3 |
+--------+--------+--------+--------+  == Time ==>
| W2, T0 | W2, T1 | W2, T2 | W2, T3 |
+--------+--------+--------+--------+
| W3, T0 | W3, T1 | W3, T2 | W3, T3 |
+--------+--------+--------+--------+[/codebox]

There is no difference in coalescing. Any set of 32 copy operations handled simultaneously by one warp in kernel A is handled, identically and simultaneously, by one warp in kernel B. In fact, each full set of 512 copy operations handled by one block in kernel A is handled by a single block in kernel B as well.

The only thing that differs is the order in which these sets of 512 copy operations are executed. The CUDA documentation never mentions that this should make any difference. Still, the run times of the two kernels are vastly different:

Kernel A execution time: 4.88595

Kernel B execution time: 12.4975

Does anyone have any clue what is going on here? Why does the order in which the copy operations are executed make such a huge difference? Is this some kind of undocumented property of the memory controller?
testcase.cu (2.02 KB)

Sounds like partition camping. There is a discussion of it in the Best Practices guide, but the short version is that memory is divided into partitions (banks), and, depending on the access pattern and memory configuration, it is possible to write code that ends up hitting one partition much more often than the others, which reduces memory controller throughput. The column-major version of your kernel is a candidate, depending on exactly what the code does and how your card is configured.

I have read all the documentation there is many times over. For the life of me, I cannot find any mention of partition camping. Can you point me at the relevant section in the Best Practices guide? In the guide that ships with the CUDA 3.0 beta (which is actually still the CUDA 2.3 guide), searching the PDF for "partition" or "camping" returns nothing :(.

Edit: A web search uncovered a bit of documentation on partition camping. Why such an important effect would not be documented is beyond me :(.