I have run into an effect with global memory that I cannot explain. Maybe someone here on the forum can shed some light on this.
The attached test case copies memory from one array to another. Never mind the efficiency of doing it this way; it is just a small program to demonstrate an odd effect. There are two kernels, both doing identical work. For each kernel, 60 x 16 warps are executed, which fully saturates the SMs of my GTX 280 card and yields 100% occupancy. Every warp loops 2048 times, copying one chunk of memory per iteration.
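For reference, the occupancy arithmetic behind those numbers can be sketched on the host as below. The launch figures (60 blocks of 16 warps) come from the description above; the hardware limits (30 SMs, 1024 resident threads per SM) are my assumptions for a GTX 280 and are not taken from the test case. The sketch also ignores register and shared-memory limits, which could lower the real occupancy.

```cpp
// Launch figures from the post: 60 blocks, 16 warps per block.
constexpr int kBlocks = 60;
constexpr int kWarpsPerBlock = 16;
constexpr int kWarpSize = 32;

// ASSUMPTION: GTX 280 limits (compute capability 1.3), not stated in the post.
constexpr int kSMs = 30;
constexpr int kMaxThreadsPerSM = 1024;

// Threads in one block: 16 warps of 32 threads each.
constexpr int threads_per_block() { return kWarpsPerBlock * kWarpSize; }

// Total warps launched by one kernel.
constexpr int launched_warps() { return kBlocks * kWarpsPerBlock; }

// Warps the whole chip can keep resident at once under the assumed limits.
constexpr int resident_warp_capacity() { return kSMs * kMaxThreadsPerSM / kWarpSize; }

// Occupancy in percent, considering only the thread-count limit.
constexpr int occupancy_percent() {
    return 100 * launched_warps() / resident_warp_capacity();
}
```

Under these assumptions, `launched_warps()` and `resident_warp_capacity()` agree, so all warps are resident at the same time and `occupancy_percent()` matches the 100% figure quoted above.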
The only difference between the kernels is how warps map to memory locations. In kernel A, the warps copy consecutive memory locations (essentially a row of the array) and then move on to the next row:
[codebox]Processing order (W = warp, T = time slot):
+--------+--------+--------+--------+
| W0, T0 | W1, T0 | W2, T0 | W3, T0 |
+--------+--------+--------+--------+    ||
| W0, T1 | W1, T1 | W2, T1 | W3, T1 |    ||
+--------+--------+--------+--------+   Time
| W0, T2 | W1, T2 | W2, T2 | W3, T2 |    ||
+--------+--------+--------+--------+    ||
| W0, T3 | W1, T3 | W2, T3 | W3, T3 |    \/
+--------+--------+--------+--------+[/codebox]
In kernel B, the warps copy a column of the array and then move on to the next column:
[codebox]Processing order (W = warp, T = time slot):
+--------+--------+--------+--------+
| W0, T0 | W0, T1 | W0, T2 | W0, T3 |
+--------+--------+--------+--------+
| W1, T0 | W1, T1 | W1, T2 | W1, T3 |
+--------+--------+--------+--------+   == Time ==>
| W2, T0 | W2, T1 | W2, T2 | W2, T3 |
+--------+--------+--------+--------+
| W3, T0 | W3, T1 | W3, T2 | W3, T3 |
+--------+--------+--------+--------+[/codebox]
There is no difference in coalescing. Any set of 32 copy operations that one warp handles simultaneously in kernel A is handled simultaneously by one warp in kernel B as well. In fact, each set of 512 copy operations handled by one block in kernel A is also handled by one block in kernel B.
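The two orderings in the diagrams can be modeled on the host roughly as follows. This is only a sketch of the mapping as I describe it above, not the code from the attached test case: the 4 x 4 grid size, the row-major layout, and all function names are hypothetical, and one "chunk" stands for the memory one warp copies in one iteration.

```cpp
// ASSUMPTION: a 4 x 4 grid of chunks as in the diagrams, stored row-major.
constexpr int kWarps = 4;  // warps shown in the diagrams (W0..W3)
constexpr int kSteps = 4;  // time slots shown in the diagrams (T0..T3)

// Linear index of the chunk at (row, col) in a row-major grid with `cols` columns.
constexpr int chunk_index(int row, int col, int cols) { return row * cols + col; }

// Kernel A: at time slot t the warps together cover row t; warp w takes column w.
constexpr int chunk_a(int w, int t) { return chunk_index(t, w, kWarps); }

// Kernel B: at time slot t the warps together cover column t; warp w takes row w.
constexpr int chunk_b(int w, int t) { return chunk_index(w, t, kSteps); }

// Distance, in chunks, between what neighboring warps touch in the same time slot.
constexpr int warp_stride_a() { return chunk_a(1, 0) - chunk_a(0, 0); }
constexpr int warp_stride_b() { return chunk_b(1, 0) - chunk_b(0, 0); }
```

In this model every chunk is copied exactly once by both kernels, since `chunk_a(w, t)` equals `chunk_b(t, w)` on the square grid. What changes is that warps running in the same time slot touch adjacent chunks in kernel A but chunks a full row apart in kernel B. Whether that explains the timing difference is exactly what I do not know.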
The only thing that differs is the order in which these sets of 512 copy operations are executed. The CUDA documentation never mentions that this makes any difference. Still, the run times for the two kernels are vastly different:
Kernel A execution time: 4.88595
Kernel B execution time: 12.4975
Does anyone have any clue what is going on here? Why does the order in which the copy operations are executed make such a huge difference? Is this some kind of undocumented property of the memory controller?
testcase.cu (2.02 KB)