I’m using a GPU with compute capability 2.1.
The CUDA Programming Guide says that for a block of size (Dx, Dy, Dz), the thread ID of the thread with index (x, y, z) is x + y*Dx + z*Dx*Dy, i.e. the thread ID increases consecutively along the x-direction first, then along y, then along z.
But in my test code I assumed the thread ID increases consecutively along the z-direction, so to get coalescing I had the thread with 3D index (x, y, z) within the block read u[z + y*Dz + x*Dz*Dy], where u is an array of floats.
As a result, two threads with consecutive IDs actually access memory locations that are Dz*Dy elements apart in u, so those accesses should be almost completely uncoalesced.
But surprisingly, I saw little difference in performance compared with code using the correct thread indexing as defined in the programming guide.
Can someone explain the reason here?
Thanks in advance!