thread ID confusion

I’m using a GPU with compute capability 2.1.

In the CUDA Programming Guide, it says that for a block of size (Dx, Dy, Dz), the thread ID of the thread with index (x, y, z) is x + y*Dx + z*Dx*Dy, which means the thread ID increases consecutively along the x-direction first, then along the y- and z-directions.

But in my test code I assumed the thread ID increases consecutively along the z-direction, so to get coalescing I had the thread with 3D index (x, y, z) within the block read u[z + y*Dz + x*Dz*Dy], where u is an array of floats.

As a result, two threads with consecutive IDs actually access memory locations that are Dz*Dy elements apart in the array u, so the memory accesses should be almost completely uncoalesced.

But surprisingly, I didn’t see much difference in performance compared with code that uses the correct thread indexing as defined in the programming guide.

Can someone explain the reason here?

Thanks in advance!

Since you are using a compute capability 2.1 device, it is possible that the L1/L2 cache is hiding some of the non-coalesced memory access penalty. On Fermi-class GPUs, global loads are cached (in 128-byte lines) by default, so lines fetched for one warp can be reused by others, softening the cost of strided access patterns.

Yeah it seems highly likely, thank you very much for your insights!