I am processing a 3-D volume of data. The data is laid out so that x varies fastest and z slowest: linear address = z * nx * ny + y * nx + x.
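To make the layout concrete, here is a minimal host-side sketch of that indexing. The names `linear_index`, `Strides` and `axis_strides` are illustrative only and are not taken from the actual kernels.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: flatten (x, y, z) into a linear offset for a volume
// stored with x varying fastest and z slowest.
inline std::size_t linear_index(std::size_t x, std::size_t y, std::size_t z,
                                std::size_t nx, std::size_t ny) {
    assert(x < nx && y < ny);  // z is bounded by nz, which this helper does not need
    return z * nx * ny + y * nx + x;
}

// Distance, in elements, between neighbours along each axis.
struct Strides {
    std::size_t x, y, z;
};

inline Strides axis_strides(std::size_t nx, std::size_t ny) {
    return Strides{1, nx, nx * ny};
}
```

With this layout, one step along x moves 1 element, one step along y moves nx elements, and one step along z moves nx * ny elements.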

I have three different kernels that access the data along each axis. Each block accesses the data while varying only one of x, y or z. In other words, one kernel processes a vector nz long for every (x,y), another a vector ny long for every (x,z), and another a vector nx long for every (y,z). The kernels are virtually identical; they differ only in how they move data from global memory to shared memory.
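As a rough illustration of which elements each kind of vector touches, the sketch below lists the linear offsets of one vector along a chosen axis. It is plain host-side C++ under the layout given above. The `Axis` enum, the `vector_offsets` function, and the meaning of its `a` and `b` arguments are assumptions made for this example and say nothing about how the real kernels assign threads to elements.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Axis { X, Y, Z };  // hypothetical: the coordinate that varies

// Offsets of every element of one vector along `axis`.
// The two fixed coordinates are passed as (a, b):
//   Axis::X -> (a, b) = (y, z), vector is nx long
//   Axis::Y -> (a, b) = (x, z), vector is ny long
//   Axis::Z -> (a, b) = (x, y), vector is nz long
std::vector<std::size_t> vector_offsets(Axis axis, std::size_t a, std::size_t b,
                                        std::size_t nx, std::size_t ny,
                                        std::size_t nz) {
    const std::size_t len = axis == Axis::X ? nx : axis == Axis::Y ? ny : nz;
    std::vector<std::size_t> out;
    out.reserve(len);
    for (std::size_t i = 0; i < len; ++i) {
        std::size_t x, y, z;
        switch (axis) {
            case Axis::X: x = i; y = a; z = b; break;
            case Axis::Y: x = a; y = i; z = b; break;
            default:      x = a; y = b; z = i; break;
        }
        assert(x < nx && y < ny && z < nz);
        out.push_back(z * nx * ny + y * nx + x);  // same formula as above
    }
    return out;
}
```

For example, `vector_offsets(Axis::Y, 1, 1, 4, 3, 2)` walks the y-vector at x = 1, z = 1 in a 4 x 3 x 2 volume. Consecutive entries are nx elements apart, whereas a z-vector's entries are nx * ny apart.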

I get different times for each of the three kernels as follows:

Kernel z gputime 897319

Kernel y gputime 277600

Kernel x gputime 169130

I know the kernel that works with varying x will be the fastest since the global memory accesses can be coalesced and the one with varying z will be the slowest - that’s supported by the data above.

What I don’t understand is why the time for the kernel with varying y falls in between those of the x and z kernels. Its data is still far apart in global memory, so I would expect it to behave just like the z kernel and take about the same time. Why doesn’t it?

Thanks,