I am having to work on 3d data sets of size 60x256x256. To manipulate the array elements, I used a simple cudaMalloc() and cudaMemcpy() {not the 3d functions: does it make that much of a difference?}. Now, to access the elements, I made a block of size (16,16,1) & grid of size [(256/16)*60, 256/16, 1]. Then within the kernel, I wrote:

[codebox]int x,y,z,idx;

y = (blockIdx.x*blockDim.x + threadIdx.x)%256;
x = blockIdx.y*blockDim.y + threadIdx.y;
z = (blockIdx.x*blockDim.x)/256;
idx = z*256*256 + y*256 + x;[/codebox]

And then with the constraint that if(x<256 && y<256 && z<60) and using idx as index, I am accessing the matrix elements. However, I am getting sub-optimal performance (~2x for matrix subtraction: only the kernel - copying time to device not taken into account) in my program and I suspect it might be due to this kind of addressing. Is this kind of accessing coalesced? Is there a better way to access the elements? And is the block size right for my data size? Thanks a lot in advance!

In a half warp, threadIdx.x values are from 0 to 15 => your x and z are constant in each half warp while y is not.

So, if you access memory with “idx” as index value for your array, the global memory accesses won’t be coalesced at all : a half warp will read from array[idx0], , array[idx0 + 256], array[idx0 + 2256], …, array[idx0 + 15256].

Basic rule : threads in a half warp must read contiguous memory locations, it will ensure “not too bad performance” on 1.3 or higher cards.

If you want best performances, you’ll have to use cudaMalloc3D to get aligned accesses (=> fully coalesced if contiguous accesses)

OK. Thanks. I think I got it. For applications involving differences, one doesn’t need the three dimensions at all! I treated them as simple one-D arrays and bingo! the speedup increases dramatically. Thanks a lot!

You could consider 3D textures as well – with the spatial locality out there – you can get even better performance – especially the texture caching will emulate the “blocking” mechanism and if u look @ ur matrices as “blocks”, it is going to work straight away without u having to explicitly block them in shared memory…