I have a grid set with dim3 GridDim(1,128,1).
Each block is set with dim3 BlockDim(16,1,1).
I allocate a 3D cuda array sized (x=512, y=129, z=128).
Each block writes into its respective 2D “sub”-array of the 3D array (the sub-array is selected with blockIdx.y).
Each sub-array lies in the x-z plane; there are 129 sub-arrays in total (this first kernel fills 128 of them).
Each thread writes into a cell of that particular sub-array.
The 16 threads of a block write into the sub-array in a coalesced way along the x axis (thread k writes into cell x=k).
Writes are looped until each of the 128 blocks has finished filling its own sub-array.
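Roughly, the write kernel looks like this (a sketch only, assuming the array is pitched linear memory from cudaMalloc3D rather than a cudaArray; compute_value is a hypothetical placeholder):

```
// Allocation, for reference:
//   cudaPitchedPtr p;
//   cudaMalloc3D(&p, make_cudaExtent(512 * sizeof(float), 129, 128));

// Hypothetical placeholder for whatever each cell actually stores.
__device__ float compute_value(int x, int y, int z) { return (float)(x + y + z); }

// Launched as: fillSlices<<<dim3(1, 128, 1), dim3(16, 1, 1)>>>(p);
__global__ void fillSlices(cudaPitchedPtr p)
{
    const int y = blockIdx.y;                 // one block per x-z sub-array
    for (int z = 0; z < 128; ++z) {
        float *row = (float *)((char *)p.ptr + (z * 129 + y) * p.pitch);
        // Thread k writes x = k, then k+16, ...: coalesced along x.
        for (int x = threadIdx.x; x < 512; x += blockDim.x)
            row[x] = compute_value(x, y, z);
    }
}
```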
When that is done, another kernel is launched with 64 blocks of 1024 threads.
The goal of that kernel is to fill the last (the 129th) sub-array along the y axis.
Each thread receives its own (x,z) coordinates and reads through the y axis. Therefore the reads should be coalesced, as all 65536 threads of that kernel read in the same “layer” at the same time.
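For the second kernel, I picture something like this (again a sketch; the sum is just a stand-in for the real operation):

```
// 64 blocks x 1024 threads = 65536 threads, one per (x,z) cell of the 512x128 surface.
__global__ void fillLastSlice(cudaPitchedPtr p)
{
    const int tid = blockIdx.x * blockDim.x + threadIdx.x;  // 0 .. 65535
    const int x = tid % 512;          // consecutive threads -> consecutive x
    const int z = tid / 512;

    float acc = 0.0f;                 // stand-in reduction
    for (int y = 0; y < 128; ++y) {   // walk down the y axis
        const float *row = (const float *)((const char *)p.ptr + (z * 129 + y) * p.pitch);
        acc += row[x];                // all threads read in the same y "layer" together
    }
    float *last = (float *)((char *)p.ptr + (z * 129 + 128) * p.pitch);  // 129th sub-array
    last[x] = acc;
}
```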
I would like to know whether coalescing is possible along all three axes of a 3D array, given that I write along the x axis and read along the y axis?
As long as neighbouring threads within a half-warp access elements at indices with neighbouring x-values and constant y/z values, the memory accesses will be coalesced (with a good pitch that is).
EDIT: This constraint is relaxed a bit for devices of compute capability 1.3, but the y/z values of threads in a half-warp should still remain constant.
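To illustrate with a pitched layout where the element index is (z * height + y) * pitchInElems + x (names are illustrative only):

```
// Coalesced: threadIdx.x varies x; y and z are constant across the half-warp.
data[(z * height + y) * pitchInElems + threadIdx.x] = v;

// Not coalesced: threadIdx.x varies y, so consecutive threads end up a whole
// pitch apart in memory and the accesses are serialized.
data[(z * height + threadIdx.x) * pitchInElems + x] = v;
```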
N.
No. In order for the memory accesses to coalesce, the threads need to access neighbouring values in the device's memory, which is only the case in the x direction.
What you could do is allocate some shared memory, load the values from global memory into shared memory in a coalesced way, and then perform your operations in shared memory.
Shared memory can efficiently handle access patterns where threads touch elements at varying y values, provided you avoid bank conflicts, which shouldn’t be a problem here.
Take a look at the transpose example in the CUDA SDK.
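Something along these lines (a sketch only; the reduction and the output layout are placeholders, and the dimensions are assumed to be multiples of the tile size):

```
#define TILE 16

// Launched with dim3 grid(width / TILE, height / TILE), dim3 block(TILE, TILE).
__global__ void stageTile(const float *in, float *out, int width, int pitchInElems)
{
    __shared__ float tile[TILE][TILE + 1];  // +1 column avoids bank conflicts
                                            // for column-wise access patterns

    const int x = blockIdx.x * TILE + threadIdx.x;
    const int y = blockIdx.y * TILE + threadIdx.y;

    // Coalesced load: threadIdx.x walks consecutive x addresses.
    tile[threadIdx.y][threadIdx.x] = in[y * pitchInElems + x];
    __syncthreads();

    // Now work along y (a tile column, fixed x) cheaply in shared memory.
    float acc = 0.0f;
    for (int j = 0; j < TILE; ++j)
        acc += tile[j][threadIdx.x];

    if (threadIdx.y == 0)
        out[blockIdx.y * width + x] = acc;  // one partial sum per (x, tile row)
}
```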
I just realized that I was reading along the x axis. :P
Since I have a surface of 512*128, that makes 4096 coalesced accesses (of 16 threads each), every access along the x axis.
I’ll just loop everything over the values of the y axis.