reads in a 3d cuda array Can reads/writes be coalesced on the 3 axis ?

Hi all,

I have a grid set with dim3 GridDim(1,128,1)
Each block is set with dim3 BlockDim(16,1,1)

I allocate a 3D cuda array sized (x=512,y=129,z=128)

Each block writes in his respective 2d “sub”-array of the 3d array (sub array selected with blockIdx.y).
That sub-array is located on the x and z axis. There are 129 sub arrays.
Each thread writes in a cell of that particular sub-array.
The 16 threads of a block write in the sub-array in a coalesced way on the x axis( thread k writes in cell x=k).
Writes are looped until the 128 blocks have ended filling their own sub-array.

When that is done, another kernel is launched with 64 blocks of 1024 threads.
The goal of that kernel is to fill the last (the 129th) sub-array on the y axis.
Each thread receives its own (x,z) coordonates and read through the y axis. Therefore the reads should be coalesced as all the 65536 threads of that kernel read in the same “layer” at the same time.

I would like to know if the coalescence is true on the three axis of a 3d array as i write on the x axis and read on the y axis ?

Thanks ^^

As long as neighbouring threads within a half-warp access elements at indices with neighbouring x-values and constant y/z values, the memory accesses will be coalesced (with a good pitch that is).

EDIT: This constraint is relaxed a bit for devices of compute capability 1.3, but the y/z values of threads in a half-warp should still remain constant.

As i Have Compute capability 1.0, i’ll stick to the first one :)

So would the coalescence be kept with variable and neighboring y-values and constant x/z instead of variable x and constant y/z ?

No. In order for the memory accesses to coalesce, the threads need to access neighbouring values in the devices memory, which is only the case in the x-direction.

What you could do is allocate some shared memory and load the values from global memory into shared memory in a coalesced way and then perform your operations

in shared memory.

Shared memory can handle access patterns where the threads access elements at various y-values within shared memory space efficiently, as long as you reduce bank conflicts, which shouldn’t be a problem.

Take a look at the transpose example in the CUDA SDK.


I just realized that i was reading on the x axis. :P
As i have a surface of 512*128, that would do 4096 coalesced accesses (of 16) each access on the x axis.

i’ll just loop everything of the value of the y axis.

Thanks for the help !!! :)

I’m not sure i’ve seen any information about writing in a 3D cuda array from kernel…

Is that possible ? :huh:

No, you can’t write to cuda arrays, only linear memory.


<img src=‘http://hqnveipbwb20/public/style_emoticons/<#EMO_DIR#>/crying.gif’ class=‘bbc_emoticon’ alt=’:’(’ />

Thanks for the tip !

Don’t know if this helps you but, from the programming guide

The following code sample allocates a width×height×depth 3D array of

floating-point values and shows how to loop over the array elements in device code:

// Host code

cudaPitchedPtr devPitchedPtr;

cudaExtent extent = make_cudaExtent(64, 64, 64);

cudaMalloc3D(&devPitchedPtr, extent);

myKernel<<<100, 512>>>(devPitchedPtr, extent);

// Device code

__global__ void myKernel(cudaPitchedPtr devPitchedPtr,

cudaExtent extent)


	char* devPtr = devPitchedPtr.ptr;

	size_t pitch = devPitchedPtr.pitch;

	size_t slicePitch = pitch * extent.height;

	for (int z = 0; z < extent.depth; ++z) 


		char* slice = devPtr + z * slicePitch;

		for (int y = 0; y < extent.height; ++y) 


			float* row = (float*)(slice + y * pitch);

			for (int x = 0; x < extent.width; ++x) 


				float element = row[x];