Setting up 3d arryas I have some questions about how to use 3d arrays and cudaArrays


I am devoloping a method that iteratively updates values in a 3 dimensional vector field, and I would like to use the automatic interpolation from textures (cudaArrays).

As far as I understand CUDA, a way to do this, is to use both an 3d cudaArray allocated using cudaMalloc3DArray tohether with a pointer array allocated with cudaMalloc3D.

In my iterative step, I then read my values from the cudaArray as a texture, write the values to the pointer array, and finally I do a device to device copy of the written content to the cudaArray.

My problem is that it seems a bit odd how the arrays are allocated and how data is copied between them. I represent a 64x64x64 grid of float 4 values like this:

cudaArray *darray = NULL;

const cudaExtent volumeSize = make_cudaExtent(64, 64, 64);

cudaChannelFormatDesc desc = cudaCreateChannelDesc<float4>();

cudaMalloc3DArray(&darray, &desc, volumeSize);

cudaPitchedPtr dPtr;

cudaMalloc3D(&dPtr, volumeSize);

cudaMemset3D(dPtr, 0, volumeSize);

// Make a device to device copy

cudaMemcpy3DParms copyParams = {0};

copyParams.srcPtr   = dPtr;

copyParams.dstArray = darray;

copyParams.extent   = volumeSize;

copyParams.kind	 = cudaMemcpyDeviceToDevice;


This last copy fails. I geuss it is because the sizes of the arryas do not match.

The cudaArray is allocated with inforamtion about the float4 size, the other one is not. Should I use another volumeSize for the pointer array, like this maybe?

const cudaExtent volumeSize = make_cudaExtent(sizeof(float4)*64, 64, 64);

And finaly, I dont understand the pitch of the cudaPitchedPtr of a 3D array. How am I supposed to use it for correct indexing?

That’s right.

Pitch is fake width. To use it, take an expression such as “array[jwidth + i]" and use "array[jpitch + i]”. Fake width is used to get each row to start on an aligned boundary, even if the true width is an odd size.

Actually cudaextent is in elements, not bytes, so (64,64,64) is right.

As to why it doesnt work, i cant spot it, but have not done it myself. Closest thing i have used is from linear memory to a 3d array.

-edit: ignore me, corrected bellow

For CUDA Arrays, not for 3D linear memory. From the doc: “cudaMalloc3D Allocates at least widthheightdepth bytes of linear memory” where ‘width’ etc is taken from the extent. How is the cudaMalloc3D function supposed to guess the size of the datatype? (cudaMalloc3DArray gets this info from the format descriptor)

Ah my bad i was only looking at the line were cudaMalloc3DArray is used without noticing he is using the cudaextent for the other allocation too!

Carry on!

Great thanks, that makes sense.

Now, when I am doing the device to device copy to the cudaArray like:

// Make a device to device copy

cudaMemcpy3DParms copyParams = {0};

copyParams.srcPtr   = dPtr;

copyParams.dstArray = darray;

copyParams.extent   = volumeSize;

copyParams.kind	 = cudaMemcpyDeviceToDevice;


The extend must be the small volume (64x64x64) for the copy not to fail. I havent been able to confirm that this is correct, do you know?

How do I use the pitch value to index correctly in my kernels. If I fx have the x,y,z coordinates of the grid, all in [0; 63]. How do I get the proper index of the pointer array?

Would a read from the texture and write to the array look like this?:

int idx = x*sizeof(float4) + y*pitch + z*64*pitch;

array[idx] = tex3D(texref,x,y,z);
  • thanks

You don’t need the sizeof when indexing the texture. Also, idx should actually be a float. Lastly, make sure you’ve bound the texture before accessing it (and set the modes correctly).

I have a similar problem, I’m trying to convert the example convolutionFFT2D to convolutionFFT3D but I get some problems with the memory allocation and copying, I get complaints that there is no constructor for converting from cudaPitchedPtr to Complex *, what is a cudaPitchedPtr anyway?


typedef float2 Complex;










	cudaChannelFormatDesc float2tex 

		= cudaCreateChannelDesc<float2>();

	//Complex *d_PaddedKernel, *d_PaddedData;

	cudaPitchedPtr d_PaddedKernel, d_PaddedData;	

	cufftHandle FFTplan;

	cudaExtent VOLUME_SIZE;


		rCPU, rGPU;


		max_delta_ref, delta, ref, sum_delta2, sum_ref2, L2norm;

	int i, x, y, z;

	unsigned int hTimer;

	// use command-line specified CUDA device, otherwise use device with highest Gflops/s

	if ( cutCheckCmdLineFlag(argc, (const char **)argv, "device"))

		cutilDeviceInit(argc, argv);


		cudaSetDevice( cutGetMaxGflopsDeviceId() );

	cutilCheckError( cutCreateTimer(&hTimer) );

	printf("Input data size		   : %i x %i x %i\n", DATA_W,			 DATA_H, DATA_D			);

	printf("Convolution kernel size   : %i x %i x  %i\n", KERNEL_W,		   KERNEL_H,	   KERNEL_D		  );

	printf("Padded volume size		 : %i x %i x %i\n", DATA_W + PADDING_W, DATA_H + PADDING_H, DATA_D + PADDING_D);

	printf("Aligned padded image size : %i x %i x %i\n", FFT_W,			  FFT_H, FFT_D			 );

	printf("Allocating memory...\n");

		h_Kernel	   = (Complex *)malloc(KERNEL_SIZE);

		h_Data		 = (Complex *)malloc(DATA_SIZE);

		h_ResultCPU	= (Complex *)malloc(DATA_SIZE);

		h_ResultGPU	= (Complex *)malloc(FFT_SIZE);


	//cutilSafeCall( cudaMallocArray(&a_Kernel, &float2tex, KERNEL_W, KERNEL_H) );

		//cutilSafeCall( cudaMallocArray(&a_Data,   &float2tex,   DATA_W,   DATA_H) );


	VOLUME_SIZE = make_cudaExtent(sizeof(Complex) * KERNEL_W, KERNEL_H, KERNEL_D);

		cutilSafeCall( cudaMalloc3DArray(&a_Kernel, &float2tex, VOLUME_SIZE));


	VOLUME_SIZE = make_cudaExtent(sizeof(Complex) * DATA_W, DATA_H, DATA_D);

		cutilSafeCall( cudaMalloc3DArray(&a_Data, &float2tex, VOLUME_SIZE));

	cutilSafeCall( cudaMalloc((void **)&d_PaddedKernel, FFT_SIZE) );

		cutilSafeCall( cudaMalloc((void **)&d_PaddedData,   FFT_SIZE) );


	printf("Generating random input data...\n");


		for(i = 0; i < (KERNEL_W * KERNEL_H * KERNEL_D); i++){

			h_Kernel[i].x = (float)rand() / (float)RAND_MAX;

			h_Kernel[i].y = 0;


		for(i = 0; i < (DATA_W * DATA_H * DATA_D); i++){

			h_Data[i].x = (float)rand() / (float)RAND_MAX;

			h_Data[i].y = 0;


	printf("Creating FFT plan for %i x %i x%i...\n", FFT_W, FFT_H, FFT_D);

		cufftSafeCall( cufftPlan3d(&FFTplan, FFT_H, FFT_W, FFT_D,  CUFFT_C2C) );

	printf("Uploading to GPU and padding convolution kernel and input data...\n");

		printf("...initializing padded kernel and data storage with zeroes...\n");


	//cutilSafeCall( cudaMemset(d_PaddedKernel, 0, FFT_SIZE) );

		//cutilSafeCall( cudaMemset(d_PaddedData,   0, FFT_SIZE) );


	//VOLUME_SIZE = make_cudaExtent(sizeof(Complex) * FFT_W, FFT_H, FFT_D);

	//cutilSafeCall( cudaMemset3D(d_PaddedKernel, 0, VOLUME_SIZE) );

		//cutilSafeCall( cudaMemset3D(d_PaddedData,   0, VOLUME_SIZE) );


		printf("...copying input data and convolution kernel from host to CUDA arrays\n");

		//cutilSafeCall( cudaMemcpyToArray(a_Kernel, 0, 0, h_Kernel, KERNEL_SIZE, cudaMemcpyHostToDevice) );

		//cutilSafeCall( cudaMemcpyToArray(a_Data,   0, 0, h_Data,   DATA_SIZE,   cudaMemcpyHostToDevice) );

	// Make a device to device copy for the kernel volume

	VOLUME_SIZE = make_cudaExtent(sizeof(Complex) * KERNEL_W, KERNEL_H, KERNEL_D);

	cudaMemcpy3DParms copyParams = {0};

	copyParams.srcPtr   = h_Kernel;

	copyParams.dstArray = a_Kernel;

	copyParams.extent   = VOLUME_SIZE;

	copyParams.kind	 = cudaMemcpyDeviceToDevice;


	// Make a device to device copy for the data volume

	VOLUME_SIZE = make_cudaExtent(sizeof(Complex) * DATA_W, DATA_H, DATA_D);

	cudaMemcpy3DParms copyParams = {0};

	copyParams.srcPtr   = h_Data;

	copyParams.dstArray = a_Data;

	copyParams.extent   = VOLUME_SIZE;

	copyParams.kind	 = cudaMemcpyDeviceToDevice;


		printf("...binding CUDA arrays to texture references\n");

		cutilSafeCall( cudaBindTextureToArray(texKernel, a_Kernel) );

		cutilSafeCall( cudaBindTextureToArray(texData,   a_Data)   );

		//Block width should be a multiple of maximum coalesced write size 

		//for coalesced memory writes in padKernel() and padData()

		//dim3 threadBlock(16, 12);

		//dim3 kernelBlockGrid(iDivUp(KERNEL_W, threadBlock.x), iDivUp(KERNEL_H, threadBlock.y));

		//dim3 dataBlockGrid(iDivUp(FFT_W, threadBlock.x), iDivUp(FFT_H, threadBlock.y));


	dim3 threadBlock(16, 12, 12);

		dim3 kernelBlockGrid(iDivUp(KERNEL_W, threadBlock.x), iDivUp(KERNEL_H, threadBlock.y), iDivUp(KERNEL_D, threadBlock.z));

		dim3 dataBlockGrid(iDivUp(FFT_W, threadBlock.x), iDivUp(FFT_H, threadBlock.y), iDivUp(FFT_D, threadBlock.z));

I had quite a bit of trouble figuring this out as well. Here’s what I learned:

When using cudaMalloc3D, a cudaExtent takes as it’s X parameter the dimension of a row of the 3D array in bytes. For instance, if I am allocating an 8 x 4 x 4 array of floats, my cudaExtent is created as follows:

cudaExtent extent = make_cudaExtent(8*sizeof(float), 4, 4)

Note, this dimension has nothing to do with the memory pitch of the array. Next, I can call cudaMalloc3D to create a 3d array of floats:

[codebox]#define X 8

#define Y 4

#define Z 4

cudaPitchedPtr ptr_d;

cudaExtent extent = make_cudaExtent(X*sizeof(float), Y, Z);

cudaMalloc3D(&arr_d, extent);[/codebox]

I can copy into this array using cudaMemcpy3D. To call cudaMemcpy3D I must create cudaMemcpy3DParms (be careful, the function name ends in the string Parms, not the string Params). In the following example code, I will create a host array called arr_h to copy data from. I will copy the contents of arr_h into the host memory that we just malloced, pointed to by the cudaPitchedPtr ptr_d.

[codebox]float arr_h[XYZ];

// Fill arr_h with arbitrary data

// copy from host memory to device memory

cudaMemcpy3DParms copyParams = {0};

copyParams.srcPtr = make_cudaPitchedPtr((void*)arr_h, X*sizeof(float), X, Y); // arg0 is a pointer to host memory, arg1 is the width of our host array in bytes, arg2 is the width of our host array in elements, arg3 is the height of our host array in elements)

copyParams.dstPtr = arr_d; //cudaPitchedPtr to device memory initialized by cudaMalloc3D

copyParams.extent = extent; //the extent describing our device array

copyParams.kind = cudaMemcpyHostToDevice;


Now, we can use the device array pointed to by the cudaPitchedPtr ptr_d in kernel calls. The following is an example kernel call that takes our 3d array as a parameter. The kernel iterates through all of the values of our 3d array. Notice how we compute the memory offset of our array elements differently than we would if the allocation was completely linear:

[codebox]void global set(cudaPitchedPtr arr_d, cudaExtent extent, float val) {

char* devPtr = (char*)arr_d.ptr;

size_t pitch = arr_d.pitch; //Rows are pitch bytes apart

size_t slicePitch = pitch * extent.height; //Slices are pitch * extent.height bytes apart

for (int z = 0; z < extent.depth; ++z) {

	char* slice = devPtr + z * slicePitch;  //To access slice S, add s * the slicePitch to the original cudaPitchPtr.ptr 

	for (int y = 0; y < extent.height; ++y) {

		float* row = (float*)(slice + y * pitch);

		for (int x = 0; x < (extent.width/sizeof(float)); ++x) {  //Note: extent.width is the row size in bytes, not in elements!

			row[x] = val;





The kernel could be launched as follows:

set<<<1, 1>>>(arr_d, extent, 2.0);

To copy back from device to host, we create a new instance of cudaMemcpy3DParms:

[codebox]cudaMemcpy3DParms copyParamD2H = {0};

copyParamsD2H.srcPtr = arr_d;

copyParamsD2H.dstPtr = make_cudaPitchedPtr((void*)ptr_h, X*sizeof(float), X, Y);

copyParamsD2H.extent = extent;

copyParamsD2H.kind = cudaMemcpyDeviceToHost;



Please let me know if you find any errors in these examples.

Sorry to dredge up an older topic, but this is about an issue I’m currently facing. I’m trying to understand how to use cudaMalloc3D and friends, but the Programming Guide is confusing me. In fact, based on the above listings, I think that the programming guide is incorrect. In version 2.3.1 of the manual (downloaded today), there’s a listing very similar to that given above. However, there’s a declaration

cudaExtent extent = make_cudaExtent(64, 64, 64);

used to allocate the array, which AIUI will allocate 64[sup]3[/sup] bytes of memory. It is then accessed via

char* devPtr = devPitchedPtr.ptr;

size_t pitch = devPitchedPtr.pitch;

size_t slicePitch = pitch * extent.height;

for (int z = 0; z < extent.depth; ++z) {

	char* slice = devPtr + z * slicePitch;

	for (int y = 0; y < extent.height; ++y) {

		float* row = (float*)(slice + y * pitch);

		for (int x = 0; x < extent.width; ++x) {

			float element = row[x];




The innermost loop is going to attempt to iterate over 64 floats, which occupy more than the 64 bytes allocated previously. This loop is going to run off the end of the array.

Based on the above listing, the declaration should be

cudaExtent extent = make_cudaExtent(64*sizeof(float), 64, 64);

and the innermost loop should be

for( int x=0; x< extent.width/sizeof(float); x++ )

Is this correct?