copy 3D data from host to device

Hi Guys,

I am writing a CUDA program to solve a PDE. As a first step, I need to copy a 3D array from host memory to device memory.

I have the following code:

float *phi = new float[DimX*DimY*DimZ];

// some computations on phi ;

extern "C"
void ReinitializeCUDA(float *phi, const char *obj_cpu, const char *movobj_cpu, const char *source,
                      bool init, int iterations, int I, int J, int K,
                      float delta, float dtau, float eps, float limit,
                      int dim[])
{
    // dim[0] = DimX, dim[1] = DimY, dim[2] = DimZ
    cudaPitchedPtr phi_gpu1;
    cudaExtent ca_extent = make_cudaExtent(dim[0]*sizeof(float), dim[1], dim[2]);
    cudaMalloc3D(&phi_gpu1, ca_extent);
    cudaMemset3D(phi_gpu1, 0, ca_extent);

    cudaMemcpy3DParms cpy_params = {0};
***** cpy_params.srcPtr = make_cudaPitchedPtr((void*)phi, dim[0] * sizeof(float), dim[1], dim[2]);
    cpy_params.dstPtr = phi_gpu1;
    cpy_params.extent = ca_extent;
    cpy_params.kind = cudaMemcpyHostToDevice;
    cudaMemcpy3D(&cpy_params);
}

Later on, I will copy the results computed by a kernel back to phi, as below:

cudaMemcpy3DParms dhcpy_params = {0};
    dhcpy_params.srcPtr = phi_gpu1;
***** dhcpy_params.dstPtr = make_cudaPitchedPtr((void*)phi, dim[0] * sizeof(float), dim[1], dim[2]);
    dhcpy_params.extent = ca_extent;
    dhcpy_params.kind = cudaMemcpyDeviceToHost;
    cudaMemcpy3D(&dhcpy_params);
    printf("cudaMemcpy3D: %s\n", cudaGetErrorString(cudaGetLastError()));

My question is about the two lines marked with *****. Should I use

cpy_params.srcPtr = make_cudaPitchedPtr( (void*)phi, dim[0] * sizeof(float), dim[1], dim[2] );

// ....

dhcpy_params.dstPtr = make_cudaPitchedPtr( (void*)phi, dim[0] * sizeof(float), dim[1], dim[2] );

or

cpy_params.srcPtr = make_cudaPitchedPtr( (void*)phi, dim[0] * sizeof(float), dim[0], dim[1] );

// ....

dhcpy_params.dstPtr = make_cudaPitchedPtr( (void*)phi, dim[0] * sizeof(float), dim[0], dim[1] );

Or, to put it another way: how do I properly copy a 3D array from the host into device memory pointed to by a cudaPitchedPtr?

Thanks.

merlin

You must use the second suggestion.

but also, this is wrong: cudaExtent ca_extent = make_cudaExtent(dim[0]*sizeof(float), dim[1], dim[2]);
it should be: cudaExtent ca_extent = make_cudaExtent(dim[0], dim[1], dim[2]);

Thanks for your reply. I also think the second approach is correct, judging from the GPU SDK examples. I am just curious how the CUDA runtime knows the third dimension and copies enough data.

Are you sure about this? From the Reference Manual:

cudaError_t cudaMalloc3D(struct cudaPitchedPtr *pitchedDevPtr, struct cudaExtent extent)

Allocates at least width * height * depth bytes of linear memory on the device and returns a cudaPitchedPtr in which ptr is a pointer to the allocated memory. The function may pad the allocation to ensure hardware alignment requirements are met. The pitch returned in the pitch field of pitchedDevPtr is the width in bytes of the allocation.

The returned cudaPitchedPtr contains additional fields xsize and ysize, the logical width and height of the allocation, which are equivalent to the width and height extent parameters provided by the programmer during allocation.

For allocations of 2D and 3D objects, it is highly recommended that programmers perform allocations using cudaMalloc3D() or cudaMallocPitch(). Due to alignment restrictions in the hardware, this is especially true if the application will be performing memory copies involving 2D or 3D objects (whether linear memory or CUDA arrays).

Parameters:

pitchedDevPtr - Pointer to allocated pitched device memory

extent - Requested allocation size

It says “allocates at least width * height * depth bytes of linear memory on the device and returns a cudaPitchedPtr in which ptr is a pointer to the allocated memory”.

If you use

cudaExtent ca_extent = make_cudaExtent(dim[0], dim[1], dim[2]);

your allocation can only hold dim[0]*dim[1]*dim[2] bytes of data, while I need (dim[0]*dim[1]*dim[2])*sizeof(float), which is 4*(dim[0]*dim[1]*dim[2]) bytes. Or am I missing something here?

Ahh, you might be right. I use cudaMalloc3DArray, to which you also supply a channel descriptor (so it knows the element's size).