Problem with cudaMemcpy3D()

Hi everybody.

I have a problem with cudaMemcpy3D().

When I use this function to copy data from host to global memory, it is OK:

int numOfElements = dimx * dimy * dimz;

float* hostInput = new float[numOfElements];
for (int i = 0; i < numOfElements; ++i)
{
    hostInput[i] = (float)i;
}

float* deviceData = 0;
cudaPitchedPtr pitchPtr = make_cudaPitchedPtr( (void*)deviceData, dimx*sizeof(float), dimx, dimy );
printf("cudaPitchedPtr: %s\n", cudaGetErrorString(cudaGetLastError()));

cudaExtent ca_extent;
ca_extent.width  = dimx;
ca_extent.height = dimy;
ca_extent.depth  = dimz;

cudaMalloc3D( &pitchPtr, ca_extent);
cudaMemset3D( pitchPtr, 0, ca_extent);

cudaMemcpy3DParms cpy_params = {0};
cpy_params.srcPtr   = make_cudaPitchedPtr( (void*)hostInput, dimx * sizeof(float), dimx, dimy );
cpy_params.dstPtr   = pitchPtr;
cpy_params.extent   = ca_extent;
cpy_params.kind     = cudaMemcpyHostToDevice;
cudaMemcpy3D( &cpy_params );
printf("cudaMemcpy3D: %s\n", cudaGetErrorString(cudaGetLastError()));

But when I use this function to copy data from the host to a cudaArray, I get a problem:

“unspecified launch failure”

int numOfElements = dimx * dimy * dimz;

float* hostInput = new float[numOfElements];
for (int i = 0; i < numOfElements; ++i)
{
    hostInput[i] = (float)i;
}

cudaChannelFormatDesc ca_descriptor = cudaCreateChannelDesc<float>();

cudaExtent ca_extent;
ca_extent.width  = dimx;
ca_extent.height = dimy;
ca_extent.depth  = dimz;

cudaArray* cudaArrayData = 0;
cudaMalloc3DArray( &cudaArrayData, &ca_descriptor, ca_extent );
cudaBindTextureToArray( textureRef, cudaArrayData, ca_descriptor );

cudaMemcpy3DParms cpy_params = {0};
cpy_params.srcPtr   = make_cudaPitchedPtr( (void*)hostInput, dimx * sizeof(float), dimx, dimy );
cpy_params.extent   = ca_extent;
cpy_params.kind     = cudaMemcpyHostToDevice;
cpy_params.dstArray = cudaArrayData;
CUDA_SAFE_CALL( cudaMemcpy3D(&cpy_params) );
printf("cudaMemcpy3D: %s\n", cudaGetErrorString(cudaGetLastError()));

Hi,

The function documentation for cudaMalloc3D and cudaMemcpy3D seems quite confusing.

Here is my code:

Allocation:

cudaExtent pitchedVolSize = make_cudaExtent(subvolSize.width*sizeof(float), subvolSize.height, subvolSize.depth);

  CUDA_SAFE_CALL(cudaMalloc3D(&d_volPPtr, pitchedVolSize));

memcpy:

 cudaMemcpy3DParms copyParams = {0};

  copyParams.srcPtr = d_volPPtr;

  copyParams.dstPtr = make_cudaPitchedPtr((void*)h_subvol, (SUBVOL_DIM-2)*sizeof(float), SUBVOL_DIM-2, SUBVOL_DIM-2);

  copyParams.kind = cudaMemcpyDeviceToHost;

  copyParams.extent = make_cudaExtent((SUBVOL_DIM-2)*sizeof(float), SUBVOL_DIM-2, SUBVOL_DIM-2);

  copyParams.srcPos = make_cudaPos(1*sizeof(float), 1, 1); // We want to copy (SUBVOL_DIM-2)^3 starting at (1, 1, 1). 

 CUDA_SAFE_CALL(cudaMemcpy3D(&copyParams));

The documentation says that if the transfer doesn’t involve CUDA arrays, then the width parameter is always the width in bytes!
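
If I’m reading that right, the two allocation calls take their extents differently; here is a small sketch of my understanding (assuming a dimx × dimy × dimz float volume):

// Pitched linear device memory: the extent width is given in bytes
cudaPitchedPtr devPitchedPtr;
cudaExtent linearExtent = make_cudaExtent(dimx * sizeof(float), dimy, dimz);
cudaMalloc3D(&devPitchedPtr, linearExtent);

// cudaArray: the extent width is given in elements
cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
cudaExtent arrayExtent = make_cudaExtent(dimx, dimy, dimz);
cudaArray* devArray = 0;
cudaMalloc3DArray(&devArray, &desc, arrayExtent);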

I’m not very sure if my code works, but I would definitely like to see some good examples of 3D memory allocation and 3D memcpy, especially from device to host.

-Ojas

Hi,

I just figured out how to perform 3D memory transfers (and subvolume copies). If someone wants the code, I can provide it.

Cheers,
Oj

Hey,

Could you please mail me the code (or paste it here), maybe with a small description… I’m running 700×700×700 arrays and really need this. Also, if you have some code wherein blocking is done on them, that’ll be great!

Cheers,

Vandhan!

This is how I do it.

cudaChannelFormatDesc tfloatTex = cudaCreateChannelDesc<short>();
const cudaExtent tvolumeSize = make_cudaExtent(pdose->terma->x_dim, pdose->terma->y_dim, pdose->terma->z_dim);

CUDA_SAFE_CALL( cudaMalloc3DArray(&tArray, &tfloatTex, tvolumeSize) );

cudaMemcpy3DParms tcopyParams = {0};
tcopyParams.srcPtr   = make_cudaPitchedPtr((void*)pdose->terma->data, tvolumeSize.width*sizeof(short), tvolumeSize.width, tvolumeSize.height);
tcopyParams.dstArray = tArray;
tcopyParams.extent   = tvolumeSize;
tcopyParams.kind     = cudaMemcpyHostToDevice;

CUDA_SAFE_CALL( cudaMemcpy3D(&tcopyParams) );

Hope this helps

Hey thanks a ton!

Another doubt… Something very similar… Say I have an n^3 array (cube) which I need to break up into smaller arrays (cubes) of p^3 size (i.e. I have to do blocking!). The small p^3 block is worked on and another p^3 block is generated in the device kernel. I realised that copying the full n^3 array into global memory using the solution above would be very inefficient and not scalable. (Sizes start at 700×700×700 elements.)

The other solution is passing p^3 elements to the block and getting back p^3 elements. How do I do this in CUDA? [Small code with function parameters, etc. would be helpful.] When I pass these as function parameters to the kernel, don’t they get copied into shared memory? (i.e. only about 16 KB).

Finally, once I have got these p^3 cubes (from n^3/p^3 blocks), how do I join them back together to get one massive n^3 array?

Thanks,

Vandhan

I’m not sure I get what you’re asking…

There is nothing different, that I can see, in the way to pass p^3 instead of passing n^3.

You memcpy it to the device, work on it, and get it back.

You can’t pass data directly to shared memory. Threads will fetch the data from global memory into shared memory.
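
Something like this sketch is what I mean (the kernel name and block size are made up, and it’s 1D just to show the idea):

#define BLOCK_SIZE 256

// assumes the kernel is launched with blockDim.x == BLOCK_SIZE
__global__ void processBlock(const float* g_in, float* g_out, int n)
{
    __shared__ float s_data[BLOCK_SIZE];

    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // each thread fetches its own element from global memory into shared memory
    if (i < n)
        s_data[threadIdx.x] = g_in[i];
    __syncthreads();

    // ... work on s_data ...

    // write the result back to global memory
    if (i < n)
        g_out[i] = s_data[threadIdx.x];
}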

If you’re talking about using a struct and passing it as a parameter, then I’m not sure how that would work out. I’d be interested in that answer as well!

As far as building them back into a big array, you’ll be keeping track of which sub-array you’re working on; it’s just going to be a matter of setting the right offsets when copying back.

Maybe there’s some clever way to still use a big memcpy that I’m not thinking of, but as far as I can tell from this 5-minute analysis, you’ll have to memcpy it to a temporary array and then loop with the right offsets in your “big array” to copy the data item by item.

I mean, you could structure your big array to be [block1x][block1y][block1z][block2x][…]…, which would allow you to do memcpys from the device to the host, but I guess you’d have to structure that array back into shape if you have some other processing to do on it that assumes a “normal” configuration.
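
To be concrete about that restructuring, the index mapping would be something like this (a hypothetical helper, assuming n is an exact multiple of p and nb = n/p blocks per axis):

// flat index of voxel (x, y, z) when the big array is stored block by block,
// each p*p*p block being contiguous in memory
int blockedIndex(int x, int y, int z, int p, int nb)
{
    int bx = x / p, by = y / p, bz = z / p;   // which block the voxel falls in
    int lx = x % p, ly = y % p, lz = z % p;   // position inside that block
    int blockId = bx + by * nb + bz * nb * nb;
    int localId = lx + ly * p + lz * p * p;
    return blockId * p * p * p + localId;
}

With that layout each subvolume is one contiguous chunk, so it can be moved with a single plain memcpy, but you’d have to undo the mapping for anything that assumes the normal row-major layout.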

Anyway, I’ve rambled a bit more than I thought I would, with no real substance!

Hi folks,

I’ll post my memcpy (3D host -> array, host -> device, device -> host) and blocking code later this afternoon. I use cudaArrays and bind them to 3D textures so that I get the speedup while accessing 3D elements (using shared mem is another possibility).
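
The texture part itself is the standard setup, roughly this (just a sketch, the texture reference name is arbitrary and it lives at file scope):

texture<float, 3, cudaReadModeElementType> volTex;   // 3D texture bound to d_volArray

// after cudaMalloc3DArray() and the host -> array copy:
volTex.normalized = false;
volTex.filterMode = cudaFilterModePoint;
CUDA_SAFE_CALL(cudaBindTextureToArray(volTex, d_volArray, channelDesc));

// inside the kernel, elements are then read with tex3D(volTex, x, y, z)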

On the host side, as Ailleur has said, you must do some memory management to keep the subvolume offsets correct. My application required shared subvolumes (i.e., subvolumes with a 1- or 2-voxel shared border). Otherwise, if you just need to slice the volume without subvolume sharing, the code will look cleaner.

Will post it soon,

Cheers,
Oj

#define SUBVOL_DIM 128

cudaExtent volSize = make_cudaExtent(...);

cudaArray *d_volArray = 0; // subvolume bound to texture 

cudaPitchedPtr d_volPPtr;// subvolume in device memory

float* h_vol = NULL; // The full volume on host

int iDivUp(int a, int b)

{

  return ((a % b) != 0)? (a / b + 1): (a / b);

}

//Initialization and mem allocations

void initCuda()

{

...

cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();

cudaExtent subvolSize = make_cudaExtent(SUBVOL_DIM, SUBVOL_DIM, SUBVOL_DIM);

CUDA_SAFE_CALL(cudaMalloc3DArray(&d_volArray, &channelDesc, subvolSize));

...

cudaExtent pitchedVolSize = make_cudaExtent(SUBVOL_DIM*sizeof(float), SUBVOL_DIM, SUBVOL_DIM); 

CUDA_SAFE_CALL(cudaMalloc3D(&d_volPPtr, pitchedVolSize));

...

}

Host to array copy:

void copy3DHostToArray(float *_src, cudaArray *_dst, cudaExtent copy_extent, cudaPos src_offset)

{

  cudaMemcpy3DParms copyParams = {0};

  float *h_source = _src + src_offset.x + src_offset.y*volSize.width + src_offset.z*volSize.width*volSize.height;

  copyParams.srcPtr = make_cudaPitchedPtr((void*)h_source, volSize.width*sizeof(float), volSize.width, volSize.height);

  copyParams.dstArray = _dst;

  copyParams.kind = cudaMemcpyHostToDevice;

  copyParams.extent = copy_extent;

 CUDA_SAFE_CALL(cudaMemcpy3D(&copyParams));

  CUT_CHECK_ERROR("Host -> Array Memcpy failed\n");

}

Device mem to array copy:

void copy3DMemToArray(cudaPitchedPtr _src, cudaArray *_dst)

{

  cudaMemcpy3DParms copyParams = {0};

  copyParams.srcPtr =  _src;

  copyParams.dstArray = _dst;

  copyParams.kind = cudaMemcpyDeviceToDevice;

  copyParams.extent = make_cudaExtent(SUBVOL_DIM, SUBVOL_DIM, SUBVOL_DIM);

 CUDA_SAFE_CALL(cudaMemcpy3D(&copyParams));

  CUT_CHECK_ERROR("Mem -> Array Memcpy failed\n");

}

Device mem to host mem copy:

void copy3DMemToHost(cudaPitchedPtr _src, float *_dst, cudaExtent copy_extent, cudaExtent dst_extent, cudaPos src_offset, cudaPos dst_offset)

{

  cudaMemcpy3DParms copyParams = {0};

  copyParams.srcPtr = _src;

  float *h_target = _dst + dst_offset.x + dst_offset.y*dst_extent.width + dst_offset.z*dst_extent.width*dst_extent.height;//For some reason, using copyParams.dstPos doesn't give correct results, so we set the offset here.

  copyParams.dstPtr = make_cudaPitchedPtr((void*)h_target, dst_extent.width*sizeof(float), dst_extent.width, dst_extent.height);

  copyParams.kind = cudaMemcpyDeviceToHost;

  copyParams.extent = make_cudaExtent(copy_extent.width*sizeof(float), copy_extent.height, copy_extent.depth);

  copyParams.srcPos = make_cudaPos(src_offset.x*sizeof(float), src_offset.y, src_offset.z); // We want to copy copy_extent sized volume starting at (x_off, y_off, z_off).

 CUDA_SAFE_CALL(cudaMemcpy3D(&copyParams));

  CUT_CHECK_ERROR("Mem -> Host Memcpy failed\n");

}

Memory management (note that there is a one voxel border around every subvolume which is shared with other subvolumes):

  

cudaExtent subvolIndicesExtents = make_cudaExtent(iDivUp(volSize.width-2, SUBVOL_DIM-2), iDivUp(volSize.height-2, SUBVOL_DIM-2), iDivUp(volSize.depth-2, SUBVOL_DIM-2));

for (int _z = 0; _z < subvolIndicesExtents.depth; _z++)
  for (int _y = 0; _y < subvolIndicesExtents.height; _y++)
    for (int _x = 0; _x < subvolIndicesExtents.width; _x++)
    {
      // copy the subvolume to the texture array
      copy3DHostToArray(h_vol, d_volArray, make_cudaExtent(SUBVOL_DIM, SUBVOL_DIM, SUBVOL_DIM), make_cudaPos(_x*(SUBVOL_DIM-2), _y*(SUBVOL_DIM-2), _z*(SUBVOL_DIM-2)));
      //fprintf(stderr, "->%s", cudaGetErrorString(cudaGetLastError()));

      // run a kernel on the subvolume: reads from the texture (via d_volArray) and writes to d_volPPtr
      d_kernel<<<gridSize, blockSize>>>(d_volPPtr, ...);
      CUT_CHECK_ERROR("Kernel failed");
      //fprintf(stderr, "---%d-%d-%d %s---", _x, _y, _z, cudaGetErrorString(cudaGetLastError()));
      cudaThreadSynchronize();

      // copy results back to host mem from device mem
      dst_off.x = 1 + _x*(SUBVOL_DIM-2); dst_off.y = 1 + _y*(SUBVOL_DIM-2); dst_off.z = 1 + _z*(SUBVOL_DIM-2);
      copy3DMemToHost(d_volPPtr, h_phi, copyvol, volSize, src_off, dst_off);
      //fprintf(stderr, "%s<-\n", cudaGetErrorString(cudaGetLastError()));
    }

Notes:

  1. copy3DMemToHost() is the most generic of the three functions, but it shouldn’t be difficult to do the same with the other two; I have hard-coded some values in them.

  2. In my experience, setting a copy position (offset) on the host pitched pointer (host memory) never worked. So I set the correct offset myself using pointer arithmetic, and that works. Does anyone know why? Setting offsets on device memory and arrays always works.

  3. In the case of normal 3D memory on the device and host, always set the first argument of cudaPos and cudaExtent to the offset/size along X in bytes, while with arrays it has to be the number of elements in the X direction. Anyway, this is documented in the API guide.

Most of the code is trivial and taken from the CUDA examples.

Hope it helps.

Cheers,

Ojaswa

Hi,

Another small doubt: how do I go about accessing the elements of d_volArray and d_volPPtr in the kernel? I want to read from d_volArray, run some modifications, and store the result in d_volPPtr. A small code sample for the kernel function that does this would be awesome.

Thanks,

Vandhan!