Copying memory to and from 3D pitched pointers

Hi, I wrote the following two functions for copying memory to/from 3D pitched pointers (modified from older functions I found on this forum):

void copy3DHostToPitchedPtr(float *_src, cudaPitchedPtr _dst, int width, int height, int depth)
{
  cudaExtent copy_extent = make_cudaExtent(width*sizeof(float), height, depth);
  cudaMemcpy3DParms copyParams = {0};
  float *h_source = _src;

  copyParams.srcPtr = make_cudaPitchedPtr((void*)h_source, copy_extent.width, copy_extent.width/sizeof(float), copy_extent.height);
  copyParams.dstPtr = _dst;
  copyParams.kind   = cudaMemcpyHostToDevice;
  copyParams.extent = copy_extent;

  CUDA_SAFE_CALL(cudaMemcpy3D(&copyParams));
  CUT_CHECK_ERROR("Host -> Device Memcpy failed\n");
}

and:

void copy3DPitchedPtrToHost(cudaPitchedPtr _src, float *_dst, int width, int height, int depth)
{
  cudaExtent copy_extent = make_cudaExtent(width*sizeof(float), height, depth);
  cudaMemcpy3DParms copyParams = {0};
  float *h_dest = _dst;

  copyParams.srcPtr = _src;
  copyParams.dstPtr = make_cudaPitchedPtr((void*)h_dest, width*sizeof(float), width, height);
  copyParams.kind   = cudaMemcpyDeviceToHost;
  copyParams.extent = copy_extent;

  CUDA_SAFE_CALL(cudaMemcpy3D(&copyParams));
  CUT_CHECK_ERROR("Device -> Host Memcpy failed\n");
}

I also wrote a small test kernel that takes a 3D pitched pointer and populates it with values derived from the element indices:

__global__ void testKernel(cudaPitchedPtr v, int width, int height, int depth) {
	//calculate x y z coordinates just for test purposes
	unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
	unsigned int y = blockIdx.x*blockDim.y + threadIdx.y;
	unsigned int z = blockIdx.x*blockDim.z + threadIdx.z;

	char* devPtr = (char *) v.ptr;
	size_t pitch = v.pitch;
	size_t slicePitch = pitch*height;
	char* slice = devPtr + z*slicePitch;
	float* row = (float*)(slice + y*pitch);

	//overwrite element
	row[x] = x + y + z;
}

However, when I execute the main function I get an error while copying back from device to host (‘invalid configuration argument’). I also find it quite weird that if I do not call my test kernel, but simply copy the memory host->device and then device->host straight away, I get no errors and a correct memory transfer. Here is the main function:

void cudaTest(float *v, int width, int depth, int height) {
	cudaPitchedPtr vGPU_;
	width_  = width;
	height_ = height;
	depth_  = depth;

	//allocate memory
	cudaExtent pExtent = make_cudaExtent(width*sizeof(float), height, depth);
	CUDA_SAFE_CALL(cudaMalloc3D(&vGPU_, pExtent));

	//copy host->device
	copy3DHostToPitchedPtr(v, vGPU_, width_, height_, depth_);

	//kernel execution
	dim3 threadBlock(8, 8, 2);
	dim3 gridBlock(iDivUp(width_, threadBlock.x), iDivUp(height_, threadBlock.y), iDivUp(depth_, threadBlock.z));
	testKernel<<<gridBlock, threadBlock>>>(vGPU_, width_, height_, depth_);

	//copy device->host
	copy3DPitchedPtrToHost(vGPU_, v, width_, height_, depth_);
}

I am not too sure what I am doing wrong with such a simple piece of code. I am using a GeForce GTX 280 (compute capability 1.3) and CUDA 4.0.

“Invalid configuration argument” is a kernel launch error. cudaMemcpy and its derivatives will also return errors from earlier kernel launches.

In other words, test to make sure that your kernel is launching correctly. You likely have a grid dimension that is 0 (or one that exceeds the device’s limits).
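For example, something along these lines around the launch (just a sketch, using the gridBlock/threadBlock/vGPU_ names from your cudaTest(); needs <cstdio>) would tell you whether the launch configuration itself is the problem:

// print the launch configuration and check the launch itself (sketch only)
printf("grid = (%u, %u, %u)  block = (%u, %u, %u)\n",
       gridBlock.x, gridBlock.y, gridBlock.z,
       threadBlock.x, threadBlock.y, threadBlock.z);

testKernel<<<gridBlock, threadBlock>>>(vGPU_, width_, height_, depth_);

cudaError_t launchErr = cudaGetLastError();   // picks up the launch failure directly
if (launchErr != cudaSuccess)
    fprintf(stderr, "kernel launch failed: %s\n", cudaGetErrorString(launchErr));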

Thanks for the reply. When I get the error I also get a line number, which points to the cudaMemcpy3D call in the copy3DPitchedPtrToHost function. I double-checked the grid size just to make sure, and the values appear to be legal. Could it be something in the size of the host destination pitched pointer?

void copy3DPitchedPtrToHost(cudaPitchedPtr _src, float *_dst, int width, int height, int depth)
{
  cudaExtent copy_extent = make_cudaExtent(width*sizeof(float), height, depth);
  cudaMemcpy3DParms copyParams = {0};
  float *h_dest = _dst;

  copyParams.srcPtr = _src;
  copyParams.dstPtr = make_cudaPitchedPtr((void*)h_dest, width*sizeof(float), width, height);
  copyParams.kind   = cudaMemcpyDeviceToHost;
  copyParams.extent = copy_extent;

  CUDA_SAFE_CALL(cudaMemcpy3D(&copyParams));
  CUT_CHECK_ERROR("Device -> Host Memcpy failed\n");
}

Hmm, well you can still get a kernel launch error through CUDA_SAFE_CALL, because CUDA_SAFE_CALL only reports whatever cudaMemcpy3D returns. cudaMemcpy3D will return an error from your kernel launch, and CUDA_SAFE_CALL will make it look as though the error originated from cudaMemcpy3D itself. The error you say you are getting only comes from a kernel launch failure. cudaMemcpy3D can return cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidDevicePointer, cudaErrorInvalidPitchValue, cudaErrorInvalidMemcpyDirection, or an error from any earlier asynchronous call (your kernel).

Try the following changes to your main code:

int main(void)
{
   cudaError_t status;

   // host code....

   // kernel launch ....

   // Immediately following your kernel launch:
   cudaDeviceSynchronize();
   status = cudaGetLastError();
   if (status != cudaSuccess) { fprintf(stderr, "kernel error: %s\n", cudaGetErrorString(status)); }

   // rest of your code...
}

If that doesn’t work, here is how I set up my cudaMemcpy3D parameters for a device-to-host copy:

cudaMemcpy3DParms p = { 0 };

// h_extent is the extent used when allocating the 3D device memory
int x = h_extent.width / sizeof(float);
int y = h_extent.height;
int z = h_extent.depth;

p.srcPtr.ptr   = h_pitchedptr.ptr;
p.srcPtr.pitch = h_pitchedptr.pitch;
p.srcPtr.xsize = x;
p.srcPtr.ysize = y;

p.dstPtr.ptr   = (void*)src;        // 'src' here is the host buffer being copied into
p.dstPtr.pitch = x * sizeof(float);
p.dstPtr.xsize = x;
p.dstPtr.ysize = y;

p.extent.width  = x * sizeof(float);
p.extent.height = y;
p.extent.depth  = z;

p.kind = cudaMemcpyDeviceToHost;

But like I said, I’m pretty confident that the problem is in your kernel.

You are absolutely spot on. It seems that the problem is the z size of my grid, which throws that error whenever the third dimension is > 1. I looked around the forum and it seems that, even though gridBlock is a 3D vector, the only legal value for its third dimension is 1. However, it is now unclear to me how to generate the indices within a kernel if I have a 3D thread block and only a 2D grid…

Yeah the problem is probably that compute capability 1.3 does not support 3D grids.

What you can do is use the y-dimension of your grid for both the y and z indices (so the grid’s y dimension has to cover height*depth rows):

gidy    = blockDim.y*blockIdx.y + threadIdx.y;   // runs over 0 .. height*depth - 1
y_index = gidy % height;                         // row within the slice
z_index = gidy / height;                         // slice index
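Put together, here is a sketch of what the kernel and the launch could look like with the z slices folded into the grid’s y dimension (this reuses the iDivUp helper and the variable names from your cudaTest(); it is only one possible mapping, untested):

__global__ void testKernel(cudaPitchedPtr v, int width, int height, int depth)
{
	unsigned int x    = blockIdx.x*blockDim.x + threadIdx.x;
	unsigned int gidy = blockIdx.y*blockDim.y + threadIdx.y;   // folded y/z index
	unsigned int y    = gidy % height;                         // row within the slice
	unsigned int z    = gidy / height;                         // slice index
	if (x >= width || z >= depth) return;                      // skip the padding threads

	char  *devPtr     = (char *)v.ptr;
	size_t slicePitch = v.pitch * height;
	float *row        = (float *)(devPtr + z*slicePitch + y*v.pitch);
	row[x] = x + y + z;
}

// launch with a 2D block and a 2D grid (gridDim.z stays 1):
dim3 threadBlock(8, 8);
dim3 gridBlock(iDivUp(width_, threadBlock.x),
               iDivUp(height_*depth_, threadBlock.y));
testKernel<<<gridBlock, threadBlock>>>(vGPU_, width_, height_, depth_);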

I changed platform, so now I’m using a Quadro 5000 (which should be compute capability 2.0) on Win7 64-bit. I left the structure of the code unchanged; however, now I get cudaErrorUnknown immediately after the kernel call, despite the 3D sizes of gridBlock and threadBlock being correct, (48,36,4) and (8,8,8) respectively. Is there any way of finding out what the problem might be?