cudaMemcpy3D and cudaMalloc cudaDeviceToDevice copies

I am struggling with cudaMemcpy3D with CUDA 3.0. The call usually fails. After a lot of trial and error (not much is to be found on the forums) I think I found the reason why… however I am still puzzled why things don’t work the way I think they should. Let me give a little example:

cudaArray* dst;

float* src;

cudaExtent e = make_cudaExtent(52, 75, 165);

cudaChannelFormatDesc channelDescCoeff = cudaCreateChannelDesc<float>();

cudaMalloc3DArray(&dst, &channelDescCoeff, e); /* destination */

cudaMalloc(&src, e.width * e.height * e.depth * sizeof(float)); /* source */

cudaPitchedPtr pitched_src = make_cudaPitchedPtr(src, e.width * sizeof(float), e.width, e.height);

cudaMemcpy3DParms copyParams = {0};

copyParams.extent   = e;

copyParams.kind	 = cudaMemcpyDeviceToDevice;

copyParams.dstArray = dst;

copyParams.srcPtr   = pitched_src;

cudaMemcpy3D(&copyParams);

The code is as suggested on the forums and anything I could find through Google. This fails to copy the data if cudaExtent.width is not a power of 2!!

Before continuing, let me make clear that if I do the same thing but for HostToDevice, everything works!
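
For reference, here is a minimal sketch of the HostToDevice variant I am talking about (same names and extent as above, error checking omitted):

float* h_src = (float*)malloc(e.width * e.height * e.depth * sizeof(float));
cudaMemcpy3DParms hostParams = {0};
hostParams.srcPtr   = make_cudaPitchedPtr(h_src, e.width * sizeof(float), e.width, e.height);
hostParams.dstArray = dst;
hostParams.extent   = e;
hostParams.kind     = cudaMemcpyHostToDevice;
cudaMemcpy3D(&hostParams); /* this variant succeeds for me */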

The whole trick, or magic, of the thing is in make_cudaPitchedPtr. Apparently it should be the following:

int pitch = next_pow_of_2(e.width * sizeof(float));

cudaPitchedPtr pitched_src = make_cudaPitchedPtr(src, pitch, e.width * sizeof(float), e.height);

So, if e.width is 52, the row is 208 bytes and the pitch is 256. But in this case I am actually copying more data than I have allocated, so I should instead use the following:

cudaPitchedPtr pitched_src;

cudaExtent e_new = make_cudaExtent(e.width  * sizeof(float), e.height, e.depth);

cudaMalloc3D(&pitched_src, e_new);

cudaMemcpy3DParms copyParams = {0};

copyParams.extent   = e;

copyParams.kind	 = cudaMemcpyDeviceToDevice;

copyParams.dstArray = dst;

copyParams.srcPtr   = pitched_src;

cudaMemcpy3D(&copyParams);

So, can I, or can I not, have memory allocated through cudaMalloc() participate in a cudaMemcpy3D? It seems not, if the width is a non-power-of-two. However, a HostToDevice copy from a simple malloc() does work. What are our options? I do want to use cudaMalloc(), as some external kernels access this data raw without taking pitch into account.
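
For completeness, here is a rough sketch of what I mean by kernels having to honour the pitch if I switch to cudaMalloc3D; this kernel is my own illustration (not from the docs) and assumes the extent's width is given in bytes:

__global__ void scale_raw(cudaPitchedPtr p, cudaExtent e)
{
	// Sketch: one block per (row, slice); honour the pitch when stepping between rows.
	int y = blockIdx.x;
	int z = blockIdx.y;
	if (y >= e.height || z >= e.depth) return;
	char* slice = (char*)p.ptr + z * p.pitch * e.height;
	float* row  = (float*)(slice + y * p.pitch);
	for (int x = threadIdx.x; x < e.width / sizeof(float); x += blockDim.x)
		row[x] *= 2.0f;   // touch only the e.width bytes of real data in each row
}

launched as, e.g., scale_raw<<<dim3(e_new.height, e_new.depth), 64>>>(pitched_src, e_new); with pitched_src and e_new from the cudaMalloc3D snippet above.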

Sorry for the bump, but does NVIDIA have an opinion on this?

This topic is all kinds of confusing, compounded by poor documentation on NVIDIA’s part and, even worse, a bad example with vague references for cudaMemcpy3D, such as the Programming Guide Version 3.0 p. 20 (which is the bad example). I also noted that there are no references in the NVIDIA CUDA Best Practices Guide 3.0, even though it is recommended that this functionality be used.

The bad example

– snip from the NVIDIA Programming Guide –

The following code sample allocates a width×height×depth 3D array of floating-point values and shows how to loop over the array elements in device code:

–end snip –

err… huhhh… I am not sure that this statement (above) is correct, at least from reading the API documentation… but more on this later.

// Host code 

cudaPitchedPtr devPitchedPtr; 

cudaExtent extent = make_cudaExtent(64, 64, 64); 

cudaMalloc3D(&devPitchedPtr, extent); 

MyKernel<<<100, 512>>>(devPitchedPtr, extent); 

// Device code 

__global__ void MyKernel(cudaPitchedPtr devPitchedPtr, cudaExtent extent) 

{ 

  char* devPtr = devPitchedPtr.ptr; 

  size_t pitch = devPitchedPtr.pitch; 

  size_t slicePitch = pitch * extent.height; 

  for (int z = 0; z < extent.depth; ++z) 

  { 

	 char* slice = devPtr + z * slicePitch; 

	 for (int y = 0; y < extent.height; ++y) 

	 { 

		 float* row = (float*)(slice + y * pitch); 

		 for (int x = 0; x < extent.width; ++x) 

		 { 

			  float element = row[x];

		 }

	 }

   }

}

Now, is this 64x64x64 floats or 64x64x64 bytes? ← the 64-million-dollar question, or maybe the 64x64x64-million-dollar question.

Note that this code only reads values from the allocated CUDA memory into a variable called “element”… errr, well, at least 64/sizeof(float) x 64 x 64 of them come from the allocated memory; the rest come from random other places in GPU memory.

What would have happened to this code if it had set the row value instead? It turns out, from my experience on a Tesla C1060, Quadro 1700M, and GeForce 9800, that it 1) corrupts the video buffer and 2) crashes Win Server 2008 x64, Win 7 x64, and Win XP 32. Why? The key here is the range of values x can take on, the type of row (pointer to float), and the max value of extent.width (64 from the NVIDIA example, see make_cudaExtent(64,64,64)).
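
To put rough numbers on that overrun (my own back-of-the-envelope arithmetic, assuming extent.width really is in bytes):

// Under the reading that make_cudaExtent(64, 64, 64) requests 64-byte rows:
size_t requested_per_row = 64;                   // bytes asked for in each row
size_t touched_per_row   = 64 * sizeof(float);   // row[x], x in [0,63], row a float* -> 256 bytes
size_t overrun_per_row   = touched_per_row - requested_per_row;   // 192 bytes past the requested width, per row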

– rant begin … step on soapbox–

I love magic numbers like 64 when used in code, especially code meant to be used as an example to teach others, without a named variable assigned, like say x_dim = 64.

– rant end … step off soapbox –

From reading the API documentation for cudaPitchedPtr / cudaMemset3D:

Now, the important take-home points here are that pitch is in bytes and xsize is in bytes, while ysize is not in bytes and represents a dimension (an element count).

So if the documentation can be believed and I have interpreted it correctly (which I am not saying I have), then the NVIDIA example is not a 64x64x64 (cubed) array of floats, but rather a 64/sizeof(float) x 64 x 64, or 16x64x64, array of floats.

So how about cudaExtent? How do I define that?

From reading this, it looks as though I create an extent in terms of x_num_bytes * y_num_bytes * z_num_bytes, so would this mean:

cudaExtent myExtent = make_cudaExtent( xdim*sizeof(float), ydim*sizeof(float), zdim*sizeof(float));

//or maybe

cudaExtent myExtent = make_cudaExtent( xdim*sizeof(float)^(1/3), ydim*sizeof(float)^(1/3), zdim*sizeof(float)^(1/3));

//or maybe as seen in the forums

cudaExtent myExtent = make_cudaExtent( xdim*sizeof(float), ydim, zdim);

I use the third method, cudaExtent myExtent = make_cudaExtent( xdim*sizeof(float), ydim, zdim);. My point is only that this API call is confusing: if pitch only applies in one direction (x), what would be the purpose of specifying the y and z dimensions in bytes rather than elements?

So what does the MSVS debugger say about devPitchedPtr on a Tesla C1060 after calling:

cudaMalloc3D(&devPitchedPtr, extent);

Hmm now that’s weird … pitch = 256 not 1 ???

So what is devPitchedPtr if we try:

cudaExtent extent = make_cudaExtent(64*sizeof(float), 64, 64);

cudaMalloc3D(&devPitchedPtr, extent);

Hmm, pitch is still 256, but now xsize is 256, which leads me to believe this is correct based on the documentation for pitch and xsize in cudaMemset3D, in that they should both be in bytes and match in size.

Hmm… errr… but wait a minute, pitch is for memory-coalesced access to speed up transfers, so let’s try this:

cudaExtent extent = make_cudaExtent(65*sizeof(float), 64, 64);

cudaMalloc3D(&devPitchedPtr, extent);

Now pitch has changed from 256 to 512, as expected… What, you say… as expected? Well, yes: 64*4 = 256, but 65*4 = 260, so the pitch must change to the next multiple of the base pitch size, which is 2*256. Would there be gobs of wasted memory in this example? I believe so. If not, someone can correct me on this.
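
To put a rough number on the waste for the 65-element case above (my own arithmetic, assuming the allocation really is pitch x height x depth bytes):

size_t width_bytes = 65 * sizeof(float);     // 260 bytes of real data per row
size_t pitch       = 512;                    // what cudaMalloc3D returned here
size_t needed      = width_bytes * 64 * 64;  // ~1.0 MB actually needed
size_t allocated   = pitch * 64 * 64;        // 2.0 MB actually allocated
size_t wasted      = allocated - needed;     // ~1.0 MB of row padding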

So why did the NVIDIA code work for the programmer who wrote the documentation? My guess: likely the pitch returned by cudaMalloc3D was also 256, and 256/sizeof(float) = 256/4 = 64. It would have been nice to see the actual cudaMemcpy3D code, as I suspect there would or could be some alignment issues in the data returned: the code asked for 64 bytes per row but used 64*4 bytes per row in the CUDA kernel. This code probably only worked for reads of 64/4 x 64 x 64 of the allocated memory; the rest was likely reads from random GPU memory.

So what happens with the original NVIDIA example in this case? I modified it to set the row value instead:

__global__ void MyKernel(cudaPitchedPtr devPitchedPtr, cudaExtent extent)

{

	char* devPtr = (char*)devPitchedPtr.ptr;

	size_t pitch = devPitchedPtr.pitch;

	size_t slicePitch = pitch * extent.height;

	for (int z = 0; z < extent.depth; ++z)

	{

		char* slice = devPtr + z * slicePitch;

		for (int y = 0; y < extent.height; ++y)

		{

			float* row = (float*)(slice + y * pitch);

			for (int x = 0; x < extent.width; ++x)

			{

//				row[x] = x + y + z;

				row[x] = 1;

			}

		}

	}

	

	__syncthreads();

}

The answer as stated earlier is corruption of video memory and system crashes.

I modified this to what I think is correct:

__global__ void MyVersionOfKernel(cudaPitchedPtr devPitchedPtr, cudaExtent extent)

{

	// Cuda extent (x,y,z) in bytes!!!

	char* devPtr = (char*)devPitchedPtr.ptr;

	size_t pitch = devPitchedPtr.pitch;

	size_t slicePitch = pitch * extent.height;

	for (int z = 0; z < extent.depth; ++z)

	{

		char* slice = devPtr + z * slicePitch;

		for (int y = 0; y < extent.height; ++y)

		{

			float* row = (float*)(slice + y * pitch);

			// Since row is float* and extent.width IS IN BYTES!

			for (int x = 0; x < (extent.width)/sizeof(float); ++x)

			{

//				row[x] = x + y + z;

				row[x] = 1;

			}

		}

	}

	

	__syncthreads();

}

Note the modification of extent.width to (extent.width)/sizeof(float). The variable x still ranges from 0 to 63, and because row is a pointer to float, indexing with x advances the pointer by sizeof(float) each step. The key points are that extent.width is in bytes, that enough memory has to be allocated, and that the kernel has to index into that memory correctly!!! I allocate using make_cudaExtent(dimx * sizeof(T), dimy, dimz); and make_cudaPitchedPtr( (void*)mv_host_data_ptr, dimx * sizeof(T), dimy, dimz );, as the constructor of the template helper class I am using shows:

CUDAMemoryCopy3D( void* data, size_t dimx, size_t dimy, size_t dimz )

{

	mv_host_data_ptr = data;

	m_extent = make_cudaExtent(dimx * sizeof(T), dimy, dimz);

	cudaMalloc3D(&m_devPitchedPtr, m_extent);

	m_hostPitchedPtr = make_cudaPitchedPtr( (void*)mv_host_data_ptr, dimx * sizeof(T), dimy, dimz );

}
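
(The class is not shown in full; a hypothetical copy_to_device() method along the following lines is how I fill in the cudaMemcpy3DParms from those members; treat it as a sketch rather than tested code.)

void copy_to_device()
{
	cudaMemcpy3DParms p = {0};
	p.srcPtr = m_hostPitchedPtr;   // tightly packed host data, pitch = dimx * sizeof(T)
	p.dstPtr = m_devPitchedPtr;    // pitched device allocation from cudaMalloc3D
	p.extent = m_extent;           // width in bytes, height/depth in elements
	p.kind   = cudaMemcpyHostToDevice;
	cudaMemcpy3D(&p);
}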

Here’s a guy who must have said to himself, “my god, this is awful, let’s wrap a class around it”. Well, at least that’s what I was thinking when I googled around to find just this sort of thing. Too bad he did not think of allowing the programmer to specify a void* pointer to an already existing, byte-packed array of host data.

http://cudatemplates.sourceforge.net/doc/html/

// allocate device memory:

cudaExtent extent;

extent.width = SIZE[0];

extent.height = SIZE[1];

extent.depth = SIZE[2];

cudaPitchedPtr mem_device;

CUDA_CHECK(cudaMalloc3D(&mem_device, extent));

I don’t think the extent is defined properly, as extent.width should be extent.width = SIZE[0] * sizeof(float); or, for template code, extent.width = SIZE[0] * sizeof(T);
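
i.e., something like this (sketch only, reusing the snippet’s names):

cudaExtent extent;
extent.width  = SIZE[0] * sizeof(float);   // width in bytes for cudaMalloc3D
extent.height = SIZE[1];                   // rows, in elements
extent.depth  = SIZE[2];                   // slices, in elements
cudaPitchedPtr mem_device;
CUDA_CHECK(cudaMalloc3D(&mem_device, extent));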

Hopefully I have understood the API correctly and this does not simply add to the confusion. Hope this helps.

Looks as though this problem has already been discovered, see:

Error in CUDA programming guide?
