cudaMemcpy2D example?

I was looking through the programming tutorial and best practices guide. They mention cudaMemcpy2D only briefly and do not explain it completely. I have searched the C/src/ directory for examples but cannot find any, and I found very few references to it on this forum.
I wanted to know if there is a clear example of this function and if it is necessary to use this function in conjunction with cudaMallocPitch()?
Thanks and Regards.

Check out the Reference Manual or the online documentation.

Yeah, I saw that. However, I cannot get the following code working. What I intended to do was copy a 760 by 760 host array, which would be inefficient to access, into a 768 by 768 device array, which would be efficient for my device of compute capability 1.2 (GT 230M with 6 SMs, hence the 128*6).

Could you please take a look at it? I would be glad to finally understand this function.




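#include <stdio.h>
#include <stdlib.h>

#define N 760      // side of matrix containing data
#define PDIM 768   // padded dimensions
#define TPB 128    // threads per block
#define INDEX 190  // verification index
#define DIV 6      // blocks per padded row (PDIM / TPB)

// load element from da to db to verify correct memcopy
__global__ void kernel(float * da, float * db)
{
	int tid = blockDim.x * blockIdx.x + threadIdx.x;
	int row = blockIdx.x / DIV;

	// skip the padding columns and the padding rows at the bottom of da
	if (tid % PDIM < N && row < N)
	{
		db[row*N + (blockIdx.x%DIV)*blockDim.x + threadIdx.x] = da[tid];
	}
}

void verify(float * A, float * B, int size);
void init(float * array, int size);

int main(int argc, char * argv[])
{
	float * A, *dA, *B, *dB;
	A = (float *)malloc(sizeof(float)*N*N);
	B = (float *)malloc(sizeof(float)*N*N);

	init(A, N*N);
	printf("\n%f ", A[INDEX]);

	cudaMalloc(&dA, sizeof(float)*PDIM*PDIM);
	cudaMalloc(&dB, sizeof(float)*N*N);

	// copy memory from unpadded array A of 760 by 760 dimensions
	// to more efficient dimensions of 768 by 768 on the device
	// (the call is missing from the post; reconstructed here with pitches
	// and width given in bytes, as the reply below points out)
	cudaMemcpy2D(dA, PDIM*sizeof(float), A, N*sizeof(float),
	             N*sizeof(float), N, cudaMemcpyHostToDevice);

	int threadsperblock = TPB;
	int blockspergrid = PDIM*PDIM/threadsperblock;

	// the launch is also missing from the post; assumed from the variables above
	kernel<<<blockspergrid, threadsperblock>>>(dA, dB);

	cudaMemcpy(B, dB, sizeof(float)*N*N, cudaMemcpyDeviceToHost);

	printf("->%f\n", B[INDEX]);
	verify(A, B, N*N);

	cudaFree(dA);
	cudaFree(dB);
	free(A);
	free(B);
	return 0;
}

void init(float * array, int size)
{
	for (int i = 0; i < size; i++)
	{
		array[i] = i;
	}
}

// the body was cut off in the post; an element-wise comparison is assumed
void verify(float * A, float * B, int size)
{
	for (int i = 0; i < size; i++)
	{
		if (A[i] != B[i])
		{
			printf("mismatch at %d: %f != %f\n", i, A[i], B[i]);
			return;
		}
	}
}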


Widths and pitches are in bytes, not number of elements (the latter would not work because cudaMemcpy2D() does not know the element size).
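To make the byte units concrete, here is a minimal sketch of a round trip through a padded device buffer. The helper name roundTripPadded and its parameters are made up for illustration; only cudaMalloc, cudaMemcpy2D, and cudaFree are actual runtime calls. Note that every pitch and the width are multiplied by sizeof(float), while the height stays a plain row count.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: copies a rows x cols float matrix into a device buffer
// whose rows are padded to paddedCols elements, copies it back, and compares.
// Returns the number of mismatching elements, or -1 on a CUDA error.
int roundTripPadded(int rows, int cols, int paddedCols)
{
    size_t hostPitch  = cols * sizeof(float);        // bytes per host row
    size_t devPitch   = paddedCols * sizeof(float);  // bytes per device row
    size_t widthBytes = cols * sizeof(float);        // bytes copied per row

    float *src = (float *)malloc(hostPitch * rows);
    float *dst = (float *)malloc(hostPitch * rows);
    for (int i = 0; i < rows * cols; i++) {
        src[i] = (float)i;
        dst[i] = -1.0f;
    }

    float *dev = NULL;
    cudaError_t err = cudaMalloc((void **)&dev, devPitch * rows);

    // host -> device: destination rows are devPitch bytes apart
    if (err == cudaSuccess)
        err = cudaMemcpy2D(dev, devPitch, src, hostPitch,
                           widthBytes, rows, cudaMemcpyHostToDevice);

    // device -> host: the pitches swap roles, the width stays the same
    if (err == cudaSuccess)
        err = cudaMemcpy2D(dst, hostPitch, dev, devPitch,
                           widthBytes, rows, cudaMemcpyDeviceToHost);

    int mismatches = 0;
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        mismatches = -1;
    } else {
        for (int i = 0; i < rows * cols; i++)
            if (src[i] != dst[i]) mismatches++;
    }

    cudaFree(dev);
    free(src);
    free(dst);
    return mismatches;
}

int main(void)
{
    // 760 x 760 data stored in rows padded to 768 floats, as in the question
    int bad = roundTripPadded(760, 760, 768);
    printf("mismatches: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

Passing 768 and 760 directly as pitches would describe rows of 768 and 760 bytes, which is why the element-count version does not produce the expected data.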

Amazing. Thanks a ton. I cannot believe that I was making such a mistake.

I have another question though, if you don’t mind. Is this a legitimate method to avoid possible uncoalesced accesses from a two dimensional matrix? Also, would it make more sense to use this in conjunction with cudaMallocPitch() as opposed to a pseudo two dimensional array?

Thanks and Regards.

Yes, cudaMallocPitch() is exactly meant to easily find the appropriate alignment and pitch for the current device to avoid uncoalesced accesses.
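As a rough sketch of how the two fit together: cudaMallocPitch() returns the pitch in bytes, which can be passed straight to cudaMemcpy2D() and used inside the kernel to locate each row. The kernel scaleRows and the helper pitchedScaleCheck are hypothetical names chosen for this example, and the block size of 128 is just an assumption carried over from the question.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative kernel: multiplies every element of a pitched matrix by factor.
// Rows are pitch bytes apart, so the row pointer is computed on a char pointer.
__global__ void scaleRows(float *base, size_t pitch, int rows, int cols, float factor)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y;
    if (row < rows && col < cols) {
        float *rowPtr = (float *)((char *)base + row * pitch);
        rowPtr[col] *= factor;
    }
}

// Hypothetical helper: returns the number of wrong elements, or -1 on a CUDA error.
int pitchedScaleCheck(int rows, int cols)
{
    size_t widthBytes = cols * sizeof(float);
    float *host = (float *)malloc(widthBytes * rows);
    for (int i = 0; i < rows * cols; i++) host[i] = (float)i;

    float *dev = NULL;
    size_t pitch = 0;  // filled in by the runtime, in bytes
    cudaError_t err = cudaMallocPitch((void **)&dev, &pitch, widthBytes, rows);

    if (err == cudaSuccess)
        err = cudaMemcpy2D(dev, pitch, host, widthBytes,
                           widthBytes, rows, cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        dim3 block(128);
        dim3 grid((cols + block.x - 1) / block.x, rows);  // one grid row per matrix row
        scaleRows<<<grid, block>>>(dev, pitch, rows, cols, 2.0f);
        err = cudaGetLastError();
    }

    if (err == cudaSuccess)
        err = cudaMemcpy2D(host, widthBytes, dev, pitch,
                           widthBytes, rows, cudaMemcpyDeviceToHost);

    int wrong = 0;
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        wrong = -1;
    } else {
        for (int i = 0; i < rows * cols; i++)
            if (host[i] != 2.0f * (float)i) wrong++;
    }

    cudaFree(dev);
    free(host);
    return wrong;
}

int main(void)
{
    int wrong = pitchedScaleCheck(760, 760);
    printf("wrong elements: %d\n", wrong);
    return wrong == 0 ? 0 : 1;
}
```

Compared with hard-coding 768, this leaves the choice of padding to the runtime, so the same code should pick a suitable pitch on a different device.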