Can't get copyDeviceToHost to work with cudaMemcpy2D

Hi all, I’m just starting out with CUDA, with the eventual goal of using it for my thesis work. I’ve read the programmer’s manual, a good chunk of the best practices guide, and a bunch of other things. I feel kind of silly asking this question, but I can’t get cudaMemcpy2D to work. I’ve searched for threads about using 2D arrays with cudaMallocPitch etc., and all the replies I’ve seen to other people boil down to “Manage the pitch yourself; a 2D array is just compiler syntactic sugar”.
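For concreteness, "managing the pitch yourself" usually comes down to padding each row to a multiple of some alignment and then addressing rows by that padded stride. The sketch below shows only the rounding step in plain C. The function name `padded_pitch` is hypothetical, and the alignment is a caller-supplied assumption; it is not a claim about the value cudaMallocPitch actually chooses.

```c
#include <assert.h>
#include <stddef.h>

/* Round a row width (in bytes) up to the next multiple of `alignment`.
   `alignment` is an assumed, caller-chosen value, not a CUDA constant. */
size_t padded_pitch(size_t width_bytes, size_t alignment)
{
    assert(alignment > 0);
    return (width_bytes + alignment - 1) / alignment * alignment;
}
```

With an assumed alignment of 256 bytes, a row of four floats (16 bytes) would occupy a 256-byte stride, so row `r` would start at byte offset `r * 256` from the base pointer.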

While I fully understand how to do this (I’ve done it as an exercise), I want to at least understand how to use cudaMemcpy2D, because cudaMallocPitch will, according to the manual, automatically handle the byte alignment for me. So my first question is: why is the above answer what everyone says? Is my thinking that it would be more efficient to use cudaMallocPitch for the byte alignment wrong? Secondly, why does this code not work? I expect to see a 2D array returned with the array index in each field. I can see that the values are correct in my kernel, but after the copy back the host array holds junk. I’m checking every call for errors and not getting anything.

Even if using cudaMallocPitch is pointless, I would appreciate any input from someone with more experience, if only to understand what I’m doing wrong.

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

void CheckForCudaError()
{
  cudaError_t myError = cudaGetLastError();
  if(myError != cudaSuccess)
  {
    printf("cuda error: %s\n", cudaGetErrorString(myError));
  }
}

__global__ void
PrintThreadIndexes(float *dFloatPtr, int pitch)
{
  // pitch is in bytes, so step to the start of this thread's row via char*
  float* row = (float*)(((char*)dFloatPtr) + pitch * threadIdx.y);
  row[threadIdx.x] = threadIdx.x + threadIdx.y * blockDim.x;
  //printf("threadIdx.x = %d, threadIdx.y = %d, blockDim.x = %d\n", threadIdx.x, threadIdx.y, blockDim.x);
  printf("row[%d][%d] = %02f\n", threadIdx.y, threadIdx.x, row[threadIdx.x]);
}

int main(int argv, char** argc)
{
  if(argv != 2)
  {
    printf("Usage: %s <blockDim>\n", argc[0]);
    return 1;
  }

  size_t blockDim = atoi(argc[1]);
  float *devicePtr = 0;
  size_t pitch;
  cudaMallocPitch((void**)&devicePtr, &pitch, blockDim * sizeof(float), blockDim);

  dim3 blockVector(blockDim, blockDim);
  PrintThreadIndexes<<<1, blockVector>>>(devicePtr, pitch);

  int hostPitch = blockDim * sizeof(float);
  float *hostArray = (float*)malloc(blockDim * hostPitch);
  cudaMemcpy2D(hostArray, hostPitch, devicePtr, pitch, blockDim, blockDim, cudaMemcpyDeviceToHost);

  int i, j;
  for(i = 0; i < blockDim; ++i)
  {
    for(j = 0; j < blockDim; ++j)
    {
      printf("[%02f]", hostArray[i * hostPitch + j]);
    }
  }

  return 0;
}

When I run with ./CopyTest 4 I get this:

bash-3.1$ ./CopyTest 4
row[0][0] = 0.000000
row[0][1] = 1.000000
row[0][2] = 2.000000
row[0][3] = 3.000000
row[1][0] = 4.000000
row[1][1] = 5.000000
row[1][2] = 6.000000
row[1][3] = 7.000000
row[2][0] = 8.000000
row[2][1] = 9.000000
row[2][2] = 10.000000
row[2][3] = 11.000000
row[3][0] = 12.000000
row[3][1] = 13.000000
row[3][2] = 14.000000
row[3][3] = 15.000000

I’m running under device emulation mode right now, so we can see that the values are correct in the kernel, but the print from the copy destination shows only garbage. When I run it on a device I get a completely zeroed array. I assume I’m doing something dumb that I don’t understand, so any help, comments, or pointers to useful examples would be much appreciated.
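For comparison, cudaMemcpy2D is documented as taking its width argument in bytes, with both pitches also in bytes, whereas the listing above passes blockDim (an element count) as the width and multiplies the host row index by hostPitch (a byte count) when indexing a float pointer. The host-only C sketch below models those semantics without a GPU; `copy2d_bytes` and `pitched_elem` are hypothetical helper names, and this is an illustration of the byte-based arithmetic rather than the CUDA implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Host-only model of a pitched 2D copy: copies `height` rows of
   `width_bytes` bytes each. Both pitches are row strides in bytes. */
void copy2d_bytes(void *dst, size_t dst_pitch,
                  const void *src, size_t src_pitch,
                  size_t width_bytes, size_t height)
{
    assert(width_bytes <= dst_pitch && width_bytes <= src_pitch);
    for (size_t row = 0; row < height; ++row) {
        memcpy((char *)dst + row * dst_pitch,
               (const char *)src + row * src_pitch,
               width_bytes);
    }
}

/* Address of element (row, col) in a pitched float array:
   step rows in bytes, then step columns in elements. */
float *pitched_elem(void *base, size_t pitch_bytes, size_t row, size_t col)
{
    return (float *)((char *)base + row * pitch_bytes) + col;
}
```

Under these assumptions, the equivalent call in the listing would pass `blockDim * sizeof(float)` as the width, and the host print loop would read `hostArray[i * blockDim + j]`, since a tightly packed host array has a stride of blockDim elements, not hostPitch elements.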

Thanks in advance.