Can't get copyDeviceToHost to work with cudaMemcpy2D

icarus127 · November 13, 2009, 11:36pm

Hi all, I’m just starting out with CUDA, with the eventual goal of using it for my thesis work. I’ve read the programmer’s manual, a good chunk of the best practices guide, and a bunch of other things. I feel kind of silly asking this question, but I can’t get cudaMemcpy2D to work. I’ve searched for threads about using 2D arrays with cudaMallocPitch etc., and all the replies I’ve seen to other people boil down to “Manage the pitch yourself; a 2D array is just compiler syntactic sugar”.

While I fully understand how to do this (I’ve done it as an exercise), I want to at least understand how to use cudaMemcpy2D, because cudaMallocPitch will, according to the manual, automatically do my byte alignment for me. So my first question is: why is the above answer what everyone says? Is my thinking wrong that it would be more optimal to use cudaMallocPitch for the byte alignment? Secondly, why does this code not work? I expect to see a 2D array returned with the array index in each field. I can see that the values are correct in my kernel, but after the copy back the host array holds junk. I’m checking every call for errors and not getting anything.

Even if using cudaMallocPitch is pointless, I would appreciate any input from someone with more experience, if only to understand what I’m doing wrong.

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

void CheckForCudaError()
{
  cudaError_t myError = cudaGetLastError();
  if(myError != cudaSuccess)
  {
    printf("cuda error: %s\n", cudaGetErrorString(myError));
    exit(1);
  }
}

__global__ void
PrintThreadIndexes(float *dFloatPtr, int pitch)
{
  float* row = (float*)(((char*)dFloatPtr) + pitch * threadIdx.y);
  row[threadIdx.x] = threadIdx.x + threadIdx.y * blockDim.x;
  //printf("threadIdx.x = %d, threadIdx.y = %d, blockDim.x = %d\n", threadIdx.x, threadIdx.y, blockDim.x);
  printf("row[%d][%d] = %02f\n", threadIdx.y, threadIdx.x, row[threadIdx.x]);
}

int main(int argv, char** argc)
{
  if(argv != 2)
  {
    printf("Usage: %s <blockDim>\n");
    exit(1);
  }

  size_t blockDim = atoi(argc[1]);

  float *devicePtr = 0;
  size_t pitch;
  cudaMallocPitch((void**)&devicePtr, &pitch, blockDim * sizeof(float), blockDim);
  CheckForCudaError();

  dim3 blockVector(blockDim, blockDim);
  PrintThreadIndexes<<<1, blockVector>>>(devicePtr, pitch);
  CheckForCudaError();

  int hostPitch = blockDim * sizeof(float);
  float *hostArray = (float*)malloc(blockDim * hostPitch);
  cudaMemcpy2D(hostArray, hostPitch, devicePtr, pitch, blockDim, blockDim, cudaMemcpyDeviceToHost);
  cudaThreadSynchronize();
  CheckForCudaError();

  int i, j;
  for(i = 0; i < blockDim; ++i)
  {
    for(j = 0; j < blockDim; ++j)
    {
      printf("[%02f]", hostArray[i * hostPitch + j]);
    }
    printf("\n");
  }
  printf("\n");

  free(hostArray);
  cudaFree(devicePtr);

  return 0;
}

When I run with ./CopyTest 4 I get this:

bash-3.1$ ./CopyTest 4
row[0][0] = 0.000000
row[0][1] = 1.000000
row[0][2] = 2.000000
row[0][3] = 3.000000
row[1][0] = 4.000000
row[1][1] = 5.000000
row[1][2] = 6.000000
row[1][3] = 7.000000
row[2][0] = 8.000000
row[2][1] = 9.000000
row[2][2] = 10.000000
row[2][3] = 11.000000
row[3][0] = 12.000000
row[3][1] = 13.000000
row[3][2] = 14.000000
row[3][3] = 15.000000
[0.000000][0.000000][0.000000][0.000000]
[-0.000028][0.000000][0.000000][0.000000]
[0.000000][-0.000028][0.000000][0.000000]
[0.000000][-0.000028][0.000000][0.000000]

I’m running under device emulation mode right now, so the kernel’s printf shows that the values are correct on the device side, but the print from the copy destination shows only garbage. When I run it on a real device I get a completely zeroed array. I assume I’m doing something dumb that I don’t understand, and any help, comments, or pointers to useful examples would be much appreciated.

Thanks in advance.
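
A possible explanation, offered as a sketch rather than a confirmed diagnosis: the width argument of cudaMemcpy2D is measured in bytes, not elements, so passing blockDim would copy only blockDim bytes of each row. Separately, the print loop indexes hostArray with hostPitch, which is a byte count, so for every row after the first it reads past the end of the buffer. The host-only C below imitates the copy with plain memcpy so the effect can be checked without a GPU. The names copy2d and count_correct and the 64-byte padding are assumptions made for illustration; they are not part of the CUDA API, and the real pitch is whatever cudaMallocPitch returns.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical host-only stand-in for cudaMemcpy2D (not the CUDA API):
   copies `height` rows of `width_bytes` bytes each. Both pitches are
   row strides in bytes, and the width is in bytes too. */
void copy2d(void *dst, size_t dpitch, const void *src, size_t spitch,
            size_t width_bytes, size_t height)
{
    assert(width_bytes <= dpitch && width_bytes <= spitch);
    for (size_t r = 0; r < height; ++r)
        memcpy((char *)dst + r * dpitch,
               (const char *)src + r * spitch, width_bytes);
}

/* Builds a padded n x n source the way the kernel fills it, copies it into
   a zeroed, tightly packed host array using the given width, and returns
   how many elements hold the expected value x + y * n. */
size_t count_correct(size_t n, size_t width_bytes)
{
    size_t row_bytes = n * sizeof(float);
    size_t spitch = (row_bytes + 63) / 64 * 64; /* assumed 64-byte padding */
    char *src = (char *)calloc(n, spitch);
    float *host = (float *)calloc(n * n, sizeof(float));
    size_t ok = 0;

    if (src == NULL || host == NULL) {
        free(src);
        free(host);
        return 0;
    }

    for (size_t y = 0; y < n; ++y) {
        float *row = (float *)(src + spitch * y); /* pitch is in bytes */
        for (size_t x = 0; x < n; ++x)
            row[x] = (float)(x + y * n);
    }

    copy2d(host, row_bytes, src, spitch, width_bytes, n);

    for (size_t y = 0; y < n; ++y)
        for (size_t x = 0; x < n; ++x)
            if (host[y * n + x] == (float)(x + y * n)) /* index in elements */
                ++ok;

    free(src);
    free(host);
    return ok;
}
```

With n = 4, count_correct(4, 4 * sizeof(float)) reports all 16 elements as correct, while count_correct(4, 4) reports only 4, because just the first float of each row is copied. If this reading is right, the two changes to the program above would be to pass blockDim * sizeof(float) as the width argument of cudaMemcpy2D and to index the result as hostArray[i * blockDim + j].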
