cudaMemcpy2D, simple code but it doesn't work

Hi everybody,
I need to initialize a matrix on the device using cudaMemset and then copy it to the host.
This is the code:

int width, height;
height = nij;
width = nspares;
double *devPtr, *hostPtr;
size_t d_pitch, h_pitch ;

// allocation: one pointer per row, then each row separately
datoin->prec_ini = (double **)malloc(nij * sizeof(double *));
for (ij = 0; ij < nij; ij++) {
    datoin->prec_ini[ij] = (double *)malloc(nspares * sizeof(double));
}
cudaMallocPitch((void**)&devPtr, &d_pitch, width * sizeof(double), height);

hostPtr = (double *)malloc(sizeof(double)*nij*nspares);
cudaMemset(devPtr, 0, width*height*sizeof(double));

cudaMemcpy2D(datoin->prec_ini, width*sizeof(double), devPtr, d_pitch, width * sizeof(double), height, cudaMemcpyDeviceToHost); 

But cudaMemcpy2D causes a segmentation fault.
I don't want to flatten everything; nij can be as large as 300,000, and I need to understand where the problem is.
Can anybody help me?


The 2D matrix on the host appears to be a collection of independently allocated rows, plus a vector of pointers datoin->prec_ini, each element of which points to the start of one row. This means storage for the matrix is non-contiguous.

cudaMemcpy2D() expects the rows of the 2D matrix to be stored contiguously, and expects to be passed a pointer to the start of the first row. Instead, the code passes a pointer to the array of row pointers. That array holds only nij pointers, far less storage than the nij x nspares doubles the copy writes, so the copy runs out of bounds on the host side and causes the segmentation fault.
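If the reason for not flattening is to keep the prec_ini[ij][k] indexing, one possible layout is a single contiguous block plus a row-pointer array that points into it. The sketch below is only an illustration under that assumption: the helper names alloc_matrix_contiguous, free_matrix_contiguous and copy_pitched_to_contiguous are hypothetical and not part of the original code, and height is assumed to be greater than zero.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

/* Hypothetical helper: allocates height x width doubles as ONE block and
   returns an array of row pointers into it, so m[i][j] still works and
   m[0] is the start of the contiguous storage. Returns NULL on failure. */
static double **alloc_matrix_contiguous(int height, int width)
{
    double **rows = (double **)malloc((size_t)height * sizeof(double *));
    double *data = (double *)malloc((size_t)height * width * sizeof(double));
    if (rows == NULL || data == NULL) {
        free(rows);
        free(data);
        return NULL;
    }
    for (int i = 0; i < height; i++)
        rows[i] = data + (size_t)i * width;   /* row i starts i*width doubles in */
    return rows;
}

/* Frees a matrix obtained from alloc_matrix_contiguous. */
static void free_matrix_contiguous(double **rows)
{
    if (rows != NULL) {
        free(rows[0]);   /* the data block */
        free(rows);      /* the row-pointer array */
    }
}

/* One 2D copy from pitched device memory. The host pitch is
   width * sizeof(double); the device pitch is whatever cudaMallocPitch
   returned. Note that rows[0] is passed, not rows. */
static cudaError_t copy_pitched_to_contiguous(double **rows,
                                              const double *devPtr,
                                              size_t d_pitch,
                                              int width, int height)
{
    return cudaMemcpy2D(rows[0], width * sizeof(double),
                        devPtr, d_pitch,
                        width * sizeof(double), height,
                        cudaMemcpyDeviceToHost);
}
```

With this layout the call in the question would become copy_pitched_to_contiguous(datoin->prec_ini, devPtr, d_pitch, width, height), and the rest of the host code could keep using datoin->prec_ini[ij][k].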

To copy between this sort of non-contiguously allocated 2D matrix on the host and a contiguously stored 2D matrix on the device, the code will have to copy each row of the matrix individually.
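A minimal sketch of that row-by-row copy, for the device-to-host direction used in the question. The helper name copy_pitched_to_rows is hypothetical; it assumes devPtr and d_pitch come from cudaMallocPitch and that every rows[i] points to at least width doubles.

```cuda
#include <cuda_runtime.h>
#include <stddef.h>

/* Hypothetical helper (not from the original post): copies a pitched device
   matrix into a host matrix whose rows were allocated separately. */
static cudaError_t copy_pitched_to_rows(double **rows, const double *devPtr,
                                        size_t d_pitch, int width, int height)
{
    for (int i = 0; i < height; i++) {
        /* d_pitch is in BYTES, so step through the device buffer with a
           byte pointer rather than a double pointer. */
        const char *src = (const char *)devPtr + (size_t)i * d_pitch;
        cudaError_t err = cudaMemcpy(rows[i], src, width * sizeof(double),
                                     cudaMemcpyDeviceToHost);
        if (err != cudaSuccess)
            return err;   /* stop at the first failing row */
    }
    return cudaSuccess;
}
```

In the question's code this would be called as copy_pitched_to_rows(datoin->prec_ini, devPtr, d_pitch, width, height). Keep in mind that with nij around 300,000 this issues one small transfer per row, which is likely to be much slower than a single 2D copy into contiguous host storage.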

Thank you very much for the answer, fast, clear and very useful!