cudaMemcpy2D To Host


I am trying to copy a 2D array of doubles to the device, and there seems to be a problem when I copy the data back…

Inverse(double **a1, double *h_h, int pos, int N) {

	// Allocate device memory
	Error = cudaMallocPitch((void**)&d_binv, &pd_binv, N * sizeof(double), N);
	Error = cudaMallocPitch((void**)&d_eta, &pd_eta, N * sizeof(double), N);
	Error = cudaMalloc((void**)&d_h, size);
	Error = cudaMallocPitch((void**)&d_y, &pd_y, N * sizeof(double), N);

	// Copy vectors from host memory to device memory
	Error = cudaMemcpy(d_h, h_h, size, cudaMemcpyHostToDevice);
	Error = cudaMemcpy2D(d_binv, pd_binv, a1, N * sizeof(double), N * sizeof(double), N, cudaMemcpyHostToDevice);

	// Copy the result back to the host
	Error = cudaMemcpy2D(a1, N * sizeof(double), d_y, pd_y, N * sizeof(double), N, cudaMemcpyDeviceToHost); // This is where I get the problem


Do you have any suggestions?


Pretty enigmatic…
Did you remember to allocate memory for a1?


Thanks for the reply, Greg.

All the variables are allocated. The problem is with the double pointer: is this the correct way to pass it to the GPU?

No, it isn’t. cudaMemcpy2D is designed for copying to and from pitched, linear memory, where every row sits at a fixed byte offset (the pitch) from the previous one inside a single allocation. A double ** is an array of pointers to separately allocated rows, so it does not fit that layout. There is no “deep” copy function in the API for copying arrays of pointers and what they point to. You will need a separate memcpy operation for each pointer held in a1. Generally speaking, it is preferable to use linear memory with indexing when working with memory which needs to be portable between the host and device. It reduces the copy overhead considerably (one transfer instead of one per row), and on the GPU, an integer multiply-add per read is cheaper than the alternatives (like dereferencing several levels of pointer indirection).
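To make the "separate memcpy per pointer" point concrete, here is a minimal sketch of copying an array of row pointers into a pitched allocation and back, one row at a time. The helper names copyRowsToDevice and copyRowsToHost are made up for illustration, and the sketch assumes d_a came from cudaMallocPitch with a width of N * sizeof(double) and a height of N.

```cuda
#include <cuda_runtime.h>
#include <stddef.h>

// Hypothetical helper: upload an N x N matrix stored as an array of row
// pointers (double **a) into a pitched device allocation, one row per copy.
cudaError_t copyRowsToDevice(double *d_a, size_t pitch, double **a, int N)
{
    for (int i = 0; i < N; ++i) {
        // Row i starts i * pitch bytes after the base of the allocation.
        char *d_row = (char *)d_a + (size_t)i * pitch;
        cudaError_t err = cudaMemcpy(d_row, a[i], N * sizeof(double),
                                     cudaMemcpyHostToDevice);
        if (err != cudaSuccess) return err;
    }
    return cudaSuccess;
}

// Hypothetical helper: download the pitched device matrix back into the
// separately allocated host rows, again one row per copy.
cudaError_t copyRowsToHost(double **a, const double *d_a, size_t pitch, int N)
{
    for (int i = 0; i < N; ++i) {
        const char *d_row = (const char *)d_a + (size_t)i * pitch;
        cudaError_t err = cudaMemcpy(a[i], d_row, N * sizeof(double),
                                     cudaMemcpyDeviceToHost);
        if (err != cudaSuccess) return err;
    }
    return cudaSuccess;
}
```

This works, but it issues N small transfers in each direction, which is the overhead that a single linear buffer avoids.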

So the best thing to do is to turn it into a vector, to gain performance, etc…


Thanks again!

Yes, “flatten” it into a piece of linear memory and just copy the whole thing to the GPU. Use the same column- or row-major indexing scheme to access the memory on both host and device and it should “just work”.
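A rough sketch of what flattening could look like, assuming row-major storage and a square N x N matrix. The names idx, scaleByTwo and scaleOnDevice are hypothetical, and the kernel is only a stand-in for whatever work the real code does; the point is that the host and the device use the same indexing function.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

// Row-major index, shared by host and device code.
__host__ __device__ inline size_t idx(int row, int col, int N)
{
    return (size_t)row * N + col;
}

// Illustrative kernel: doubles every element of an N x N matrix in place.
__global__ void scaleByTwo(double *m, int N)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N)
        m[idx(row, col, N)] *= 2.0;
}

// Hypothetical helper: flatten a (double **), process it on the device with
// one copy in each direction, then write the result back into the rows.
cudaError_t scaleOnDevice(double **a, int N)
{
    size_t bytes = (size_t)N * N * sizeof(double);
    double *h_flat = (double *)malloc(bytes);
    if (!h_flat) return cudaErrorMemoryAllocation;

    // Flatten: copy each row into its slot in the linear buffer.
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            h_flat[idx(i, j, N)] = a[i][j];

    double *d_flat = NULL;
    cudaError_t err = cudaMalloc((void **)&d_flat, bytes);
    if (err == cudaSuccess)
        err = cudaMemcpy(d_flat, h_flat, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        dim3 block(16, 16);
        dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
        scaleByTwo<<<grid, block>>>(d_flat, N);
        err = cudaGetLastError();  // reports launch failures
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(h_flat, d_flat, bytes, cudaMemcpyDeviceToHost);
    if (err == cudaSuccess)
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                a[i][j] = h_flat[idx(i, j, N)];

    cudaFree(d_flat);  // freeing a null pointer is a no-op
    free(h_flat);
    return err;
}
```

If the host matrix is stored flat from the start, the flatten and unflatten loops disappear entirely. A flat host buffer is also a valid source or destination for cudaMemcpy2D with a host pitch of N * sizeof(double), should the device side be allocated with cudaMallocPitch.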

Sorry, but I thought deep copying was usually meant for structs. Isn’t a matrix always seen as an array of pointers?

And if it is, when can I use cudaMemcpy2D?