I am trying to copy a 2D array of doubles to the device, and there seems to be a problem when I get the data back…
void Inverse(double **a1, double *h_h, int pos, int N) {
[...]
Error = cudaMallocPitch((void**)&d_binv, &pd_binv, N * sizeof(double), N);
Error = cudaMallocPitch((void**)&d_eta, &pd_eta, N * sizeof(double), N);
Error = cudaMalloc((void**)&d_h, size);
Error = cudaMallocPitch((void**)&d_y, &pd_y, N * sizeof(double), N);
// Copy vectors from host memory to device memory
Error = cudaMemcpy(d_h, h_h, size, cudaMemcpyHostToDevice);
Error = cudaMemcpy2D(d_binv, pd_binv, a1, N * sizeof(double), N * sizeof(double), N, cudaMemcpyHostToDevice);
[...]
Error = cudaMemcpy2D(a1, N * sizeof(double), d_y, pd_y, N * sizeof(double), N, cudaMemcpyDeviceToHost); // Here is where I get the problem
}
No it isn’t. cudaMemcpy2D is designed for copying from pitched, linear memory sources. There is no “deep” copy function in the API for copying arrays of pointers and what they point to. You will need a separate memcpy operation for each pointer held in a1. Generally speaking, it is preferable to use linear memory with indexing when working with memory that needs to be portable between the host and device. It reduces the operation overhead considerably, and on the GPU, an integer multiply-add per read is cheaper than the alternatives (like dereferencing several levels of pointer indirection).
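If you really must keep a1 as an array of row pointers, the per-row copy looks like this — a minimal sketch, assuming a1 holds N separately allocated host rows of N doubles each, with d_binv and pd_binv as in your code:

// One cudaMemcpy per row pointer: the host rows are not contiguous,
// so a single cudaMemcpy2D reading from a1 itself cannot work.
for (int row = 0; row < N; ++row) {
    Error = cudaMemcpy((char*)d_binv + row * pd_binv, // start of this row in pitched device memory
                       a1[row],                       // host row (separate allocation)
                       N * sizeof(double),
                       cudaMemcpyHostToDevice);
}

That is N API calls instead of one, which is exactly the overhead the linear-memory approach avoids.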
Yes, “flatten” it into a piece of linear memory and copy the whole thing to the GPU in one transfer. Use the same column- or row-major indexing scheme to access the memory on both host and device, and it should “just work”.
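For example, a sketch of the flattened, row-major version (error checking omitted; h_a is a hypothetical contiguous host buffer, and a1 and N are from the question):

double *h_a = (double*)malloc(N * N * sizeof(double));
for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
        h_a[i * N + j] = a1[i][j]; // row-major: element (i,j) lives at i*N+j

double *d_a;
size_t pd_a;
cudaMallocPitch((void**)&d_a, &pd_a, N * sizeof(double), N);

// The host source is now linear, so a single 2D copy is valid.
cudaMemcpy2D(d_a, pd_a, h_a, N * sizeof(double),
             N * sizeof(double), N, cudaMemcpyHostToDevice);

// In a kernel, step between rows using the pitch in bytes:
//   double v = *((double*)((char*)d_a + i * pd_a) + j);

// Copying back works the same way, which replaces the failing call in the question.
cudaMemcpy2D(h_a, N * sizeof(double), d_a, pd_a,
             N * sizeof(double), N, cudaMemcpyDeviceToHost);
free(h_a);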