Passing dynamically allocated 2D array to device

Hey experts!

How do I pass a dynamically allocated 2D array from the host to the device without an iterative memcpy? Assume I can't flatten the host array into a 1D array.

I want to do this with a single transfer, e.g. one cudaMemcpy or a cudaMemcpy2D. What is the fastest way to do this? :)

size_t w = 10000;
size_t h = 10000;

float* d_matrix;
float** h_matrix;

// Host Allocation
h_matrix = (float**)malloc(h * sizeof(float*));
for (size_t i = 0; i < h; i++)
    h_matrix[i] = (float*)malloc(w * sizeof(float));

// Device Allocation
cudaMalloc(&d_matrix, h * w * sizeof(float));

// Fill host with data ...

for (size_t i = 0; i < h; i++)
    cudaMemcpy(d_matrix + (i * w), h_matrix[i], w * sizeof(float), cudaMemcpyHostToDevice);

This is an FAQ. What you have on the host is not a 2D array; it's an array of pointers to row (or column) vectors. Because those vectors are not stored contiguously, as they would be in a true 2D array, each one has to be copied to the GPU individually.

Note that host-side libraries such as BLAS also assume contiguously stored 2D arrays, so it may be best to switch the host-side data structure to that. You can then copy the matrix with a single cudaMemcpy (when transferring the entire matrix) or cudaMemcpy2D (when transferring a sub-matrix).

ahh I see. Thanks again!