Hey experts!
How do I copy a dynamically allocated 2D array (a float** with one malloc per row) from the host to the device without calling cudaMemcpy once per row? Assume I can't store the host data as a flat 1D array.
The code below is what I do now. I'd like to replace the copy loop with a single call, for example cudaMemcpy2D. What is the fastest way to do this? :)
size_t w = 10000;
size_t h = 10000;
float* d_matrix;
float** h_matrix;
// Host Allocation
h_matrix = (float**)malloc(h * sizeof(float*));
for (size_t i = 0; i < h; i++)
    h_matrix[i] = (float*)malloc(w * sizeof(float));
// Device Allocation
cudaMalloc(&d_matrix, h * w * sizeof(float));
// Fill host with data ...
// Copy one row at a time (this is the loop I want to get rid of)
for (size_t i = 0; i < h; i++)
    cudaMemcpy(d_matrix + (i * w), h_matrix[i], w * sizeof(float), cudaMemcpyHostToDevice);
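
For what it's worth, cudaMemcpy2D expects a constant pitch (byte distance) between consecutive source rows, and rows that come from separate malloc calls don't guarantee that. One possible workaround, if the allocation code can be changed, is to keep the float** interface but back it with a single contiguous block. Below is a minimal sketch in plain C; alloc_matrix_contiguous and free_matrix_contiguous are hypothetical helper names, not part of CUDA, and the sketch assumes h and w are both greater than zero.

```c
#include <stdlib.h>

/* Hypothetical helper: allocates h rows of w floats as ONE contiguous block,
   plus a table of row pointers so existing code can still write m[i][j].
   Assumes h > 0 and w > 0. Returns NULL on allocation failure. */
float **alloc_matrix_contiguous(size_t h, size_t w)
{
    float **rows = malloc(h * sizeof *rows);     /* row-pointer table   */
    float *data  = malloc(h * w * sizeof *data); /* all elements, flat  */
    if (rows == NULL || data == NULL) {
        free(rows);
        free(data);
        return NULL;
    }
    /* Point each row at its slice of the contiguous block. */
    for (size_t i = 0; i < h; i++)
        rows[i] = data + i * w;
    return rows;
}

/* Hypothetical helper: frees a matrix created by alloc_matrix_contiguous. */
void free_matrix_contiguous(float **rows)
{
    if (rows != NULL) {
        free(rows[0]); /* rows[0] is the start of the contiguous data block */
        free(rows);
    }
}
```

With that layout, h_matrix[0] points at all h * w elements in row-major order, so the per-row loop should collapse to a single call: cudaMemcpy(d_matrix, h_matrix[0], h * w * sizeof(float), cudaMemcpyHostToDevice). If the rows really must stay as separate allocations, some gathering step into a contiguous staging buffer seems unavoidable before a single copy can be issued.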