I would like to create a three-dimensional array whose dimensions are of inconsistent sizes. Most of the documentation for multi-dimensional arrays advocates creating a one-dimensional array of size length * width * height and manually calculating the correct index. This would be unacceptably wasteful in my application, as I have a small number of very large sub-arrays and a large number of much smaller sub-arrays. If I were to allocate an array with consistent dimensions large enough to fit the largest sub-array, I wouldn't have enough memory.
On the CPU, I create an array of (float**), link each element of that array to an array of (float*), and link each element of those arrays to an array of floats. cudaMalloc takes its output argument as a (void**), but I presume all pointers are the same size, so I should be able to cast my (float**) and (float*) handles to (void**) safely.
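For reference, the host-side structure I am describing looks roughly like this (a minimal sketch, with hypothetical names and error checking omitted; dims[i][j] gives the length of sub-array (i, j)):

```c
#include <stdlib.h>

/* Build a jagged 3-D structure on the host: an array of (float**),
   each element pointing to an array of (float*), each element of
   which points to a sub-array of floats. Every sub-array can have
   its own length, so nothing is padded to a common size. */
static float ***alloc_jagged(size_t n, const size_t *rows, size_t **dims)
{
    float ***top = malloc(n * sizeof *top);
    for (size_t i = 0; i < n; i++) {
        top[i] = malloc(rows[i] * sizeof *top[i]);
        for (size_t j = 0; j < rows[i]; j++)
            top[i][j] = malloc(dims[i][j] * sizeof *top[i][j]);
    }
    return top;
}
```

The point is that only the memory each sub-array actually needs is allocated, at the cost of two levels of pointer indirection per access.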
My question really pertains to linking all of these arrays together. This seems like a serial task best performed on the CPU. Can I cudaMalloc all of my arrays separately, link them together on the CPU, and then cudaMemcpy them onto the GPU? In particular, what exactly is the device pointer that cudaMalloc() writes through its (void**) argument? Can I assign the device pointers of the sub-arrays to the entries of the highest-level array on the CPU, cudaMemcpy the highest-level array to the GPU, and expect the structure to be correct in the global memory of the device? The documentation indicates that the GPU uses normal 32-bit pointers, so this seems like a reasonable thing to do, assuming that cudaMalloc just hands back 32-bit addresses into global memory, but perhaps I am confused.
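For concreteness, the linking step I have in mind for one level of the hierarchy would look roughly like this (an untested sketch, not a working program; lens is a hypothetical host array of sub-array lengths, and all error checking is omitted):

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

/* Sketch: allocate each sub-array on the device, record the device
   addresses that cudaMalloc writes back in a HOST staging array of
   (float*), then copy that pointer array to the device in a single
   cudaMemcpy. If those addresses are ordinary device pointers, the
   copied array should be usable as a (float**) inside a kernel. */
float **link_level(size_t n, const size_t *lens)
{
    float **h_ptrs = (float **)malloc(n * sizeof *h_ptrs);
    for (size_t i = 0; i < n; i++)
        cudaMalloc((void **)&h_ptrs[i], lens[i] * sizeof(float));

    float **d_ptrs;                       /* device-side pointer array */
    cudaMalloc((void **)&d_ptrs, n * sizeof *d_ptrs);
    cudaMemcpy(d_ptrs, h_ptrs, n * sizeof *h_ptrs,
               cudaMemcpyHostToDevice);
    free(h_ptrs);                         /* staging copy no longer needed */
    return d_ptrs;                        /* device (float**) */
}
```

Applying the same idea one level up (an array of the (float**) values returned here, copied to the device) would give the full three-level structure, which is exactly the pattern I am asking whether the API supports.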