Multi-dimensional arrays in global memory

I would like to create a three-dimensional array with dimensions of inconsistent sizes. Most of the documentation for multi-dimensional arrays advocates creating a one-dimensional array of size length × width × height and manually calculating the correct index. This would be unacceptably inefficient in my application: I have a small number of very large sub-arrays and a large number of much smaller sub-arrays, so if I were to allocate an array with consistent dimensions large enough to fit all of the sub-arrays, I wouldn't have enough memory.

On the CPU, I create an array of (float**), link each element of this array to an array of (float*), and link each element of those arrays to an array of floats. cudaMalloc only takes a (void**) argument, but I presume all pointers are the same size, so I should be able to cast to (void**) safely.

My question really pertains to linking all these arrays together. This seems like a serial task best performed on the CPU. Can I cudaMalloc all of my arrays separately, link them together on the CPU, and then cudaMemcpy them onto the GPU? In particular, what is the meaning of the value returned by cudaMalloc()? Can I assign the pointers returned by cudaMalloc for the sub-arrays to the entries of the highest-level array on the CPU, cudaMemcpy the highest-level array to the GPU, and expect the structure to be correct in the global memory of the device? The documentation indicates that the GPU uses normal 32-bit pointers, so this seems like a reasonable thing to do, assuming that cudaMalloc just returns 32-bit addresses into global memory, but perhaps I am confused.


Well, device pointers are the same size as pointers on the host, so they will be 64-bit on 64-bit platforms. But that is just a side issue.

You've got it right: cudaMalloc just returns a pointer into device memory. So you can build a float** list of pointers on the host, cudaMalloc each sub-array and store the returned device pointer into that host list, cudaMalloc a matching float** array on the device, and copy the host list of pointers over to it. The compiler will warn "can't tell what pointer points to, assuming global memory space" when you dereference that memory, but that is OK, because you have allocated global memory after all.
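A minimal sketch of that procedure, with made-up row counts and lengths (no error checking, to keep the shape visible):

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

int main(void) {
    /* Illustrative ragged sizes: a few small rows and one large one. */
    const int n = 4;
    const size_t len[] = { 1000, 10, 10, 500000 };

    /* Host-side staging array that will hold *device* pointers. */
    float **h_rows = (float **)malloc(n * sizeof *h_rows);
    for (int i = 0; i < n; ++i)
        cudaMalloc((void **)&h_rows[i], len[i] * sizeof(float));

    /* Device-side array of pointers, filled by copying the host list.
     * After this copy, d_rows[i] on the device is a valid pointer into
     * global memory, exactly the value cudaMalloc handed back. */
    float **d_rows;
    cudaMalloc((void **)&d_rows, n * sizeof *d_rows);
    cudaMemcpy(d_rows, h_rows, n * sizeof *h_rows, cudaMemcpyHostToDevice);

    /* d_rows can now be passed to a kernel and indexed as rows[i][j]. */

    for (int i = 0; i < n; ++i)
        cudaFree(h_rows[i]);
    cudaFree(d_rows);
    free(h_rows);
    return 0;
}
```

Note that the linking happens entirely on the host: only the final array of pointers is copied to the device.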

You'll have to be careful with your memory access pattern to get coalesced reads, since your pointers are now being read from global memory too. How will your threads access the list? If a block takes one of the float* and does stuff with it, then you can have threadIdx.x == 0 load that float* from the float** into shared memory and have the whole block work on it.
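That suggestion might look like the following kernel sketch. The names and the doubling operation are just illustrations; the point is the single shared-memory pointer load per block:

```cuda
#include <cuda_runtime.h>

/* One block per row: thread 0 loads the row's base pointer from the
 * global float** list into shared memory once, then the whole block
 * strides through that row with coalesced accesses. */
__global__ void scale_rows(float **rows, const int *len)
{
    __shared__ float *row;             /* one pointer per block */
    if (threadIdx.x == 0)
        row = rows[blockIdx.x];        /* single read of the pointer list */
    __syncthreads();                   /* make row visible to all threads */

    const int n = len[blockIdx.x];
    for (int j = threadIdx.x; j < n; j += blockDim.x)
        row[j] *= 2.0f;                /* consecutive threads, consecutive
                                          addresses: coalesced */
}
```

This also handles the ragged sizes naturally, since each block reads its own length from `len`.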

Thanks. Sounds like everything should work out just ducky, then. By the way, do you know why cudaMalloc takes in a (void**) and alters the content of the provided address rather than just returning a (void*) like normal malloc()? Given that the two would seem to have the same overall effect, it seems odd to deviate from the standard C usage.


Because every cuda* function (except cudaGetErrorString) returns a cudaError_t indicating whether the call succeeded and, if not, what the error was. It must have been a design choice the developers made to get completely consistent error handling across all CUDA functions, so that macros such as CUDA_SAFE_CALL can be used for error checking.
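A minimal wrapper in the spirit of CUDA_SAFE_CALL, showing why the uniform return type pays off (the macro name here is mine, not from the SDK):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

/* Because every cuda* call returns a cudaError_t, one macro covers
 * them all uniformly. */
#define CUDA_CHECK(call)                                             \
    do {                                                             \
        cudaError_t err = (call);                                    \
        if (err != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,       \
                    cudaGetErrorString(err));                        \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

int main(void) {
    float *d;
    CUDA_CHECK(cudaMalloc((void **)&d, 1024 * sizeof(float)));
    CUDA_CHECK(cudaFree(d));
    return 0;
}
```

If cudaMalloc returned the pointer directly like malloc(), it would need some other out-of-band channel for the error code, and this macro pattern wouldn't work uniformly.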