When you upload an array to the GPU, record the device pointer where it was placed (i.e., if you allocate memory on the device for each array, store the returned pointer; if you allocate one block for everything, store the pointer to the beginning of that block and keep an array of offsets).
Each time you upload an array to the GPU, record its device pointer (or device offset) in a CPU-side array.
Then upload this CPU-side array of device pointers or device offsets to the device.
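The scheme above can be sketched roughly like this (a minimal CUDA sketch of my own; the kernel and names such as `sum_rows` and `row_len` are not from the thread):

```cuda
#include <cuda_runtime.h>

// Dereference the device-side table of device pointers like a 2-D array.
__global__ void sum_rows(float **rows, int row_len, float *out)
{
    int r = blockIdx.x;
    float s = 0.0f;
    for (int i = 0; i < row_len; ++i)
        s += rows[r][i];
    out[r] = s;
}

int main(void)
{
    const int row_count = 4, row_len = 256;

    // 1. Allocate each row on the device and record the returned
    //    device pointer in a CPU-side array.
    float *h_rows[row_count];
    for (int r = 0; r < row_count; ++r)
        cudaMalloc(&h_rows[r], row_len * sizeof(float));

    // 2. Upload the CPU-side array of device pointers to the device.
    float **d_rows;
    cudaMalloc(&d_rows, row_count * sizeof(float *));
    cudaMemcpy(d_rows, h_rows, row_count * sizeof(float *),
               cudaMemcpyHostToDevice);

    // 3. A kernel can now index through the table.
    float *d_out;
    cudaMalloc(&d_out, row_count * sizeof(float));
    sum_rows<<<row_count, 1>>>(d_rows, row_len, d_out);
    cudaDeviceSynchronize();
    return 0;
}
```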
Have you actually tried this? I don’t think this will work as you’ve described it. What you get back from cudaMalloc is a “pointer” stored on the host that maps to memory allocated on the GPU; I don’t think that pointer is actually a meaningful pointer to GPU memory. It might almost be more accurate for cudaMalloc to take something like a cudaMemLookupKey_t * instead of a void **.
Anyway, if you place this pointer in an array in host memory and copy that array to memory allocated on the GPU, it will not be useful when you try to dereference it on the GPU. Or so I think. :)
Since there is no MMU on the GPU (maybe not true, just a guess), you will run into memory fragmentation after a number of allocations and deallocations of various sizes. Also, in my own experience, the CUDA allocator is not reliable for frequent allocation/deallocation.
In addition, when you allocate a block of GPU memory, the driver does something behind the scenes that takes time. For example, allocating 600 MB on an 8800GTX can take up to 20 ms! The initialization time depends on the allocation size.
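If you want to reproduce this number on your own card, a simple host-side timing sketch works (my own sketch; cudaMalloc is synchronous, so wall-clock bracketing is fine, and the 600 MB figure is just the size from the post):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec t0, t1;
    void *p = NULL;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    cudaMalloc(&p, 600u * 1024u * 1024u);   /* 600 MB */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ms = (t1.tv_sec - t0.tv_sec) * 1e3
              + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    printf("cudaMalloc(600 MB): %.2f ms\n", ms);

    cudaFree(p);
    return 0;
}
```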
So I came up with a solution: write your own memory allocator on the GPU. Basically, you reserve a large block of memory on the GPU up front and slice it into chunks for the application to use. When there is a fragmentation problem (no free block is larger than the requested size), I defragment by compacting the blocks. This works great for me and makes allocation/deallocation run in constant time.
By the way, just to clarify: the pointer address returned by cudaMalloc is the actual memory address used on the GPU. Not sure whether this holds for next-generation cards like the GT200, but it is at least true for the G80/G92 cores.