List of pointers to device

I want to copy a number of arrays to the device. Each of these arrays contains uint values.

I then want to upload an array of pointers to the previously described arrays. The part that I am struggling with is copying an array of pointers to the device.

Does anyone have an idea on how this can be achieved?

When you upload an array to the GPU, record the device pointer where the array was placed. For example, if you allocate memory on the device for each of your arrays, store the returned pointer; if you allocate one block for everything, store the pointer to the beginning of that memory chunk and keep an array of offsets.

Each time you upload an array to the GPU, record the device pointer (or device offset) in a CPU-based array.

Then upload this CPU-based array of device pointers (or device offsets) to the device.
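The steps above could be sketched roughly like this (array names, sizes, and the kernel are illustrative, not from the original posts):

```cuda
// Sketch: copy each uint array to the device, collect the returned
// device pointers in a host-side array, then copy that array of
// device pointers to the device as well.
#include <cuda_runtime.h>

#define N_ARRAYS 3
#define N_ELEMS  256

// Illustrative kernel that dereferences the device-side pointer table.
__global__ void useArrays(unsigned int **tables)
{
    unsigned int v = tables[blockIdx.x][threadIdx.x];
    (void)v; // ... use v ...
}

int main(void)
{
    unsigned int host_data[N_ARRAYS][N_ELEMS] = { { 0 } }; // ... fill ...

    // 1. Copy each array and record its device pointer on the host.
    unsigned int *h_ptrs[N_ARRAYS];
    for (int i = 0; i < N_ARRAYS; ++i) {
        cudaMalloc((void **)&h_ptrs[i], N_ELEMS * sizeof(unsigned int));
        cudaMemcpy(h_ptrs[i], host_data[i],
                   N_ELEMS * sizeof(unsigned int), cudaMemcpyHostToDevice);
    }

    // 2. Copy the host array of device pointers to the device.
    unsigned int **d_ptrs;
    cudaMalloc((void **)&d_ptrs, N_ARRAYS * sizeof(unsigned int *));
    cudaMemcpy(d_ptrs, h_ptrs,
               N_ARRAYS * sizeof(unsigned int *), cudaMemcpyHostToDevice);

    useArrays<<<N_ARRAYS, N_ELEMS>>>(d_ptrs);

    // 3. Clean up: free each array first, then the pointer table.
    for (int i = 0; i < N_ARRAYS; ++i)
        cudaFree(h_ptrs[i]);
    cudaFree(d_ptrs);
    return 0;
}
```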


I will try this.

Have you actually tried this? I don’t think this will work as you’ve described it. What you get back from cudaMalloc is a “pointer” stored on the host that maps to memory allocated in the GPU. I do not think that the pointer is actually a meaningful pointer to GPU memory. It might almost be more meaningful for cudaMalloc to take something like a cudaMemLookupKey_t * instead of a void **.

Anyways, this pointer, if placed in an array in host memory and copied to memory allocated on the GPU, will not be useful when you try to dereference it on the GPU. Or so I think. :)

The offsets should work, though.
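For completeness, the offset variant would look something like this: pack everything into one device allocation and index into it with element offsets (all names and sizes here are made up for illustration):

```cuda
// Sketch of the offset-based alternative: one device allocation holds
// all arrays; a second small array holds each array's starting offset.
#include <cuda_runtime.h>
#include <stddef.h>

#define N_ARRAYS 3

__global__ void useArrays(unsigned int *pool, const size_t *offsets)
{
    unsigned int v = pool[offsets[blockIdx.x] + threadIdx.x];
    (void)v; // ... use v ...
}

int main(void)
{
    size_t lengths[N_ARRAYS] = { 100, 200, 300 }; // element counts (example)

    // Compute each array's element offset and the total pool size.
    size_t h_offsets[N_ARRAYS], total = 0;
    for (int i = 0; i < N_ARRAYS; ++i) {
        h_offsets[i] = total;
        total += lengths[i];
    }

    unsigned int *d_pool;
    size_t *d_offsets;
    cudaMalloc((void **)&d_pool, total * sizeof(unsigned int));
    cudaMalloc((void **)&d_offsets, N_ARRAYS * sizeof(size_t));
    cudaMemcpy(d_offsets, h_offsets, N_ARRAYS * sizeof(size_t),
               cudaMemcpyHostToDevice);

    // Copy each source array into its slice of the pool, e.g.:
    // cudaMemcpy(d_pool + h_offsets[i], src[i],
    //            lengths[i] * sizeof(unsigned int), cudaMemcpyHostToDevice);

    useArrays<<<N_ARRAYS, 64>>>(d_pool, d_offsets);

    cudaFree(d_pool);
    cudaFree(d_offsets);
    return 0;
}
```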

I have tried this now and it seems to work OK. I maintain an array of the pointers returned by cudaMalloc on the CPU, then copy this array over to the GPU, and dereferencing it works as expected.

It works for me too.
Also, remember to free all the pointers in that array once you are done.

I have run into a problem with this approach. As my models grow, the number of pointers I need on the device obviously also grows. I seem to get problems at around 1000 pointers.

Does anyone know if there is a limit to how many pointers you can create in global memory? My initial thought was that this cannot be a problem, but now I am not so sure anymore.

Since there’s no MMU on the GPU (maybe not true, just a guess), you will run into memory fragmentation after a number of allocations and deallocations of various sizes. Also, in my own experience, I don’t think the CUDA allocator is reliable for frequent allocation/deallocation.

In addition, when you allocate a block of GPU memory, the GPU does something behind the scenes, which takes some time. For example, if you try to allocate 600 MB on an 8800 GTX, it can take up to 20 ms! The initialization time depends on the allocation size.

So I came up with a solution: write your own memory allocator on the GPU. Basically, you reserve a large block of memory on the GPU first and slice it into chunks for application use. When there is a fragmentation problem (no free block is larger than the requested size), I defragment by compacting blocks. This works great for me and makes allocation/deallocation run in constant time.
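A very reduced sketch of that pool idea, managed from the host side (the struct and function names are made up, and the free-list/compaction logic described above is omitted for brevity; this only shows the bump-allocation core):

```cuda
// Reserve one large device block up front, then hand out aligned
// sub-ranges of it instead of calling cudaMalloc per object.
#include <cuda_runtime.h>
#include <stddef.h>

typedef struct {
    char  *base;   // device pointer to the reserved block
    size_t size;   // total bytes reserved
    size_t used;   // bytes handed out so far (bump pointer)
} GpuPool;

int pool_init(GpuPool *p, size_t bytes)
{
    p->size = bytes;
    p->used = 0;
    return cudaMalloc((void **)&p->base, bytes) == cudaSuccess ? 0 : -1;
}

void *pool_alloc(GpuPool *p, size_t bytes)
{
    // Round the request up to a 256-byte boundary for coalesced access.
    size_t aligned = (bytes + 255) & ~(size_t)255;
    if (p->used + aligned > p->size)
        return NULL;            // out of space; a real allocator would
                                // compact free blocks here, as described
    void *ptr = p->base + p->used;
    p->used += aligned;
    return ptr;                 // device pointer into the pool
}

void pool_destroy(GpuPool *p)
{
    cudaFree(p->base);
    p->base = NULL;
}
```

Because only the bookkeeping lives on the host, "allocating" is just pointer arithmetic, which is where the constant-time behavior comes from.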

By the way, just to clarify, the pointer address returned by cudaMalloc is the actual memory address used on the GPU. Not sure if this is true for next-generation cards like the GT200, but it is at least true for the G80/G92 cores.