Auto-transfer memory from CPU when needed

Hi guys.

I have an application that uses a lot of textures (let's say, far more than even a Quadro K6000 can hold).

Right now I do the following: I transfer as many texture tiles as I can, run the app, mark the tiles that are missing, transfer some results from the GPU back to the CPU, manually load the tiles I need, and bring them back to the GPU (don't worry about performance, it is fine). Of course, I am having tons of issues with that.

Is there a way I can make CUDA do this for me? I can use plain arrays too (not necessarily textures).

Ideally, I would like to do the following :

  1. Give the CUDA API a massive amount of memory.
  2. Use this memory on the GPU. When the CUDA driver sees memory that is not present, it blocks everything, evicts some of the already-transferred data, brings in the missing piece, and continues.

I haven’t found anything like that. If it exists, please share a link to the documentation.


What do you mean by “memory […] is not present”?

I am asking whether this is possible:

  1. On the CPU, I want to tell CUDA: lazily load these 10 GB of memory, transferring as much of it as you want to the GPU.

  2. On the GPU: I am randomly accessing memory. If a piece is not transferred yet (i.e. not present), block everything, evict some of the already-transferred data, and fetch the piece I requested in its place.

I hope that makes it clearer.

As far as I know, the only way to do something like that is to do it yourself. Zero-copy is the closest thing I can think of that CUDA provides, but it requires page-locked memory, and trying to allocate 10 GB of that could be dangerous. Depending on what you are doing with it, it can also have huge performance implications (it is glacial compared to on-board memory).
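For reference, here is a minimal sketch of the zero-copy approach mentioned above (the buffer size and the `touch` kernel are placeholders, not anything from the thread): the host allocation is page-locked and mapped, so the kernel reads it over PCIe on every access rather than from on-board memory.

```cuda
#include <cuda_runtime.h>

__global__ void touch(const float *data, float *out, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = data[i] * 2.0f;   // every read of `data` crosses PCIe
}

int main()
{
    const size_t n = 1 << 20;      // placeholder size; 10 GB of pinned
                                   // memory would be risky, as noted above

    // Page-locked, mapped host allocation (zero-copy).
    float *h_data;
    cudaHostAlloc(&h_data, n * sizeof(float), cudaHostAllocMapped);

    // Device-side alias of the same host memory.
    float *d_data;
    cudaHostGetDevicePointer(&d_data, h_data, 0);

    float *d_out;
    cudaMalloc(&d_out, n * sizeof(float));

    touch<<<(unsigned)((n + 255) / 256), 256>>>(d_data, d_out, n);
    cudaDeviceSynchronize();

    cudaFree(d_out);
    cudaFreeHost(h_data);
    return 0;
}
```

Note that with unified addressing the host pointer can usually be passed to the kernel directly; `cudaHostGetDevicePointer` is the portable form.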

One of my pieces of software has a similar problem: X images of a fixed size, but only Y fit on the GPU, so I have to do some manual paging. It is not a difficult task if your problem is iterative in nature:

  1. Check whether the next image is on the GPU (in my case, whether its device pointer is NULL).
  2. If it is, continue; otherwise evict an existing image from the pool at random (or via a FIFO queue or whatever other policy). This is simply setting the old image's pointer to NULL while keeping the memory allocated.
  3. cudaMemcpy the next image into the newly available slot.
  4. Do the iteration.
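The steps above could be sketched roughly like this (names such as `ImageSlot`, `ensure_resident`, and the random eviction policy are illustrative, not Tiomat's actual code; it assumes the pool of Y device buffers is already full):

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

// One entry per image; only Y of the X images have a non-NULL d_ptr.
struct ImageSlot {
    float *d_ptr;      // NULL means "not resident on the GPU"
};

// Ensure image `idx` is resident; evict a random resident image if needed.
// h_images[idx] is the host copy of each image, all the same size in bytes.
float *ensure_resident(ImageSlot *slots, int n_images, int idx,
                       float **h_images, size_t bytes)
{
    if (slots[idx].d_ptr != NULL)          // 1. already on the GPU?
        return slots[idx].d_ptr;

    // 2. Pick a random resident victim; keep its device buffer, just
    //    mark the old owner as non-resident (pointer set to NULL).
    int victim;
    do {
        victim = rand() % n_images;
    } while (slots[victim].d_ptr == NULL);

    float *buf = slots[victim].d_ptr;
    slots[victim].d_ptr = NULL;

    // 3. Copy the requested image into the newly available slot.
    cudaMemcpy(buf, h_images[idx], bytes, cudaMemcpyHostToDevice);
    slots[idx].d_ptr = buf;
    return slots[idx].d_ptr;               // 4. caller runs its iteration
}
```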

In some cases you can even hide the copying across of data by exploiting concurrent kernel execution and copy.
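That overlap can be arranged with two streams and `cudaMemcpyAsync`: copy image i+1 in one stream while a kernel processes image i in the other. A hedged sketch (the `process` kernel and buffer layout are placeholders; the host images must be page-locked for the async copies to actually overlap with kernel execution):

```cuda
#include <cuda_runtime.h>

__global__ void process(const float *img, size_t n) { /* ... work ... */ }

// Double-buffered pipeline: while stream s[cur] processes image i,
// stream s[nxt] copies image i+1 across. In-stream ordering guarantees
// each kernel waits for the copy into its own buffer.
void pipeline(float **h_images, int n_images, size_t n_elems)
{
    size_t bytes = n_elems * sizeof(float);
    float *d_buf[2];
    cudaMalloc(&d_buf[0], bytes);
    cudaMalloc(&d_buf[1], bytes);

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    // Prime the pipeline with the first image.
    cudaMemcpyAsync(d_buf[0], h_images[0], bytes,
                    cudaMemcpyHostToDevice, s[0]);

    for (int i = 0; i < n_images; ++i) {
        int cur = i & 1, nxt = 1 - cur;
        if (i + 1 < n_images)              // prefetch the next image
            cudaMemcpyAsync(d_buf[nxt], h_images[i + 1], bytes,
                            cudaMemcpyHostToDevice, s[nxt]);
        process<<<(unsigned)((n_elems + 255) / 256), 256, 0, s[cur]>>>(
            d_buf[cur], n_elems);
    }
    cudaDeviceSynchronize();

    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
    cudaFree(d_buf[0]);
    cudaFree(d_buf[1]);
}
```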


Hi Tiomat,
Thanks for the answer.

I am already doing the manual paging and so on, and I use page-locked memory for some things. I just thought there might be something I was missing, since this sounds like a common feature to have (and thus that I was spending hours reimplementing something that already exists).