When is ‘virtual memory’ available in CUDA?

Hi, I’m wondering whether a ‘virtual memory’ feature (meaning that if the GPU memory is fully occupied, currently unused areas of GPU memory are ‘swapped’ out to host (CPU RAM) memory) is available in CUDA? Maybe with the next GPU generation, ‘Fermi’?
I consider it a really important feature: it is annoying if a kernel cannot execute because not enough GPU memory is available. The better alternative would be for execution to get slower (due to swapping) instead.

Look at pinned memory… I myself just break the data into pieces and move them to the GPU as needed (my datasets are usually far larger than 6 GB).


No, there is no hardware support for virtual memory on any current GPU or Fermi. As the previous poster points out, you can always implement your own paging in the application.

It is also possible to use so-called “zero-copy” which allows you to read CPU memory directly across the PCIe bus, although with a much higher latency. See the programming guide for details.
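For reference, a minimal zero-copy sketch (needs nvcc and a device whose `canMapHostMemory` property is set; the `scale` kernel is a made-up example, not part of any library):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float *p, int n)   // example kernel: reads/writes
{                                        // host memory across PCIe
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float *h_p, *d_p;

    cudaSetDeviceFlags(cudaDeviceMapHost);       // enable mapped host memory
    cudaHostAlloc(&h_p, n * sizeof(float),
                  cudaHostAllocMapped);          // pinned + mapped allocation
    for (int i = 0; i < n; ++i) h_p[i] = 1.0f;
    cudaHostGetDevicePointer(&d_p, h_p, 0);      // device-side view of h_p

    scale<<<(n + 255) / 256, 256>>>(d_p, n);     // kernel touches host RAM
    cudaDeviceSynchronize();

    printf("h_p[0] = %f\n", h_p[0]);
    cudaFreeHost(h_p);
    return 0;
}
```

Every access the kernel makes to `d_p` crosses the bus, so this only pays off for data that is read once or accessed sparsely.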

CUDA also doesn’t support dynamic memory allocation from kernels, so I’m not sure how a kernel would quit due to out of memory.

It’s worth noting that GPU memory sizes will continue to increase too.

You are right, allocating GPU memory inside a kernel is not possible. I meant allocating GPU memory in the routine which calls the kernels.
It would just be nice if the user could be freed from the task of managing GPU memory. Virtual memory would be very convenient, and one is used to it since it has been available for CPU memory for a very long time.
And for a real application it is not acceptable if the application ‘breaks’ because it ran out of GPU memory.
Note that you cannot always predict how much GPU memory an application will use (in order to pre-allocate it); it depends on the program logic etc. Of course GPU memory sizes are increasing, but e.g. 1 GB on the current GTX 280 is in fact ‘nothing’ if you are doing image/video processing…
best regards, Hannes

As Simon pointed out above, this is what zero-copy and pinned memory are all about.

My CPU application also “breaks” if I try to allocate 17 GB of RAM on a 16 GB machine that is also diskless (a valid production environment).

There is nothing new or different here between GPU and CPU. If your application needs too much memory and the system (read: GPU/CPU) can’t provide it, your application will fail, unless you employ some chunking mechanism as I mentioned above.

Not so true. You know exactly how much memory you’re going to use: you code it yourself (new/malloc/cudaMalloc…), so when you get to the line of code that needs to allocate memory, you know very well how much you need. Again, the solution is to add some code that determines at runtime how much memory you need, and then break your algorithm up accordingly to work in chunks.

Seismic datasets can reach tens of gigabytes; you can’t expect the GPU to have that much RAM on it (nor the CPU, for that matter). You just have to break your data into chunks, each fitting into the amount of memory available on your system.


Of course I know how much memory I will need at the moment I execute a specific kernel. But when working with many different libraries it is a lot of work to modify all of them appropriately to apply this ‘paging’/‘chunking’ mechanism at every place where memory is allocated. We are currently using e.g. the CUDPP library and some libraries from a university, and in the future definitely NVPP, CUFFT, some LAPACK libraries, and lots of other useful libraries.

Ah, I see. GPU memory virtualization at the kernel level might be possible, although with the current programming model it’s not obvious to me how the driver would know which memory pages each kernel would require.

It’s worth noting that Windows Vista already does some level of GPU memory virtualization: I believe multiple applications can each allocate close to the whole GPU memory, and it will manage swapping data back to main memory.

Yes, in the new ‘WDDM’ for Vista/Windows 7 there seems to be some kind of GPU memory virtualization. I’m not sure whether it also applies to CUDA applications or only to DirectX.

This could be exactly what I’m looking for.