API support for changing device memory mappings

For anyone at NVIDIA.

Would it be possible to add CUDA driver support for changing the mappings of allocated memory regions? For example, say I have allocated two regions using something like cudaMalloc, where A* points to region A and B* points to region B. Would it be possible to include a function that would change the mappings such that A* points to region B and B* points to region A without changing the actual values of A* or B*? Even the ability to call cudaMalloc and specify the address being allocated to would be sufficient. I am hoping that GPUs implement some form of virtual memory and that this would be possible by changing some page table entries in the driver.

We are currently trying to add support for context switching from one GPU to another in Ocelot, and this would make my life significantly easier. The main problem we are having is that after we copy all of the state from one GPU to another, any existing pointers to cudaMalloc'ed regions are no longer valid. We are currently handling this via pointer analysis, but this only works in cases where pointers are passed as parameters to kernels, not when they are embedded in other memory regions. More generally, we could handle it by allocating all memory using the zero-copy mechanism, but we would like to avoid the performance overhead.
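To make the relocation problem concrete, here is a minimal host-side sketch of the kind of pointer translation involved. It is not Ocelot's implementation; `Relocation` and `translate` are hypothetical names, and device addresses are treated as plain integers so the example runs without a GPU.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One allocation that was copied from the source GPU to the destination GPU.
// All names are hypothetical; addresses are modelled as integers.
struct Relocation {
    std::uintptr_t old_base;  // address of the region on the source device
    std::uintptr_t new_base;  // address of the copy on the destination device
    std::size_t    size;      // allocation size in bytes
};

// Translate a pointer value from the source device's address space to the
// destination's. Returns 0 if the value lies inside no known allocation.
std::uintptr_t translate(const std::vector<Relocation>& table,
                         std::uintptr_t old_ptr) {
    for (const Relocation& r : table) {
        if (old_ptr >= r.old_base && old_ptr - r.old_base < r.size) {
            return r.new_base + (old_ptr - r.old_base);
        }
    }
    return 0;
}
```

Translating a value is the easy part. The hard part is knowing which values are pointers at all, which is why this approach works for kernel parameters but not for pointers stored inside other device buffers.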

We have a similar issue in implementing checkpointing of CUDA applications. You can checkpoint GPU memory, but you can't restore the checkpointed memory objects to the same addresses, because the standard cudaMalloc gives you no control over where each object is allocated in global memory.

A workaround that we are currently using is a custom memory allocator. When an application starts, our checkpointer initializes a memory pool by taking most of the free global memory (e.g., 90% of the physical capacity). It intercepts all calls to cudaMalloc and serves each one by carving a region of the requested size out of the memory pool. This way, you have full control over where objects are placed in global memory.
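A minimal sketch of such a pool, assuming a simple bump allocator with no free list. `DevicePool` and the 256-byte alignment are assumptions made for illustration, not details of the checkpointer described above. In real use `base` would be the address returned by one large cudaMalloc at startup; here it is an integer so the sketch runs on the host.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical bump allocator over one large pre-allocated region.
class DevicePool {
public:
    DevicePool(std::uintptr_t base, std::size_t capacity)
        : base_(base), capacity_(capacity), used_(0) {}

    // Returns the address of a block of at least `size` bytes, or 0 if the
    // request cannot be satisfied. Sizes are rounded up to 256 bytes
    // (an assumed alignment).
    std::uintptr_t allocate(std::size_t size) {
        const std::size_t kAlign = 256;
        if (size == 0 || size > capacity_ - used_) return 0;
        std::size_t rounded = (size + kAlign - 1) / kAlign * kAlign;
        if (rounded > capacity_ - used_) return 0;
        std::uintptr_t addr = base_ + used_;
        used_ += rounded;
        return addr;
    }

    // Bytes handed out so far, including alignment padding.
    std::size_t used() const { return used_; }

private:
    std::uintptr_t base_;
    std::size_t capacity_;
    std::size_t used_;
};
```

Because placement depends only on the base address and the sequence of requests, replaying the same requests against a pool at the same base reproduces the same addresses.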

There is one caveat, though. The address at which the memory pool is allocated is nondeterministic because of paging in global memory. So, if the new pool address is far from the original address, you can't restore data at the same addresses. In practice, however, since the pool is allocated at the very beginning of the application, the address seems to always be the same. And at least you can safely check whether it actually is the same when restarting an application.
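The restart-time check could look something like the following sketch. `PoolRecord` and `can_restore_in_place` are hypothetical names, and the rule used here (same base, and a new pool at least as large as the saved one) is an assumption about what "the same" needs to mean, not the checkpointer's actual logic.

```cpp
#include <cstddef>
#include <cstdint>

// What a checkpoint would need to remember about the pool (hypothetical).
struct PoolRecord {
    std::uintptr_t base;  // address the pool was allocated at
    std::size_t    size;  // pool size in bytes
};

// True if every address valid in the saved pool is also valid in the fresh
// one, so objects can be copied back to their original addresses.
bool can_restore_in_place(const PoolRecord& saved, const PoolRecord& fresh) {
    return fresh.base == saved.base && fresh.size >= saved.size;
}
```

If the check fails, the restart can at least stop with a clear error rather than silently restoring data at the wrong addresses.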


Out of curiosity, how does that work if you have multiple programs running that all want to allocate nearly all the free memory? Does one take most, then the next takes most of the leftovers, then the next takes most of those leftovers, giving each one less and less to work with? I'm honestly curious about it and don't mean to insult.
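To put numbers on that scenario, here is a small sketch that assumes every process naively claims 90% of whatever is free when it starts. The function name and the 90% figure applied to free memory are assumptions for illustration only.

```cpp
#include <cstddef>
#include <vector>

// Pool size (in MB) each successive process would get if every one of them
// claims 90% of the memory that is still free when it starts.
std::vector<std::size_t> successive_pool_sizes(std::size_t free_mb,
                                               int processes) {
    std::vector<std::size_t> pools;
    for (int i = 0; i < processes; ++i) {
        std::size_t pool = free_mb * 9 / 10;  // integer arithmetic, rounds down
        pools.push_back(pool);
        free_mb -= pool;
    }
    return pools;
}
```

Starting from 1000 MB free, three processes would get 900 MB, 90 MB, and 9 MB, so under that assumption the shares do shrink by a factor of ten each time.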

I use a memory manager similar to the one naoya is describing in my linear algebra codes, and my solution is simple: compute exclusivity. We keep all our dedicated compute GPUs in compute-exclusive mode, which guarantees only one context per GPU at a time.


Is the range of addresses returned by cudaMalloc relatively constant from one invocation of an application to another?

At least on Linux, with dedicated compute cards, it is. The first thing my codes do is allocate as much free device memory as is available, and the starting address returned by cudaMalloc is extremely predictable. So predictable that successive runs of the same application get allocated blocks that still contain the results of the preceding run in global memory, with perfect alignment (I have actually used that technique to recover the partial results of crashed or killed applications).

When a display manager is running on a card, that piece of serendipity goes away. On an 896 MB GT200 card, the free memory available can vary by ±100 MB, even when running a minimal 2D window manager, and things move around in device memory quite a lot. On a WDDM operating system with active paging, I wouldn't like to guess what happens.

This has been added to my list of things to investigate. Thanks all.