Introducing Low-Level GPU Virtual Memory Management

Originally published at: Introducing Low-Level GPU Virtual Memory Management | NVIDIA Technical Blog

There is a growing need among CUDA applications to manage memory as quickly and as efficiently as possible. Before CUDA 10.2, the number of options available to developers was limited to the malloc-like abstractions that CUDA provides. CUDA 10.2 introduces…


I’m a beginner with low-level GPU virtual memory management. I have gone through the vecAdd example, and I think it wouldn’t hurt to confirm my understanding with you here.

  1. The low-level GPU virtual memory management APIs can avoid the overhead of a message-passing interface such as NVSHMEM. Instead of using put/get to move data to and from remote memory, they allow N devices (assume all N devices are on one node) to access the same virtual memory. Would that be correct?

  2. I’m thinking about how to refactor an existing MPI code to use the low-level GPU virtual memory management APIs, but I’m confused about how to do it. If I still want to keep the MPI structure in the code, say one MPI rank per GPU, but use these APIs instead of MPI for data movement, is that feasible? I simply tried running vectorAddMMAP under MPI, but it just repeated the same work N times when run with N ranks. Any suggestions for using the low-level GPU virtual memory management APIs in an MPI code?

Hello Nan1215, thanks for commenting! I’ve replied to each of your questions below:

  1. Even without the CUDA Virtual Memory Management APIs, you can map remote device memory (a.k.a. peer memory) from the same node and access it directly through a pointer via the runtime API cudaEnablePeerAccess (but note the sections of the blog post detailing the pain points of that call). From the device’s point of view, both approaches map peer memory the same way, and I believe NVSHMEM internally does the same thing when the memory is on the same node. That said, no CUDA APIs currently support multi-node access, which is where NVSHMEM scales beyond what CUDA provides.

  2. The vectorAddMMAP sample doesn’t do any cross-process communication the way an MPI application would. You could modify vectorAddMMAP to limit the devices used based on the MPI rank, but that would just replicate the same app on different GPUs. If you’re looking for inter-process communication, take a look at the memMapIPCDrv CUDA sample (a rough sketch of the export/import flow it uses is below). Also, I believe CUDA-Aware OpenMPI intends to support (if it doesn’t already) memory allocated with the CUDA Virtual Memory Management APIs, so if your application leverages CUDA-Aware OpenMPI, you may not need application changes to take advantage of that support. It may require a certain version of CUDA-Aware OpenMPI, though (I’m not sure exactly which version; I can find out if you need it).
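To give you an idea of what memMapIPCDrv does, here is a rough, untested sketch of the export/import flow (Linux, POSIX file-descriptor handles; error checking, padding to the allocation granularity, and the actual fd exchange between processes are omitted):

// Exporting process: create shareable physical memory on device 0.
CUmemAllocationProp prop = {};
prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
prop.location.id = 0;                      // device owning the physical pages
prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR;

size_t granularity = 0;
cuMemGetAllocationGranularity(&granularity, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);
size_t size = granularity;                 // must be a multiple of the granularity

CUmemGenericAllocationHandle handle;
cuMemCreate(&handle, size, &prop, 0);

int fd = -1;                               // send this fd to the other process
cuMemExportToShareableHandle(&fd, handle, CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR, 0);

// Importing process: turn the received fd back into a handle and map it.
CUmemGenericAllocationHandle imported;
cuMemImportFromShareableHandle(&imported, (void *)(uintptr_t)fd,
                               CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR);

CUdeviceptr ptr;
cuMemAddressReserve(&ptr, size, 0, 0, 0);
cuMemMap(ptr, size, 0, imported, 0);

CUmemAccessDesc access = {};
access.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
access.location.id = 0;                    // device that will access the mapping here
access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
cuMemSetAccess(ptr, size, &access, 1);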

Please let me know if this answers your questions, or if you have additional questions!

Hi Killogge,

Thanks for your explanations. One follow up question: does virtual memory API support the below case:

For example, I have 3 GPUs, and each is assigned 1 MPI rank, so I have 3 MPI ranks in total. Is it feasible to let the 3 MPI ranks access one virtual memory range simultaneously in one CUDA stream?

So, I might be failing to understand your follow-up question; if so, forgive me. As I understand it, with MPI each rank is its own process (either on the same node or across nodes), and CUDA streams themselves are not shareable across process boundaries. That said, on some platforms CUDA does support what we call inter-process events, which can be created via the cudaEventCreate / cuEventCreate APIs. With these kinds of events you can easily synchronize each rank’s stream using the standard cudaEventRecord / cudaStreamWaitEvent APIs. This is all outside the scope of the CUDA Virtual Memory Management APIs mentioned earlier.
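To make that concrete, here is a minimal, untested sketch of the inter-process event flow using the runtime API (streamA / streamB are placeholder streams, and the handle exchange between ranks, e.g. via MPI_Send / MPI_Recv, is omitted):

// Rank A: create an interprocess event and export a handle for other ranks.
cudaEvent_t event;
cudaEventCreateWithFlags(&event, cudaEventInterprocess | cudaEventDisableTiming);

cudaIpcEventHandle_t ipcHandle;
cudaIpcGetEventHandle(&ipcHandle, event);
// ... send ipcHandle to rank B ...

cudaEventRecord(event, streamA);           // record after the work rank B must wait on

// Rank B: open the handle and make its own stream wait on rank A's event.
cudaEvent_t remoteEvent;
cudaIpcOpenEventHandle(&remoteEvent, ipcHandle);
cudaStreamWaitEvent(streamB, remoteEvent, 0);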

CUDA-Aware OpenMPI will transparently perform the requested MPI operations for you across ranks, and even across nodes if necessary. Check out the CUDA-Aware OpenMPI link above for an example as well as more information. As I said, I’m not sure whether memory allocated with the new CUDA Virtual Memory Management APIs is currently supported in the latest version of CUDA-Aware OpenMPI, but I believe it will be eventually.

As to your particular use case of three ranks all accessing the same physical memory (as each rank wouldn’t necessarily have the same virtual address, since they are different processes after all), yes, that is feasible and is in fact the main use case for supporting these IPC API mechanisms.

I hope this answers your questions, please feel free to ping us back if you have more questions.

Hi,

I am trying to improve an implementation of a FIFO buffer from CPU to GPU and vice versa.
I was wondering how we can use Virtual Memory Management and Unified Memory (with the features of compute capability 7.x) at the same time, as it seems not possible at the moment. Does the PINNED allocation type, or the limited set of options in cuMemSetAccess, prevent the CPU from accessing the data without using memcpy?

Thanks

Hi bloch.aurelien,

Unfortunately, the Virtual Memory Management APIs don’t currently support Unified Memory or CPU memory. We do have plans to support this eventually, and yes, fields like type and location in CUmemAllocationProp were meant to make these kinds of decisions explicit so that they can be extended later.

That said, for your specific use case, one idea might be to use a combination of cudaHostRegister and OS-specific calls to manage CPU VAs (a rough sketch is below). Unfortunately, not all platforms will give back the same device VA for a CPU VA (see cudaHostRegister / cuMemHostRegister for more information), but support for this can be queried.
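For illustration, a minimal, untested sketch of that idea on Linux (the mmap size and flags are just placeholders for whatever OS-specific VA management you end up using, and device 0 is assumed):

#include <sys/mman.h>
#include <cuda_runtime.h>

int main() {
    // Reserve a CPU VA range with an OS-specific call (Linux mmap here).
    size_t size = 1 << 20;
    void *hostPtr = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    // Register it with CUDA so the GPU can access it without an explicit memcpy.
    cudaHostRegister(hostPtr, size, cudaHostRegisterMapped);

    // On some platforms the device-side pointer equals hostPtr; otherwise query it.
    void *devPtr = nullptr;
    cudaHostGetDevicePointer(&devPtr, hostPtr, 0);

    // Whether the host pointer itself is usable on the device can be queried:
    int canUseHostPtr = 0;
    cudaDeviceGetAttribute(&canUseHostPtr,
                           cudaDevAttrCanUseHostPointerForRegisteredMem, 0);

    cudaHostUnregister(hostPtr);
    munmap(hostPtr, size);
    return 0;
}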

Hope this helps you out, feel free to ping us back if you have more questions! Happy Holidays!

Hey Cory!

I am experimenting with the new API and had a question about its usage with multiple GPUs.
Consider your striped example: I have a contiguous virtual memory range, and I would like to use cuMemSetAccess to set access for all the mapped devices. Should I be able to set access one stripe_size at a time, such that the resident device (the device holding the physical memory) gets read-write access and all other (remote) devices get read-only access?

My loop looks like the following, but it doesn’t seem to work:

for (std::size_t r_idx = 0; r_idx < phys.resident_devices.size(); r_idx++) {
        for (std::size_t idx = 0; idx < mapping_devices.size(); idx++) {
                access_descriptors[idx].location.type = CU_MEM_LOCATION_TYPE_DEVICE;
                access_descriptors[idx].location.id = mapping_devices[idx];

                // If the device being mapped to is where the physical memory resides
                // use the read-write access flag, otherwise, use read-only
                if(mapping_devices[idx] == phys.resident_devices[r_idx])
                    access_descriptors[idx].flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
                else
                    access_descriptors[idx].flags = CU_MEM_ACCESS_FLAGS_PROT_READ;
        }

        // XXX: Can I set access for only a portion of the range at a time, but eventually equal to padded_size?
        cuMemSetAccess((CUdeviceptr)virt.ptr + (stripe_size * r_idx),
                        stripe_size, 
                        access_descriptors.data(),
                        access_descriptors.size());
}

Let me know if the valid thing is to go read the blog again because I read it a while ago and may have missed this. Really cool stuff btw, thank you!

Edit: The docs state the following; can you elaborate on whether this means what I suggested is not possible and you have to set access for the entire range?

The range must be a fully mapped address range containing all allocations created by cuMemMap / cuMemCreate.

Edit 2: My fault; on two GV100s, accessing a read-only VM range just exited with no apparent errors. The above code worked flawlessly once I stopped accessing read-only memory on the remote devices.

Hi neoblizzz!

Thanks for trying out the new APIs. Yeah, at the moment CU_MEM_ACCESS_FLAGS_PROT_READ, while defined, is not implemented yet and will return CUDA_ERROR_NOT_SUPPORTED if used. We plan to implement support for it very soon, along with some possible performance optimizations for that usage, so stay tuned!

As to mapping only a portion of an allocation (which is what I think you’re asking about in your code snippet), that is not currently supported either, nor is a non-zero offset. This is also something we are hoping to support very soon!

If you have any other suggestions or requests for this API, please let us know and we’ll see if we can’t implement them in a future CUDA release. Thanks for your feedback, hope the above helps!


Ah, makes so much more sense now! Thank you for answering my questions. I do have two points of feedback that I came up with right after messing with it for a bit:

  1. Since everyone is familiar with managed memory, something like a read-mostly hint (duplicate on read) would be really nice. I am able to imitate this with the current APIs by allocating the size of the physical array multiplied by the number of GPUs; that way, each stripe gets the full array, and I have to issue a number of cudaMemcpy calls to manually duplicate whatever data I want. But if there were a property for this within the API, much like the hints cudaMallocManaged memory gets, that would be really cool!
  2. Will it ever be possible to map and unmap memory within a kernel (or is this already possible; I suppose these driver-level calls are host-only)? I am experimenting with some sparse kernels where the output size isn’t known before execution (you can think of SpGEMM as an example), and I would like to map/unmap pre-allocated memory to mimic dynamic allocation within the kernel as it learns what the output size is going to be.

Again, thank you for answering my questions. These APIs are really really cool. Looking forward to messing with them a bit more and the future updates.

Thanks for the feedback! Let me respond to the items you listed as best I can.

[Summary]: Read duplication support with the CUDA Virtual Memory Management APIs

  1. Interesting idea, sort of like how graphics allocators handle SLI implicitly? Definitely something we’ll keep in mind as new use cases come up. We do plan on supporting managed memory in the future, so this request might come as part of that eventually; we’ll see!

Will it be ever possible to Map and Unmap memory within the kernel…

  1. Unfortunately, these APIs must translate to OS-level system calls that manipulate the GPU virtual address space, which is managed by the operating system, so without some kind of CPU involvement this request would be difficult to implement. There are other concerns as well, but this would be the main one I can see. We’ll definitely keep it in mind as we move forward, but I wouldn’t expect this to be readily available any time soon.
    Alternatively, if you’re looking for device-side dynamic allocation of memory, you might want to consider the device-side malloc() implementation we have today (see the sketch after this list). Recent improvements in scalability and performance have made this a more viable option for similar use cases; the only issue is you need to move your pre-allocations to the device’s internally managed heap, sized with cuCtxSetLimit.
    Another option is a recently released feature, similarly tied to the CUDA Virtual Memory Management APIs described here, called Sparse Textures, which might be of use. While geared more toward the graphics side, it might be something that could be adapted to fit your use case. Unfortunately I don’t have a sample readily available to give you on the use of these APIs, but if you’re interested we can try to put one together for you.
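For reference, a minimal sketch of the device-side malloc() path mentioned above; cudaDeviceSetLimit is the runtime-API equivalent of cuCtxSetLimit, and the 256 MiB heap size and launch configuration are just illustrative numbers:

#include <cuda_runtime.h>

__global__ void buildOutput(size_t n) {
    // Each thread allocates its own scratch space from the device-side heap.
    int *scratch = static_cast<int *>(malloc(n * sizeof(int)));
    if (scratch == nullptr) return;        // heap exhausted
    for (size_t i = 0; i < n; ++i) scratch[i] = static_cast<int>(i);
    // ... use scratch to build the output ...
    free(scratch);
}

int main() {
    // Size the device heap before the first kernel that calls malloc().
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 256ull << 20);  // 256 MiB
    buildOutput<<<128, 128>>>(64);
    cudaDeviceSynchronize();
    return 0;
}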

Looking forward to seeing what kinds of things you build with these new features!


Both of these suggestions seem promising. From my past experimentation, at least, it seems like device-side malloc() was not fast enough to compete with other algorithmic approaches that address sparse problems with unknown output sizes. Sparse Textures I had no idea about; they look interesting, I’ll check them out! Thanks again!

Can I have two VA ranges mapped to one physical allocation on one CUDA device?

So, the API doesn’t prevent such use. We call this “Virtual Aliasing”, and the CUDA Virtual Memory Management APIs do allow for it, but coherency between the different virtual mappings is not well defined (a minimal sketch of such a mapping is below). There should be an update to the CUDA Programming Guide coming soon that explains the guarantees made today, but the general idea is: no two accesses to different addresses mapped to the same physical allocation are guaranteed to be coherent within the same grid (even between different threads, warps, etc. in the same grid), or with any other grid running concurrently on the same device. It’s a little more complicated than this, but rest assured a programming model guide update will be coming to properly address this in the near future! Hope this helps!
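For completeness, here is a minimal, untested sketch of what such an aliased mapping looks like with these APIs (error checking omitted; device 0 and a single granularity-sized allocation are assumed):

CUmemAllocationProp prop = {};
prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
prop.location.id = 0;

size_t granularity = 0;
cuMemGetAllocationGranularity(&granularity, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);
size_t size = granularity;                 // one chunk, for illustration

// One physical allocation...
CUmemGenericAllocationHandle handle;
cuMemCreate(&handle, size, &prop, 0);

// ...mapped into two distinct VA ranges (the virtual aliases).
CUdeviceptr va1, va2;
cuMemAddressReserve(&va1, size, 0, 0, 0);
cuMemAddressReserve(&va2, size, 0, 0, 0);
cuMemMap(va1, size, 0, handle, 0);
cuMemMap(va2, size, 0, handle, 0);

CUmemAccessDesc access = {};
access.location = prop.location;
access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
cuMemSetAccess(va1, size, &access, 1);
cuMemSetAccess(va2, size, &access, 1);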


Hi,

I have a long-running application which concurrently runs various tasks on GPUs, some of them requiring fairly large buffers, say, 4-8 GiB. Buffer sizes vary between tasks, and the possible required buffer sizes are not known at start time and can change during runtime. This means the application might start off with a task requiring 3 GiB, creating 5 buffers so 5 tasks can execute concurrently. Later on a new task requiring 5 GiB may be added, and I want to be able to reassign the same physical memory which was used in the 3 GiB buffers to 5 GiB buffers on demand rather than run out of memory on a 16 GiB GPU.

Memory usage of tasks also varies over their lifetime, e.g. an operation in the middle or at the end of the task may need an extra 2 GiB on top of 3 GiB required for the entire duration of the task (not necessarily in the same contiguous VA range). Ideally I’d want to time-share those 2 GiB efficiently, too.

My idea is to allocate a number of large VA ranges (each sized to the maximum buffer size I will ever need) and a number of fixed-size physical memory blocks (say, 16 or 64 MiB). When I want to run a task I take a VA range and map as many physical blocks to it as the task will need permanently. If running out of physical blocks when multiple tasks want to run I might try to allocate more, and if that fails, map the same physical blocks to multiple VA ranges and synchronise access to those blocks between tasks using CUDA events. If the task needs additional temporary memory, I give it another VA range and map physical blocks to it, making sure that no physical blocks are shared between the two VA ranges of the task (as that would deadlock).

Does this sound like a sensible scheme? Is it possible with the current VMM API?

Can a physical memory handle be mapped to more than one VA range? Judging by the previous answer I believe that’s a yes, and it’s safe to do so for my use case as I don’t expect any coherency between the two VA ranges (I only want to share the backing store, not data between VA ranges). Is that right?

Can a physical memory handle be mapped multiple times within the same VA range (e.g. to create a ring buffer)?

A lot of interesting stuff in this post, I hope I can answer these questions fully. Short answer to your query is “Yes, this is all very reasonable and a highly encouraged use case”. Let me pick apart your post for the longer answer.

My idea is to allocate a number of large VA ranges (each sized to the maximum buffer size I will ever need) and a number of fixed-size physical memory blocks (say, 16 or 64 MiB).

Sounds reasonable to me; we actually have a sample that partially covers your idea in this blog post, in the section on resizing an allocation, with performance metrics as well (a rough sketch of the growth pattern is below). One issue you may run into, depending on the size of the physical memory blocks and the sheer number of them mapped in a contiguous VA range, is that some APIs like cuMemcpy* and cuMemset* may scale (on the CPU) with the number of physical allocations in the specified range. This has to do with the fact that the physical allocations within the range don’t necessarily belong to the local device doing the copy (e.g. peer-mapped memory), and the driver needs to detect this to perform the correct copy operation. The copy actually performed by the GPU should not be affected. The larger the “chunks” within the requested VA range, the less of an impact this has on the CPU performance of these calls.
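In case it helps, here is a rough, untested sketch of that grow-on-demand pattern, adapted to fixed-size chunks (error handling is omitted, chunkSize is assumed to already be a multiple of the allocation granularity, and the sizes are illustrative):

#include <vector>
#include <cuda.h>

// Reserve a VA range sized for the largest buffer you will ever need,
// then commit physical chunks into it only as tasks require them.
CUdeviceptr base;
size_t maxSize = 8ull << 30;               // 8 GiB VA reservation
size_t chunkSize = 64ull << 20;            // 64 MiB physical chunks
std::vector<CUmemGenericAllocationHandle> chunks;
size_t committed = 0;

CUmemAllocationProp prop = {};
CUmemAccessDesc access = {};

void setup() {
    cuMemAddressReserve(&base, maxSize, 0, 0, 0);
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = 0;
    access.location = prop.location;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
}

// Grow the committed portion of the buffer by one chunk.
void growByOneChunk() {
    CUmemGenericAllocationHandle h;
    cuMemCreate(&h, chunkSize, &prop, 0);
    cuMemMap(base + committed, chunkSize, 0, h, 0);
    cuMemSetAccess(base + committed, chunkSize, &access, 1);
    chunks.push_back(h);
    committed += chunkSize;
}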

Can a physical memory handle be mapped to more than one VA range?

Yes, but please note the reply referenced above. I believe what you have proposed so far is within the scope of the programming guide updates that are to come soon.

Can a physical memory handle be mapped multiple times within the same VA range (e.g. to create a ring buffer)?

I believe you’re asking whether you can do something like the following:

cuMemAddressReserve(&ptr, sz * 2, 0, 0, 0);
cuMemMap(ptr,      sz, 0, handle, 0);
cuMemMap(ptr + sz, sz, 0, handle, 0);

Yes, but again, please note the coherency issue described above. I reiterate this point because it can be very difficult to fully grasp the ramifications, not to mention debug the issue should it come up in your application. As of this writing, accesses to memory mapped in such a way are undefined by the CUDA memory model, but an upcoming update to some of the wording will partially address this, and of course further improvements will follow. The information in the comment above is more of an “unofficial” answer to what CUDA guarantees in this regard. Stay tuned for the official answer! :)

Hope this helps, let me know if I missed anything :)


Thanks, this is all very useful!

One issue you may run into, depending on the size of the physical memory blocks and the sheer number of them mapped in a contiguous VA range, is that some APIs like cuMemcpy* and cuMemset* may scale (on the CPU) with the number of physical allocations in the specified range.

Is this dependent on the number of physical memory blocks in the entire VA, or only within the range affected by the copy/memset operation? The only Memcpy operations I have (apart from filling buffers containing constant data once) are relatively small HtoD/DtoH copies (tens of MiB at most) at the beginning and end of each task. The bulk of the buffers is only required to hold intermediate results. If necessary I could split input and result buffers from the rest of the working memory.

Is this dependent on the number of physical memory blocks in the entire VA, or only within the range affected by the copy/memset operation?

Only within the range affected by the copy/memset operation, yup. :)

Hey Cory, hoping you could help me out here.

Some background:
I’ve forked from a big open-source project and have extended it for my purposes.
Part of my extension was to implement some computation using CUDA. In my endeavor to optimize this computation, I’ve benchmarked several different methods of computing. The most performant one was using the CUDA graph API.
However, the CUDA graph API is a bit limited, and in its current state it forced me to use a redundant memcpy.
This memcpy was a bottleneck for my performance. In order to avoid it, I’ve implemented what you’ve called “Virtual Aliasing” earlier in this thread.
I’ve overridden the project’s GPU memory allocator in order to allocate GPU memory using the Virtual Memory Management API. This allows me to map two virtual addresses to the same physical memory, thus sparing the redundant memcpy.
I know this is not ideal, as the allocation granularity of the Virtual Memory Management APIs is quite large. However, once cuMemMap supports mapping at a non-zero offset, I can implement a heap over the large buffers allocated (the project already did this over cuMemAlloc). For now, I’m not running into memory consumption problems, so this isn’t an issue.

My problem:
Up until now, I’ve run my application on a single GPU. I’m trying to scale up to multiple GPUs.
The open-source project tries to copy memory between devices, using cuMemcpyDtoDAsync.
cuMemcpyDtoDAsync fails with CUDA_ERROR_INVALID_VALUE.
If I correctly understand, and I hope you can confirm, in order for this API to work, two conditions have to be satisfied:

  • Device peer access must be supported (and enabled).
  • When calling cuMemSetAccess, I have to enable access for both devices.

In my setup, device peer access isn’t supported, so I’ve tried to substitute calls to cuMemcpyDtoDAsync with calls to cuMemcpyPeerAsync. This solution seems to work, but isn’t viable, because I cannot substitute all of the cuMemcpyDtoDAsync calls (cuMemcpyPeerAsync requires additional parameters which aren’t accessible from everywhere in the code base).

As I understand from Programming Guide :: CUDA Toolkit Documentation, the Virtual Memory Management APIs manage the “unified virtual address space”. Is this different from the “Unified Memory system”? The naming is quite confusing. Are these systems even related?

All in all, I have several questions:

  • How do device-to-device memcopies behave when using the Virtual Memory Management APIs?
  • Is my understanding of why cuMemcpyDtoDAsync fails correct? If I managed to enable peer access, would that make the API work? Would connecting my two GPUs with NVLink enable peer access?
  • Is the Unified Memory system even related to the unified VA space? I’ve tried looking into APIs like cuMemAllocManaged and cuStreamAttachMemAsync, but just now I’ve realized they are probably totally unrelated.
  • Do you have any other idea on how to solve this problem?

— Omri

Hi omri4!

A lot of information here, thank you for being so detailed in your question, I’ll try to answer as best I can!

This allows me to map two virtual addresses to the same physical memory, thus sparing the redundant memcopy.

That’s great! Please keep in mind the caveats mentioned above, they can be tricky to diagnose as a problem if you’re not careful.

However, once cuMemMap supports mapping to a non-zero offset, I can implement a heap over the big buffers allocated (the project already did this, over cuMemAlloc).

Noted, I’ll try to update this forum once this support goes in, it has been heavily requested as noted earlier :)

If I correctly understand, and I hope you can confirm, in order for this API to work, two conditions have to be satisfied:

So, part of the article here covers this aspect, but I’ll outline it: you don’t need to use cudaEnablePeerAccess in order to enable peer access to an allocation made with the CUDA Virtual Memory Management APIs. You just need to call cuMemSetAccess() and specify the peer GPU you wish to grant access to, regardless of whether you have called cudaEnablePeerAccess; without that call, the memory is not accessible to the peer GPU (a short sketch is below). Also keep in mind the limitations certain system configurations place on outstanding peer accesses, as most directly connected P2P configurations only support accessing at most eight GPUs at a time across the entire system.
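To illustrate, a minimal sketch (assumes the VA range at dptr of length size is backed by a cuMemCreate allocation on device 0, and device 1 is the peer that needs access):

// Grant both the owning device and a peer device read-write access to the
// mapped range. No cudaEnablePeerAccess call is needed for this mapping.
CUmemAccessDesc accessDescs[2] = {};

accessDescs[0].location.type = CU_MEM_LOCATION_TYPE_DEVICE;
accessDescs[0].location.id = 0;            // device owning the physical memory
accessDescs[0].flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;

accessDescs[1].location.type = CU_MEM_LOCATION_TYPE_DEVICE;
accessDescs[1].location.id = 1;            // peer device that should also have access
accessDescs[1].flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;

cuMemSetAccess(dptr, size, accessDescs, 2);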

As I understand from Programming Guide :: CUDA Toolkit Documentation, the Virtual Memory Management APIs manage the “unified virtual address space”. Is this different from the “Unified Memory system”? The naming is quite confusing. Are these systems even related?

Yes, I can see the confusion, let me try to explain. These are two different systems at work:

  • Unified Virtual Address Space is the driver’s attempt to synchronize the address space between the CPU and the GPU. This allows an application to call, say, cuMemAllocHost() and get a pointer that can be accessed by the CPU, while the same address can be passed to the device without having to translate it first with cuMemHostGetDevicePointer(). We also support the cuMemHostRegister API, where the CPU virtual address is chosen by the application in some way; thus all allocations made with CUDA on a unified-address-space system will only use an address that was available on the CPU. In the case of cuMemAlloc and the CUDA Virtual Memory Management APIs, this typically means we reserve the corresponding CPU virtual addresses internally in order to block functions like mmap/VirtualAlloc and malloc from allocating those virtual addresses and causing confusion.
  • Unified Memory is essentially everything related to the cuMemAllocManaged APIs, allowing memory to migrate between the CPU and GPU. As a consequence of the programming model it exposes, Unified Memory requires a unified virtual address space to function.

The programming guide links provide a lot more detail than I can reasonably put in a blog post here, but hopefully this answers your question.

How do device-to-device memcopies behave when using the Virtual Memory Management APIs?

The general answer to this question is a bit more complicated. I believe I answered your peer-device error above (let me know if I did not), so I’ll focus on how peer memcpies are handled assuming you resolve that error. Consider a VA range whose “chunks” are physically located on multiple GPUs and mapped on only some of them. In order to complete the memcpy, the driver will look for the common subset of devices that are able to access the full VA range and also have a context specified by the memcpy operation (either the currently set context, the context associated with the stream passed in, or the context associated with the memory operands), and it will perform the best memcpy operation it can, utilizing whatever hardware features are available, like asynchronous copy engines, or launching a memcpy kernel on the SMs, just like standard cuMemAlloc memory. If there are multiple options available, it is implementation-defined which device/context is picked to actually perform the copy, but the ultimate fallback is usually to use the current device set via cuCtxSetCurrent / cudaSetDevice, or to return an error.

Is my understanding of why cuMemcpyDtoDAsync fails correct? If I managed to enable peer access, would that make the API work? Would connecting my two GPUs with NVLink enable peer access?

Peer access support is platform dependent (for example, some GPUs don’t support PCIe peer access, some only NVLink), and you can query whether it is supported between two devices via the Peer Context Memory Access APIs (for example, cuDeviceCanAccessPeer). More information on setting up your system for peer access can be found in our programming guide.

Phew, that was a long post. I think I answered all the questions, please let me know if there’s anything I missed!