Introducing Low-Level GPU Virtual Memory Management

Originally published at: https://developer.nvidia.com/blog/introducing-low-level-gpu-virtual-memory-management/

Figure 1. Example of using the cuMem* CUDA APIs to resize a GPU buffer.

There is a growing need among CUDA applications to manage memory as quickly and as efficiently as possible. Before CUDA 10.2, the number of options available to developers has been limited to the malloc-like abstractions that CUDA provides. CUDA 10.2 introduces…


I’m a beginner with low-level GPU virtual memory management. I worked through the vecAdd example, and I think it wouldn’t hurt to confirm my understanding with you here.

  1. Low-level GPU virtual memory management can avoid the overhead of message-passing libraries such as NVSHMEM. Instead of putting/getting data to and from remote memory, it allows N devices (assume N devices on one node) to access the same virtual memory. Would that be correct?

  2. I’m thinking about how to refactor an existing MPI code to use low-level GPU virtual memory management, and I’m confused about how to do it. If I still want to keep the MPI structure in the code, say one MPI rank per GPU, but use low-level GPU virtual memory management instead of MPI for the data movement, is that feasible? I simply tried running vectorAddMMAP under MPI, but it just repeated the same work N times when launched with N ranks. Any suggestions for using low-level GPU virtual memory management in an MPI code?

Hello Nan1215, thanks for commenting! I’ve replied to each of your questions below, respectively:

  1. Even without the CUDA Virtual Memory Management APIs, you can map remote device memory (a.k.a. peer memory) from the same node and access it directly through a pointer via the runtime API cudaEnablePeerAccess (but note the pain points with this call described in the blog post). From the device, both approaches map peer memory the same way, and I believe NVSHMEM internally does the same thing when the memory is on the same node. That said, no CUDA APIs currently support multi-node access, which is where NVSHMEM scales beyond what CUDA provides.

  2. The vectorAddMMAP sample doesn’t do any cross-process communication the way an MPI application would. You could modify vectorAddMMAP to limit the devices used based on the MPI rank, but that would just replicate the same app across different GPUs. If you’re looking for inter-process communication, take a look at the memMapIPCDrv CUDA sample. Also, I believe CUDA-Aware OpenMPI intends to support (if it doesn’t already) memory allocated with the CUDA Virtual Memory Management APIs, so if your application leverages CUDA-Aware OpenMPI you may not need application changes to take advantage of that support; it may, however, require a certain version of CUDA-Aware OpenMPI (I’m not sure exactly which version; I can find out if you need).
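For context, here is a rough sketch of the export/import pattern the memMapIPCDrv sample demonstrates, in case it helps with planning the refactor. Error checking, context setup, and the transport of the file descriptor between ranks (which needs a Unix-domain socket or similar, since fd numbers are process-local) are elided; the helper names and the Linux-only POSIX file descriptor handle type are my own assumptions, not code taken from the sample:

#include <cuda.h>
#include <cstdint>

// Rank A: create an allocation that can be exported to another process.
// `padded` is assumed to already be rounded up to the granularity reported
// by cuMemGetAllocationGranularity.
int exportAllocation(size_t padded, CUmemGenericAllocationHandle *handle) {
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = 0;                               // owning device
    prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR;
    cuMemCreate(handle, padded, &prop, 0);

    int fd = -1;                                        // OS handle to ship to the other rank
    cuMemExportToShareableHandle(&fd, *handle, CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR, 0);
    return fd;
}

// Rank B: import the received handle and map it into its own address space.
CUdeviceptr importAndMap(int fd, size_t padded, int device) {
    CUmemGenericAllocationHandle handle;
    cuMemImportFromShareableHandle(&handle, (void *)(uintptr_t)fd,
                                   CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR);

    CUdeviceptr ptr;
    cuMemAddressReserve(&ptr, padded, 0, 0, 0);
    cuMemMap(ptr, padded, 0, handle, 0);

    // Access must be granted explicitly before any kernel touches the mapping.
    CUmemAccessDesc desc = {};
    desc.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    desc.location.id = device;
    desc.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess(ptr, padded, &desc, 1);
    return ptr;
}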

Please let me know if this answers your questions, or if you have additional questions!

Hi Killogge,

Thanks for your explanations. One follow up question: does virtual memory API support the below case:

For example, I have 3 GPUs, and each is assigned 1 MPI rank, so I have 3 MPI ranks in total. Is it feasible to let the 3 MPI ranks access the same virtual memory simultaneously in one CUDA stream?

So, I might be failing to understand your follow-up question; if so, forgive me. As I understand it, with MPI each rank is its own process (either on the same node or across nodes), and CUDA streams themselves are not shareable across process boundaries. That said, on some platforms CUDA does support what we call inter-process events, which can be created via the cudaEventCreate / cuEventCreate APIs. With these types of events you can easily synchronize each rank’s stream using the standard cudaEventRecord / cudaStreamWaitEvent APIs. This is all outside the scope of the CUDA Virtual Memory Management APIs mentioned earlier.
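To make that concrete, here is a minimal sketch of the inter-process event pattern (runtime API, independent of the Virtual Memory Management APIs). How the opaque handle travels between ranks, e.g. via MPI_Send/MPI_Recv, is up to the application, and the helper names are just illustrative:

#include <cuda_runtime.h>

// Rank A: create a shareable event and record it after its producer work.
void rankARecord(cudaStream_t stream, cudaIpcEventHandle_t *handleOut) {
    cudaEvent_t event;
    cudaEventCreateWithFlags(&event, cudaEventDisableTiming | cudaEventInterprocess);
    cudaIpcGetEventHandle(handleOut, event);   // ship *handleOut to rank B
    // ... enqueue producer kernels on `stream` ...
    cudaEventRecord(event, stream);
}

// Rank B: open the handle and make its own stream wait on rank A's work.
void rankBWait(cudaStream_t stream, cudaIpcEventHandle_t handle) {
    cudaEvent_t event;
    cudaIpcOpenEventHandle(&event, handle);
    cudaStreamWaitEvent(stream, event, 0);
    // ... enqueue consumer kernels on `stream` ...
}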

CUDA-Aware OpenMPI will transparently perform the requested MPI operations for you across ranks, and even across nodes if necessary. Check out the link above for CUDA-Aware OpenMPI for an example and more information. As I said, I’m not sure whether the new CUDA Virtual Memory Management APIs are supported in the latest version of CUDA-Aware OpenMPI, but I believe they will be eventually.

As to your particular use case of three ranks all accessing the same physical memory (each rank won’t necessarily get the same virtual address, since they are different processes after all): yes, that is feasible, and it is in fact the main use case for supporting these IPC mechanisms.

I hope this answers your questions, please feel free to ping us back if you have more questions.

Hi,

I am trying to improve an implementation of a FIFO buffer from CPU to GPU and vice versa.
I was wondering how we can use Virtual Memory Management and Unified Memory (with the features of compute capability 7.x) at the same time, as it seems not possible at the moment. Does the PINNED memory type or the limited set of options in cuMemSetAccess prevent the CPU from accessing the data without using memcpy?

Thanks

Hi bloch.aurelien,

Unfortunately, the Virtual Memory Management APIs don’t currently support Unified Memory or CPU memory. We do have plans to support this eventually, and yes, fields like type and location in CUmemAllocationProp were meant to make these kinds of decisions explicit so that the APIs can be extended later.

That said, for your specific use case, one idea might be to use a combination of cudaHostRegister and OS-specific calls to manage the CPU VAs. Unfortunately, not all platforms will give back the same device VA for a CPU VA (see cudaHostRegister / cuMemHostRegister for more information), but support for this can be queried.
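To illustrate, here is a rough sketch of that workaround (Linux assumed; the helper name and alignment choice are just for illustration): register an existing CPU allocation, then check whether the device can use the host pointer directly or needs a separate device pointer:

#include <cuda_runtime.h>
#include <stdlib.h>

void *registerForGpu(size_t size, int device, void **devPtrOut) {
    // Allocate page-aligned CPU memory with an OS-level allocator.
    void *hostPtr = nullptr;
    posix_memalign(&hostPtr, 4096, size);

    // Pin and map the range so the GPU can access it.
    cudaHostRegister(hostPtr, size, cudaHostRegisterMapped);

    // On some platforms the device can use the CPU VA directly; otherwise a
    // separate device VA is handed back for the same physical pages.
    int sameVa = 0;
    cudaDeviceGetAttribute(&sameVa, cudaDevAttrCanUseHostPointerForRegisteredMem, device);
    if (sameVa)
        *devPtrOut = hostPtr;
    else
        cudaHostGetDevicePointer(devPtrOut, hostPtr, 0);
    return hostPtr;
}

Either way, kernels can then read and write the registered range directly (zero-copy over PCIe/NVLink) without an explicit memcpy.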

Hope this helps you out, feel free to ping us back if you have more questions! Happy Holidays!

Hey Cory!

I am experimenting with the new API and had a question about its usage across multiple GPUs.
Considering your striped example: if I have a contiguous virtual memory range and I would like to use cuMemSetAccess to set access for all the mapped devices, should I be able to set access per stripe_size such that the resident device (the device holding the physical memory) gets read-write access and all other (remote) devices get read-only access?

My loops looked like the following, but it doesn’t seem to work:

for (std::size_t r_idx = 0; r_idx < phys.resident_devices.size(); r_idx++) {
        for (std::size_t idx = 0; idx < mapping_devices.size(); idx++) {
                access_descriptors[idx].location.type = CU_MEM_LOCATION_TYPE_DEVICE;
                access_descriptors[idx].location.id = mapping_devices[idx];

                // If the device being mapped to is where the physical memory resides
                // use the read-write access flag, otherwise, use read-only
                if(mapping_devices[idx] == phys.resident_devices[r_idx])
                    access_descriptors[idx].flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
                else
                    access_descriptors[idx].flags = CU_MEM_ACCESS_FLAGS_PROT_READ;
        }

        // XXX: Can I set access for only a portion of the range at a time, but eventually equal to padded_size?
        cuMemSetAccess((CUdeviceptr)virt.ptr + (stripe_size * r_idx),
                        stripe_size, 
                        access_descriptors.data(),
                        access_descriptors.size());
}

Let me know if the right thing to do is to go read the blog again; I read it a while ago and may have missed this. Really cool stuff btw, thank you!

Edit: The docs say the following; can you elaborate on whether this means that what I suggested is not possible and you have to set access for the entire range?

The range must be a fully mapped address range containing all allocations created by cuMemMap / cuMemCreate.

Edit 2: My fault. On two GV100s, accessing a read-only VM range just exited with no apparent errors. The above code worked flawlessly after I stopped accessing read-only memory on the remote devices.

Hi neoblizzz!

Thanks for trying out the new APIs. Yeah, at the moment CU_MEM_ACCESS_FLAGS_PROT_READ, while defined, is not implemented yet and will return CUDA_ERROR_NOT_SUPPORTED if used. We plan to implement support for it very soon, along with some possible performance optimizations for that usage, so stay tuned!

As to mapping only a portion of an allocation (which is what I think you’re asking about in your code snippet), that also is not currently supported, nor is a non-zero offset. This is something we hope to support very soon as well!

If you have any other suggestions or requests for this API, please let us know and we’ll see if we can’t implement them in a future CUDA release. Thanks for your feedback, hope the above helps!


Ah, makes so much more sense now! Thank you for answering my questions. I do have two points of feedback that I came up with right after messing with it for a bit:

  1. Since everyone is familiar with managed memory, something like a read-mostly hint (duplicate on read) would be really nice. I can imitate this with the current APIs by allocating the size of the physical array multiplied by the number of GPUs; that way, each stripe gets the full array, and I have to do a cudaMemcpy of whatever data I want several times to duplicate it manually. But if there were a property for this within the API, much like the hints managed memory already has, that would be really cool!
  2. Will it ever be possible to map and unmap memory within a kernel (or is this off the table because these driver-level calls are host-only)? I am experimenting with some sparse kernels where the output size isn’t known before execution (you can think of SpGEMM as an example), and I would like to map/unmap pre-allocated memory to mimic dynamic allocation inside the kernel as the kernel learns what the output size is going to be.

Again, thank you for answering my questions. These APIs are really really cool. Looking forward to messing with them a bit more and the future updates.

Thanks for the feedback! Let me respond to the items you listed as best I can.

[Summary]: Read duplication support with the CUDA Virtual Memory Management APIs

  1. Interesting idea, sort of like how graphics allocators handle SLI implicitly? Definitely something we’ll keep in mind as new use cases come up. We do plan on supporting managed memory in the future, so this request might come as part of that eventually; we’ll see!

Will it be ever possible to Map and Unmap memory within the kernel…

  1. Unfortunately, these APIs must translate to OS-level system calls that manipulate the GPU virtual address space, which is managed by the operating system, so without some kind of CPU involvement this request would be difficult to implement. There are other concerns as well, but that is the main one I can see. We’ll definitely keep it in mind as we move forward, but I wouldn’t expect it to be readily available any time soon.
    Alternatively, if you’re looking for device-side dynamic allocation of memory, you might consider the device-side malloc() implementation we have today. Recent improvements in scalability and performance have made it a more viable option for use cases like this; the only catch is that you need to move your pre-allocations to the internally managed heap for the device, sized via cuCtxSetLimit (a small sketch follows after this list).
    Another option is a recently released feature, similarly tied to the CUDA Virtual Memory Management APIs described here, called Sparse Textures, which might be of use. While geared more toward the graphics side, it might be something that could be adapted to fit your use case. Unfortunately, I don’t have a sample readily available on the use of these APIs, but if you’re interested we can try to put one together for you.
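For the device-side malloc() route, here is a small sketch of the setup; it uses the runtime-API equivalent of cuCtxSetLimit (cudaDeviceSetLimit with cudaLimitMallocHeapSize), and the kernel and sizes are purely illustrative:

#include <cuda_runtime.h>

__global__ void growOutput(int **slots, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        // Each thread allocates its own output chunk once it knows the size;
        // memory from device-side malloc must later be released with device-side free().
        slots[tid] = static_cast<int *>(malloc(64 * sizeof(int)));
        if (slots[tid]) slots[tid][0] = tid;
    }
}

int main() {
    // Size the per-device heap before the first kernel that allocates is launched;
    // the default heap is small.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 256u << 20);   // 256 MiB

    int **slots = nullptr;
    cudaMalloc(&slots, 1024 * sizeof(int *));
    growOutput<<<4, 256>>>(slots, 1024);
    cudaDeviceSynchronize();
    return 0;
}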

Looking forward to seeing what kinds of things you build with these new features!


Both of these suggestions seem promising. From my past experimentation, at least, it seemed like device-side malloc() was not sufficient to compete with other algorithmic approaches that address sparse problems with unknown output sizes. Sparse Textures I had no idea about; they look interesting. I’ll check these out! Thanks again!