Introducing Low-Level GPU Virtual Memory Management

Thank you for the very detailed response, I appreciate it. It certainly enriched my understanding.

The NVLink I ordered arrived today. It seems I’m able to use cuMemSetAccess to make cuMemcpyDtoDAsync work (cuMemSetAccess failed with CUDA_ERROR_INVALID_DEVICE beforehand).

— Omri

Glad to help, I hope you’re happy with your purchase!

Yeah, in this case cuMemSetAccess will fail with CUDA_ERROR_INVALID_DEVICE if the physical location of any allocation mapped in the requested VA range cannot be mapped by that device. You can check this easily at runtime by querying the attributes of the memory with either cuPointerGetAttribute or cuMemGetAllocationPropertiesFromHandle, retrieving the device ordinal of the source allocation, and then using cuDeviceGetP2PAttribute with the source device and the target device to see whether P2P is supported.
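
A minimal sketch of that check might look something like the following (canMapOnDevice is just a placeholder name here, and error handling is trimmed down):

#include <cuda.h>

// Sketch: can `dstDev` be granted access to the allocation backing `ptr`?
static bool canMapOnDevice(CUdeviceptr ptr, CUdevice dstDev) {
    int srcOrdinal = 0;
    if (cuPointerGetAttribute(&srcOrdinal,
                              CU_POINTER_ATTRIBUTE_DEVICE_ORDINAL,
                              ptr) != CUDA_SUCCESS)
        return false;

    CUdevice srcDev;
    if (cuDeviceGet(&srcDev, srcOrdinal) != CUDA_SUCCESS)
        return false;

    if (srcDev == dstDev)
        return true;   // same device, always mappable

    int p2p = 0;
    cuDeviceGetP2PAttribute(&p2p, CU_DEVICE_P2P_ATTRIBUTE_ACCESS_SUPPORTED,
                            srcDev, dstDev);
    return p2p != 0;   // only call cuMemSetAccess for dstDev if this is true
}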

Let us know if you have any more questions :)

Hey again Cory,
I’ve been having additional issues, wondering if you could shed some light.

I’ve been trying to understand some performance issues I’m having, so I created a simple program that captures the essence of it.
The program (attached) performs cuMemMap and cuMemUnmap in a for loop across a few threads.
With a single thread running, the average runtime of cuMemMap and cuMemUnmap is about 50 us; with a few threads running, the average runtime increases and sometimes even reaches 18 ms!
I’d like to ask whether there is any reason for this to happen and, other than running a single thread, whether there is a way to avoid those runtime spikes.
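
In essence, each thread runs something like the following (a trimmed-down sketch, not the exact attached file; error checking and the timing code are omitted):

#include <cuda.h>

void threadBody(CUdevice dev, int iterations) {
    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);               // each thread gets its own context

    CUmemAllocationProp prop = {};
    prop.type          = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id   = dev;

    size_t gran = 0;
    cuMemGetAllocationGranularity(&gran, &prop,
                                  CU_MEM_ALLOC_GRANULARITY_MINIMUM);

    CUmemGenericAllocationHandle handle;
    cuMemCreate(&handle, gran, &prop, 0);

    CUdeviceptr va;
    cuMemAddressReserve(&va, gran, 0, 0, 0);

    for (int i = 0; i < iterations; ++i) {
        cuMemMap(va, gran, 0, handle, 0);    // timed in the real program
        cuMemUnmap(va, gran);                // timed in the real program
    }

    cuMemAddressFree(va, gran);
    cuMemRelease(handle);
    cuCtxDestroy(ctx);
}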

— Omri
main.cpp (1.8 KB)

Hey Omri, welcome back!

So, looking through your sample application, I see you’re manipulating the address space of different contexts, but on the same device. The CUDA Virtual Memory Management APIs manipulate the process-wide unified VA space, so you’re going to get lock contention from each of those threads, both in manipulating the UVA space and in the device’s own page tables. The locks protecting these internal resources aren’t fair either, so it wouldn’t be a straight N * 50 us (where N is the number of threads), since a thread can release the lock and re-acquire it in any of the other calls manipulating the VA space for that device.

That said, cuMemMap itself should be fairly fast barring lock contention. cuMemSetAccess and cuMemUnmap (assuming there is a READWRITE mapping on the address range requested) would be the costly APIs, as those interact with the OS to perform the page table updates.

Hope this helps!

Hey Cory, thanks for the detailed response.
It seems reasonable that lock contention or OS interaction might be the problem here.
We’re still having trouble solving the problem at hand. Do you happen to know if NVIDIA offers paid consultancy on these CUDA-internal issues?

— Omri

Hey Omri,

We haven’t forgotten about you, don’t worry! One thing you can do is use any of the profiling tools distributed with the CUDA Toolkit to profile your application. The timeline view should show any locks taken, even internal driver ones, which should give you an idea of whether there’s any lock contention.

As to paid consultancy on these issues, we’re talking with some folks and should be able to provide some information off this thread shortly.

Hey Cory. Thanks for remembering us.

My team and I tried using NVIDIA’s profiling tools, as you suggested. We used various tools and narrowed down the causes of the spikes we’re experiencing.

  1. cuMemUnmap and cuMemSetAccess do seem to block in pthread_mutex_lock, which could cause starvation in the code example I gave above (given that Linux mutexes are unfair). However, in our actual use case most threads are not busy 100 percent of the time. Our internal calculations indicate that in our actual use case starvation shouldn’t exceed 6.5 ms, yet we experience jitters of up to 15 ms.
  2. We profiled lock contention and ioctl times inside our framework. We hooked pthread and libc’s ioctl and logged every case where acquiring a lock or performing an ioctl took more than 1 ms (a sketch of the hook appears after this list). We saw a direct correlation between long lock times and long ioctls. As I see it, this means long ioctls into NVIDIA’s driver made other APIs block, which is reasonable.
    Specifically, we noticed that three ioctls sometimes cross the 1 ms threshold (and go as high as 15 ms):
    0x21 and 0x22 to /dev/nvidia-uvm (cuMemSetAccess, cuMemUnmap).
    0xc020462a to /dev/nvidiactl (this one is called from cuDeviceGetAttribute, which is called from libcufft).
  3. Furthermore, we noticed that if we run nvidia-smi while our program is running, we instantly experience contention.
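
For completeness, the ioctl hook looks roughly like this (a simplified sketch; the real hook also wraps pthread_mutex_lock and resolves the fd back to its /dev path). It is built as a shared object (e.g. g++ -shared -fPIC hook.cpp -o hook.so -ldl) and loaded via LD_PRELOAD:

#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <dlfcn.h>
#include <stdarg.h>
#include <stdio.h>
#include <time.h>

static long long now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

// Interposed ioctl: forwards to the real libc ioctl and logs slow calls.
extern "C" int ioctl(int fd, unsigned long request, ...) {
    static int (*real_ioctl)(int, unsigned long, ...) =
        (int (*)(int, unsigned long, ...))dlsym(RTLD_NEXT, "ioctl");

    va_list ap;
    va_start(ap, request);
    void *arg = va_arg(ap, void *);
    va_end(ap);

    long long start = now_ns();
    int ret = real_ioctl(fd, request, arg);
    long long elapsed = now_ns() - start;

    if (elapsed > 1000000LL)   // anything slower than 1 ms gets logged
        fprintf(stderr, "slow ioctl 0x%lx on fd %d: %.2f ms\n",
                request, fd, elapsed / 1e6);
    return ret;
}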

As usual, we have several questions:

  • Is there any reason to repeatedly call cuDeviceGetAttribute? Are any of the values subject to change between subsequent calls? Why not cache these values?
  • Is there any way you can think of to reduce the jitter of the ioctls performed by the virtual memory APIs? What exactly do these ioctls do? Perhaps we can change some Linux configuration to make them more stable.

— Omri

This is great information, thank you!

Is there any reason to repeatedly call cuDeviceGetAttribute? Are any of the values subject to change between subsequent calls? Why not cache these values?

There are some dynamic values returned by cuDeviceGetAttribute; unfortunately, not all of them are static. While a few were needlessly dynamic and have been fixed in recent drivers, others like CU_DEVICE_ATTRIBUTE_CLOCK_RATE, CU_DEVICE_ATTRIBUTE_MEMORY_CLOCK_RATE, or CU_DEVICE_ATTRIBUTE_KERNEL_EXEC_TIMEOUT can change over time. Additionally, querying the amount of free memory available via cuMemGetInfo() also requires an ioctl (and it’s highly recommended not to rely on it in production software).
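
That said, if the attributes your application needs happen to be static ones (SM count, compute capability, and so on), nothing stops you from caching them on your side. A rough sketch of that idea (the struct and names here are just illustrative):

#include <cuda.h>

struct CachedDeviceInfo {
    int smCount;
    int ccMajor;
    int ccMinor;
};

// Query the attributes that never change for a given device exactly once.
static CachedDeviceInfo queryStaticAttributes(CUdevice dev) {
    CachedDeviceInfo info = {};
    cuDeviceGetAttribute(&info.smCount,
                         CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT, dev);
    cuDeviceGetAttribute(&info.ccMajor,
                         CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, dev);
    cuDeviceGetAttribute(&info.ccMinor,
                         CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, dev);
    return info;
}

// By contrast, attributes like CU_DEVICE_ATTRIBUTE_CLOCK_RATE can change
// between calls, so re-query them whenever the current value actually matters.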

Is there any way you can think of to reduce the jitter of the ioctls performed by the virtual memory APIs? What exactly do these ioctls do? Perhaps we can change some Linux configuration to make them more stable.

These APIs require OS interaction to perform the mapping operation. What you’re effectively asking is equivalent to asking “can the Linux kernel reduce the jitter of the mmap() function call?”. In both cases, the OS takes shortcuts based on the available (and sometimes cached) resources on hand, and the APIs scale with the size of the request. Finally, there’s the issue of the ioctls being serialized via locks, which we are improving in later drivers, but I don’t expect this to be fully parallelized in the near future. The shorter answer is: we are always working on improving performance and reducing jitter in our API calls, but I wouldn’t expect it to be eliminated, or even reduced significantly, in the near future.

Furthermore, we noticed that if we run nvidia-smi while our program is running, we instantly experience contention.

Yes, this is a known issue that is not specific to the CUDA Virtual Memory Management APIs. It isn’t just nvidia-smi: any application that accesses the kernel mode driver in various ways (e.g. another CUDA application) will take some shared kernel mode locks and be serialized with other processes. The places where these shared locks are taken are being reduced in later drivers, but there are a great many such places.

Hope this helps!

Hey, thank you very much for your insightful answer.

These APIs require OS interaction to perform the mapping operation. What you’re effectively asking is equivalent to asking “can the Linux kernel reduce the jitter of the mmap() function call?”.

This is, in fact, exactly what I’m asking. These API calls are opaque to me; I understand they require OS interaction, but I’m not sure exactly which OS APIs they use. The fact that they actually interact with the operating system’s VM (i.e. mmap) is great info. If you could further elaborate on how they interact with the OS, it would be incredible. We could try to fine-tune our OS to reduce this jitter. We’re currently looking at such solutions, but we’re doing so blindly.

Finally, there’s the issue of the ioctls being serialized via locks, which we are improving in later drivers, but I don’t expect this to be fully parallelized in the near future.

As I said in my previous comment, I don’t think this is the issue we’re experiencing. Sure, in the sample application this should be a bottleneck, but in our actual use case it shouldn’t be a problem, let alone the problem.
If this were an issue in our actual use case, I would expect to see jitter in various other ioctls, not only these three.

— Omri

If you could further elaborate on how they interact with the OS, it would be incredible. We could try to fine-tune our OS to reduce this jitter. We’re currently looking at such solutions, but we’re doing so blindly.

So, the ioctls involved on your platform in cuMemSetAccess (and in this case cuMemUnmap, which internally calls cuMemSetAccess(PROT_NONE) to unmap the memory) are the UVM ioctls, which are open source (look for UvmMapExternalMemory when extracting the NVIDIA display driver package; the full source of this part of the kernel mode driver is readily available), so you’re free to inspect these paths yourself. Unfortunately, this particular path ends up jumping into the proprietary binary blob, for which there’s not much insight I can give in a public forum.

If this were an issue in our actual use case, I would expect to see jitter in various other ioctls, not only these three.

It really depends on which other ioctls you’re looking at, as some ioctls have different locking semantics than others (some have read/write semantics, others are plain exclusive). The fact that you said nvidia-smi makes your application slower is an indication that this could be your issue, but I can’t be sure without looking at your use case directly and/or seeing a timeline profile.

You had asked about paid consulting before, and that might be the direction you want to take in order to have someone dedicated to looking into your particular use case and seeing what we can do to help. I’ve sent you a direct message with a contact you can use to describe your problem in more detail to a representative better suited to guide your application development. I’ll also be in touch with them and see if we can’t resolve your issue so we can hopefully post the solution for others to enjoy.

Hey Cory,
Sorry for the late response. You’ve been extremely helpful. I’ll contact the representative.

— Omri

Hey @jmak, I reached out but received an error from your Exchange server. The message couldn’t be delivered to a specific mail address (because it’s restricted). I’m not sure if my email arrived properly.

Hi!

Is there support for these APIs in Windows systems as well? I’m having trouble with mine.
I’m using Windows Server 2022 with a single Tesla T4 GPU and NVIDIA driver version 516.01 installed.

At first I saw that the API cuMemAddressReserve succeeds but returns the address 0x0.
I then encountered the device attribute CU_DEVICE_ATTRIBUTE_VIRTUAL_ADDRESS_MANAGEMENT_SUPPORTED and querying the device for this attribute returned 0 as if this is not supported by the GPU.

As mentioned above, I’m using a Tesla T4 GPU which should support this feature.
I even verified that querying the same attribute for the same GPU type in a Linux machine returns 1.

I haven’t found any note that these APIs do not support Windows, and it also seems that the memMapIPCDrv CUDA sample supports Windows.

I wonder if this is something related to Windows or the GPU itself, and if I can somehow add support for this in my system.

Thanks in advance!

Hi razrotenberg!

Is there support for these APIs in Windows systems as well?

Yes, but unfortunately only for WDDM driver model GPUs, and your Tesla T4, being a Tesla card, only supports the TCC driver model IIRC (more information about the WDDM/TCC driver models can be found in the link below). This feature is supported on this GPU on native Linux platforms, though, if that’s an option for you. As to when or if this feature will be supported on TCC, I cannot say at this time, but I can say we’re actively looking into some options!

NVIDIA-SMI Documentation - Look for the Driver Model section.
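
If you want to gate the VMM code path at runtime rather than assuming support, a minimal check would be something like this (sketch only):

#include <cuda.h>

// Returns false on configurations that don't support the VMM APIs,
// e.g. a TCC-mode GPU on Windows as discussed above.
static bool vmmSupported(CUdevice dev) {
    int supported = 0;
    cuDeviceGetAttribute(&supported,
                         CU_DEVICE_ATTRIBUTE_VIRTUAL_ADDRESS_MANAGEMENT_SUPPORTED,
                         dev);
    return supported != 0;
}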

Hope this helps, good luck!

Thanks @Cory.Perry for the super fast answer. Much appreciated!

Is there any place where I could see which GPUs support the WDDM driver model? Is there a general rule or should I check per GPU type?
From peeking at the nvidia-smi documentation you sent, it seems that TCC is more for compute-related GPUs while WDDM is more for graphics. Does it mean that any data-center GPU will not support the WDDM driver model?
It also seems that the WDDM driver model has some performance disadvantages compared to TCC.

I really hope you’ll manage to add support for these amazing features to the TCC driver model as well.

Thanks again and I’ll be waiting for any news regarding this in the future!

Thanks @Cory.Perry for the super fast answer. Much appreciated!

No worries, happy to help!

Is there any place where I could see which GPUs support the WDDM driver model?

If you look at the nvidia-smi output, you can see the current driver model. With admin privileges you can change the driver model on some GPUs; IIRC, all “Tesla” branded models do not support changing the driver model, though I don’t recall if that’s entirely accurate. I do know Quadro GPUs support switching driver models, and I believe today’s drivers require the GPU to have a display head (the ability to connect to a screen, though an actual connection to a screen is not required) in order to use WDDM.

It also seems that the WDDM driver model has some performance disadvantages compared to TCC.

Most of the documentation claiming lower performance for WDDM is typically fairly old, in my experience. Current WDDM drivers can achieve performance on par with TCC, and newer CUDA features like WSL support, these VMM APIs, and many more are only available on WDDM. For more up-to-date information on WDDM performance, check out the blog post linked below by one of our lead CUDA Windows engineers. That post is mostly about WSL, but there is information there about WDDM performance as well, and much of the performance benefit (and more) applies to native Windows WDDM too. That all said, WDDM is a very different driver model from what most are used to with TCC and native Linux, and there are some caveats and pitfalls one might run into. Most of these fall outside our CUDA Programming Model as defined (like buffering launches until synchronization rather than submitting them immediately); see the programming guide for more information. I believe there are also a few GTC presentations that go over WDDM performance tips that might be beneficial in this regard.

I really hope you’ll manage to add support for these amazing features to the TCC driver model as well.

Keep an eye out :)

Hope this helps, good luck and let us know if you have any other questions!

Hey Cory!

I’m testing VMM random-copy performance and ran into a strange performance problem when I initialize requestedHandleTypes in CUmemAllocationProp with CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR.

Consider two arrays: I want to read the src array at random indices and copy the values into the dst array.

The test kernel looks like the following:

#include <curand_kernel.h>

// Each thread copies elements from random source indices into its own
// grid-stride slots of dst.
__global__ void kernel(int *src, int *dst, size_t length) {
    size_t grid_size = gridDim.x * blockDim.x;
    size_t idx = threadIdx.x + blockIdx.x * blockDim.x;

    // Per-thread RNG state, seeded with the thread's global index.
    curandState state;
    curand_init(idx, 0, 0, &state);

    for (size_t i = idx; i < length; i += grid_size) {
        size_t j = curand(&state) % length;   // random source index
        dst[i] = src[j];
    }
}

src and dst are 1 GB 1D arrays allocated with the VMM API (a rough sketch of the full allocation path follows the two cases below). Say their CUmemAllocationProp is prop; when initializing prop I set:

Case1: prop.requestedHandleTypes = 0
Case2: prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR
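
The full allocation path is roughly the following (a sketch; only the requestedHandleTypes line differs between the two cases, and the device ordinal and error handling are simplified):

#include <cuda.h>

CUdeviceptr allocate1GiB(int deviceOrdinal, bool posixFdHandle) {
    CUmemAllocationProp prop = {};
    prop.type          = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id   = deviceOrdinal;
    if (posixFdHandle)   // Case 2; Case 1 leaves requestedHandleTypes at 0
        prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR;

    size_t gran = 0;
    cuMemGetAllocationGranularity(&gran, &prop,
                                  CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    size_t size = ((1ULL << 30) + gran - 1) / gran * gran;   // 1 GiB, rounded up

    CUmemGenericAllocationHandle handle;
    cuMemCreate(&handle, size, &prop, 0);

    CUdeviceptr ptr;
    cuMemAddressReserve(&ptr, size, 0, 0, 0);
    cuMemMap(ptr, size, 0, handle, 0);

    CUmemAccessDesc access = {};
    access.location = prop.location;
    access.flags    = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess(ptr, size, &access, 1);

    return ptr;
}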

I find that Case 1 is about 4 times faster than Case 2, but I think they should have the same performance. So my questions are:

  1. How does CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR affect cuMemCreate when allocating?
  2. Why does the random-copy performance degrade so much with the CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR handle type?

Hi User 2944419175!

I’m guessing you don’t hit the same issue with sequential copies? And when you run your random copy through one of our profilers, my guess is that your TLB miss counters are unusually high? If so, then yeah, this is unfortunately a known issue with allocations that need to be importable by other user mode drivers like Vulkan. Some of these require a smaller page size than CUDA’s default internal page size, and due to some driver limitations we don’t support using different page sizes for different mappings of the same allocation. This means CUDA is forced to map with a smaller page size, which puts more pressure on the GPU’s TLB.

The good news is that this should be fixed in an upcoming driver update very soon, so look forward to it! Hope this helps, good luck!

Hey, thanks for your fast answer, much appreciated!

Yes, the problem does not occur with sequential copies.

I looked at the nvprof documentation, but I couldn’t find any metrics for TLB miss counters. I also ran a pointer-chasing test to measure access latency and confirmed that TLB pressure is indeed higher in Case 2: the results show that the first access (every cache miss) to VMM memory with CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR has higher latency than to VMM memory without CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR or to memory allocated by cudaMalloc.
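
The pointer-chasing kernel is roughly the following (a simplified sketch; the real test also records timestamps around the dependent loads):

// Each load depends on the previous one, so the loop serializes on memory
// latency; `sink` keeps the compiler from optimizing the chain away.
__global__ void chase(const unsigned int *next, size_t steps, unsigned int *sink) {
    unsigned int idx = 0;
    for (size_t i = 0; i < steps; ++i)
        idx = next[idx];
    *sink = idx;
}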

Thanks again!

I looked at the nvprof documentation, but I couldn’t find any metrics for TLB miss counters.

I’m sorry, I’m not enough of an expert with nvprof or nsys to direct you to the exact counters, but I believe many of our devblogs should be able to help guide you. If anything, I believe PC sampling and guided analysis should be able to indicate the large access times, give details for some cache analysis, and help you confirm this is the case. That said, based on the fact that sequential access works fine, I would count on TLB thrashing being the issue, unfortunately.

So yes, by requesting a shareable handle type, you’re opting to create this allocation such that it is compatible with other UMDs, and it therefore needs a smaller page size. You will thus likely get higher-latency accesses if you thrash the TLB (with random accesses across 1 GiB, for example) due to the smaller page size. As I mentioned, this should be fixed in an upcoming driver release (I’m not sure which one just yet; I can try to let you know when it lands). Even so, it is always recommended, just as it is with CPUs though much more so with GPUs, to try to align your accesses in order to achieve maximum bandwidth and cache utilization whenever possible.

Hope this helps, please let us know if you have further questions or if we can help further! Good luck!