Introducing Low-Level GPU Virtual Memory Management

Thank you for the very detailed response, I appreciate it. It certainly enriched my understanding.

The NVLink bridge I ordered arrived today. It seems I am now able to use cuMemSetAccess to make cuMemcpyDtoDAsync work (cuMemSetAccess failed with CUDA_ERROR_INVALID_DEVICE beforehand).

— Omri

Glad to help, I hope you’re happy with your purchase!

Yeah, in this case, cuMemSetAccess will fail with CUDA_ERROR_INVALID_DEVICE if the physical location of any of the mapped allocations in the requested VA range cannot be mapped by that device. You can check this easily at runtime: query the memory's attributes with either cuPointerGetAttribute or cuMemGetAllocationPropertiesFromHandle to retrieve the device ordinal of the source allocation, then use cuDeviceGetP2PAttribute with the source and target devices to see whether P2P is supported.
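
For illustration, a minimal sketch of that check followed by the cuMemSetAccess call that grants access (error handling omitted; the names ptr, size, and dstDev are assumptions standing in for your mapped VA range and the device you want to give access to):

    // Hedged sketch: verify that dstDev can map the allocation backing `ptr`
    // before granting it access with cuMemSetAccess.
    #include <cuda.h>

    CUresult grantAccessIfPossible(CUdeviceptr ptr, size_t size, CUdevice dstDev)
    {
        // Which device ordinal owns the physical allocation mapped at `ptr`?
        int srcOrdinal = 0;
        cuPointerGetAttribute(&srcOrdinal, CU_POINTER_ATTRIBUTE_DEVICE_ORDINAL, ptr);

        CUdevice srcDev;
        cuDeviceGet(&srcDev, srcOrdinal);

        // If the memory lives on another device, peer access must be supported;
        // otherwise cuMemSetAccess would fail with CUDA_ERROR_INVALID_DEVICE.
        if (srcDev != dstDev) {
            int p2pSupported = 0;
            cuDeviceGetP2PAttribute(&p2pSupported,
                                    CU_DEVICE_P2P_ATTRIBUTE_ACCESS_SUPPORTED,
                                    srcDev, dstDev);
            if (!p2pSupported)
                return CUDA_ERROR_INVALID_DEVICE;
        }

        // Grant dstDev read/write access to the range, e.g. before cuMemcpyDtoDAsync.
        CUmemAccessDesc desc = {};
        desc.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
        desc.location.id   = dstDev;
        desc.flags         = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
        return cuMemSetAccess(ptr, size, &desc, 1);
    }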

Let us know if you have any more questions :)

Hey again Cory,
I’ve been having additional issues, wondering if you could shed some light.

I’ve been trying to understand performance issues that I’m having, I’ve created a simple program to debug the essence of it.
This program (which is attached) performs cuMemMap and cuMemUnmap in a for loop in a few threads.
If a single thread is running the average runtime for cuMemMap and cuMemUnmap is about 50us, while few threads are running the average runtime increases and sometimes even reaches 18ms!
I would like to ask if there are any reasons for that to happen and besides running as a single thread is there a way to avoid those runtime spikes.

— Omri
main.cpp (1.8 KB)
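
A minimal sketch of the kind of stress loop described above (this is not the attached main.cpp; the thread and iteration counts are illustrative and error handling is omitted):

    // Hedged sketch: several threads, each with its own context on device 0,
    // repeatedly map and unmap a small physical allocation and report spikes.
    #include <cuda.h>
    #include <chrono>
    #include <cstdio>
    #include <thread>
    #include <vector>

    static void worker(CUdevice dev, int iterations)
    {
        CUcontext ctx;
        cuCtxCreate(&ctx, 0, dev);  // one context per thread, same device

        // Device-local physical memory, sized to the minimum granularity.
        CUmemAllocationProp prop = {};
        prop.type          = CU_MEM_ALLOCATION_TYPE_PINNED;
        prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
        prop.location.id   = dev;

        size_t granularity = 0;
        cuMemGetAllocationGranularity(&granularity, &prop,
                                      CU_MEM_ALLOC_GRANULARITY_MINIMUM);

        CUmemGenericAllocationHandle handle;
        cuMemCreate(&handle, granularity, &prop, 0);

        CUdeviceptr ptr;
        cuMemAddressReserve(&ptr, granularity, 0, 0, 0);

        for (int i = 0; i < iterations; ++i) {
            // Time one map/unmap pair and report anything unusually slow.
            auto t0 = std::chrono::steady_clock::now();
            cuMemMap(ptr, granularity, 0, handle, 0);
            cuMemUnmap(ptr, granularity);
            auto t1 = std::chrono::steady_clock::now();
            double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
            if (us > 1000.0)
                std::printf("map+unmap spike: %.0f us\n", us);
        }

        cuMemAddressFree(ptr, granularity);
        cuMemRelease(handle);
        cuCtxDestroy(ctx);
    }

    int main()
    {
        cuInit(0);
        CUdevice dev;
        cuDeviceGet(&dev, 0);

        const int kThreads = 4, kIterations = 1000;  // illustrative values
        std::vector<std::thread> threads;
        for (int i = 0; i < kThreads; ++i)
            threads.emplace_back(worker, dev, kIterations);
        for (auto &t : threads)
            t.join();
        return 0;
    }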

Hey Omri, welcome back!

So, looking through your sample application, I see you're manipulating the address space of different contexts, but the same device. The CUDA Virtual Memory Management APIs manipulate the process-wide unified VA space, so you're going to get lock contention from each of those threads, both in manipulating the UVA space and in updating the device's own page tables. The locks protecting these internal resources aren't fair either, so the scaling wouldn't be a straight N * 50 us (where N is the number of threads), since a thread can release the lock and re-acquire it in any of the other calls that manipulate the VA space for that device.

That said, cuMemMap itself should be fairly fast barring lock contention. cuMemSetAccess and cuMemUnmap (assuming there is a READWRITE mapping on the requested address range) would be the costly APIs, as those interact with the OS to perform the page-table updates.

Hope this helps!

Hey Cory, thanks for the detailed response.
It seems reasonable that lock contention or OS interaction might be the problem here.
We’re still having trouble solving the problem at hand. Do you happen to know if NVidia offers paid consultancy on these CUDA-internal issues?

— Omri

Hey Omri,

We haven’t forgotten about you, don’t worry! One thing you can do is use any one of our profiling tools distributed with the CUDA Toolkit to profile your application. This should show in the timeline view any locks taken, even internal driver ones, which should give an understanding if there’s any lock contention.

As to paid consultancy on these issues, we’re talking with some folks and should be able to provide some information off this thread shortly.

Hey Cory. Thanks for remembering us.

My team and I tried using NVIDIA's profiling tools, as you suggested. We've used various tools and narrowed down the causes of the spikes we're experiencing.

  1. cuMemUnmap and cuMemSetAccess do seem to hang in pthread_mutex_lock, which could cause starvation in the code example I gave above (given that Linux mutexes are unfair). However, in our actual use case most threads are not busy 100 percent of the time. Our internal calculations indicate that in our actual use case starvation wouldn't be longer than 6.5 ms, yet we experience jitter of up to 15 ms.
  2. We profiled lock contention and ioctl times inside our framework. We hooked pthread and libc's ioctl, and logged every case where acquiring a lock or performing an ioctl took more than 1 ms (see the sketch after this list). We saw a direct correlation between long lock-acquisition times and long ioctls. As I see it, this means long ioctls to NVIDIA's driver made other API calls block, which is reasonable.
    Specifically, we noticed that three ioctls sometimes cross the 1 ms threshold (and go up to 15 ms):
    0x21 and 0x22 to /dev/nvidia-uvm (cuMemSetAccess, cuMemUnmap);
    0xc020462a to /dev/nvidiactl (this one is called from cuDeviceGetAttribute, which is called from libcufft).
  3. Furthermore, we noticed that if we run nvidia-smi while our program runs, we instantly experience contention.
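
A hedged sketch of the kind of ioctl hook described in item 2, assuming Linux/glibc: build it as a shared object and load it with LD_PRELOAD; a pthread_mutex_lock hook would follow the same interposition pattern.

    // Hedged sketch of an ioctl-timing interposer (not our actual hook).
    // Build: g++ -shared -fPIC ioctl_hook.cpp -o ioctl_hook.so -ldl
    // Run:   LD_PRELOAD=./ioctl_hook.so ./your_app
    #include <cstdarg>
    #include <cstdio>
    #include <ctime>
    #include <dlfcn.h>

    extern "C" int ioctl(int fd, unsigned long request, ...)
    {
        // Read the optional argument as a pointer, which covers the common case.
        va_list ap;
        va_start(ap, request);
        void *arg = va_arg(ap, void *);
        va_end(ap);

        // Resolve the real ioctl the first time we're called.
        using ioctl_fn = int (*)(int, unsigned long, void *);
        static ioctl_fn real_ioctl =
            reinterpret_cast<ioctl_fn>(dlsym(RTLD_NEXT, "ioctl"));

        timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        int ret = real_ioctl(fd, request, arg);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
                    (t1.tv_nsec - t0.tv_nsec) / 1e6;
        if (ms > 1.0)  // the 1 ms threshold mentioned above
            std::fprintf(stderr, "slow ioctl: fd=%d req=0x%lx %.2f ms\n",
                         fd, request, ms);
        return ret;
    }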

As usual, we have several questions:

  • Is there any reason to repeatedly call cuDeviceGetAttribute? Are any of the values subject to change between subsequent calls? Why not cache these values?
  • Is there any way you can think of to reduce the jitter of the ioctls performed by the virtual memory APIs? What exactly do these ioctls do? Perhaps we can change some Linux configuration to make them more stable.

— Omri

This is great information, thank you!

Is there any reason to repeatedly call cuDeviceGetAttribute? Are any of the values subject to change between subsequent calls? Why not cache these values?

There are some dynamic values returned by cuDeviceGetAttribute; unfortunately, not all of them are static. While a few were needlessly dynamic and have been fixed in recent drivers, others like CU_DEVICE_ATTRIBUTE_CLOCK_RATE, CU_DEVICE_ATTRIBUTE_MEMORY_CLOCK_RATE, or CU_DEVICE_ATTRIBUTE_KERNEL_EXEC_TIMEOUT can change over time. Additionally, querying the amount of free memory via cuMemGetInfo() also requires an ioctl (and it's highly recommended not to rely on this in production software).
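
A hedged sketch of application-side caching, assuming the repeated queries come from your own code rather than from inside libcufft; the "dynamic" list below is only an assumption based on the attributes named above.

    // Hedged sketch: cache cuDeviceGetAttribute results in the application,
    // except for attributes treated as dynamic (this list is an assumption).
    #include <cuda.h>
    #include <map>
    #include <mutex>
    #include <utility>

    int deviceAttributeCached(CUdevice_attribute attrib, CUdevice dev)
    {
        static std::map<std::pair<CUdevice, CUdevice_attribute>, int> cache;
        static std::mutex cacheMutex;

        const bool dynamic =
            attrib == CU_DEVICE_ATTRIBUTE_CLOCK_RATE ||
            attrib == CU_DEVICE_ATTRIBUTE_MEMORY_CLOCK_RATE ||
            attrib == CU_DEVICE_ATTRIBUTE_KERNEL_EXEC_TIMEOUT;

        if (!dynamic) {
            std::lock_guard<std::mutex> lock(cacheMutex);
            auto it = cache.find({dev, attrib});
            if (it != cache.end())
                return it->second;  // served from the cache, no ioctl
        }

        int value = 0;
        cuDeviceGetAttribute(&value, attrib, dev);  // goes to the driver

        if (!dynamic) {
            std::lock_guard<std::mutex> lock(cacheMutex);
            cache[{dev, attrib}] = value;
        }
        return value;
    }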

Is there any way you can think of to reduce the jitter of the ioctls performed by the virtual memory APIs? What exactly do these ioctls do? Perhaps we can change some Linux configuration to make them more stable.

These APIs require OS interaction to perform the mapping operation. What you're effectively asking is equivalent to asking "can the Linux kernel reduce the jitter of the mmap() function call?". In both cases, the OS takes shortcuts based on the available (and sometimes cached) resources on hand, and the APIs scale with the size of the request. Finally, there's the issue of the ioctls being serialized via locks as well, which we are improving on in later drivers, but I don't expect this to be fully parallelized in the near future. The shorter answer is: we are always working on improving performance and reducing jitter in our API calls, but I wouldn't expect the jitter to be eliminated, or even reduced significantly, in the near future.

Furthermore, we noticed that if we run nvidia-smi while our program runs, we instantly experience contention.

Yes, this is a known issue, and it is not specific to the CUDA Virtual Memory Management APIs. It isn't just nvidia-smi: any application that accesses the kernel-mode driver in various ways (e.g., another CUDA application) will take some shared kernel-mode locks and be serialized with other processes. The places where these shared locks are taken are being reduced in later drivers, but there are a great many such places.

Hope this helps!

Hey, thank you very much for your insightful answer.

These APIs require OS interaction to perform the mapping operation. What you're effectively asking is equivalent to asking "can the Linux kernel reduce the jitter of the mmap() function call?".

This is, in fact, exactly what I'm asking. These API calls are opaque to me; I understand they require OS interaction, but I'm not sure exactly which OS APIs they use. The fact that they interact with the operating system's VM (i.e., mmap) is great info. If you could further elaborate on how they interact with the OS, it would be incredible. We could try to fine-tune our OS to reduce this jitter. We're currently looking at such solutions, but we're doing so blindly.

Finally, there’s the issue of the ioctls being serialized via locks as well which we are improving on in later drivers, but I don’t expect this to be fully parallelized in the near future.

As I’ve said in my previous comment, I don’t think this is the issue we’re experiencing. Sure, in the sample application this should be a bottleneck - but in our actual use-case this shouldn’t be a problem, let alone the problem.
If this was an issue in our actual use-case, I would expect to see jittering in various other ioctls, not only these three.

— Omri

If you could further elaborate on how they interact with the OS, it would be incredible. We could try to fine-tune our OS to reduce this jitter. We're currently looking at such solutions, but we're doing so blindly.

So, the ioctls involved on your platform in cuMemSetAccess (and in this case cuMemUnmap, which internally calls cuMemSetAccess(PROT_NONE) to unmap the memory) are the UVM ioctls, which are open source (look for UvmMapExternalMemory after extracting the NVIDIA display driver package; the full source of this part of the kernel-mode driver is readily available), so you're free to inspect these paths yourself. Unfortunately, this path in particular ends up jumping into the proprietary binary blob, for which there's not much insight I can give in a public forum.

If this were an issue in our actual use case, I would expect to see jitter in various other ioctls, not only these three.

It really depends on which other ioctls you're looking at, as some ioctls have different locking semantics than others (some have read/write semantics, others are plainly exclusive). The fact that you said nvidia-smi makes your application slower is an indication that this could be your issue, but I can't be sure without looking at your use case directly and/or seeing a timeline profile.

You had asked about paid consulting before, and that might be the direction you want to take in order to have someone dedicated look into your particular use case and see what we can do to help. I've sent you a direct message with a contact you can use to describe your problem in more detail to a representative better suited to guide your application development. I'll also be in touch with them to see if we can't resolve your issue, so we can hopefully post the solution for others to enjoy.

Hey Cory,
Sorry for the late response. You’ve been extremely helpful. I’ll contact the representative.

— Omri

Hey @jmak, I reached out but received an error from your Exchange server; the message couldn't be delivered to a specific mail address (because it's restricted). I'm not sure if my email arrived properly.