This is great information, thank you!
Is there any reason to repeatedly call cuDeviceGetAttribute? Are any of the values subject to change between subsequent calls? Why not cache these values?
Unfortunately, not all of the values returned by cuDeviceGetAttribute are static; some are dynamic. A few were needlessly dynamic and have been fixed in recent drivers, but others, such as CU_DEVICE_ATTRIBUTE_CLOCK_RATE, CU_DEVICE_ATTRIBUTE_MEMORY_CLOCK_RATE, or CU_DEVICE_ATTRIBUTE_KERNEL_EXEC_TIMEOUT, can change over time. Additionally, querying the amount of free memory via cuMemGetInfo() also requires an ioctl (and it’s highly recommended not to use this feature in production software).
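To illustrate the caching question above, here is a minimal sketch of a cache that stores values the caller treats as static and always re-queries the ones flagged as dynamic. AttributeCache, its Query callback, and the plain int attribute IDs are hypothetical names made up for this example, not part of CUDA. Which attributes are safe to cache is an assumption the caller has to make for their driver version.

```cpp
#include <cassert>
#include <functional>
#include <unordered_map>
#include <unordered_set>
#include <utility>

// Hypothetical helper, not part of CUDA: caches attribute values the caller
// has declared static and re-queries everything else on every call.
class AttributeCache {
public:
    // `query` fetches one attribute value from the driver.
    using Query = std::function<int(int attribute)>;

    // `dynamic_attrs` lists the attribute IDs that must never be cached.
    AttributeCache(Query query, std::unordered_set<int> dynamic_attrs)
        : query_(std::move(query)), dynamic_(std::move(dynamic_attrs)) {
        assert(query_ && "a query function is required");
    }

    int get(int attribute) {
        if (dynamic_.count(attribute) != 0) {
            return query_(attribute);  // may change over time: always ask
        }
        auto it = cache_.find(attribute);
        if (it != cache_.end()) {
            return it->second;  // cached: no driver call needed
        }
        int value = query_(attribute);  // first request: query once, then keep
        cache_.emplace(attribute, value);
        return value;
    }

private:
    Query query_;
    std::unordered_set<int> dynamic_;
    std::unordered_map<int, int> cache_;
};
```

With the driver API, the query callback would presumably wrap cuDeviceGetAttribute(&value, attribute, device) with error checking, and the dynamic set would contain at least the three attributes named above. The sketch is not thread-safe; guarding get() with a mutex would be needed if several threads share one cache.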
Is there any way you can think of, to reduce the jittering of the ioctls performed by the virtual memory APIs? What exactly do these ioctls do, perhaps we can change some Linux configuration to make them more stable?
These APIs require OS interaction to perform the mapping operation. What you’re asking is effectively equivalent to asking “can the Linux kernel reduce the jitter of the mmap() function call?”. In both cases, the OS takes shortcuts based on the resources it has on hand (some of them cached), and the cost of the APIs scales with the size of the request. There’s also the issue of the ioctls being serialized via locks, which we are improving on in later drivers, but I don’t expect this to be fully parallelized in the near future. The shorter answer is that we are always working on improving performance and reducing jitter in our API calls, but I wouldn’t expect the jitter to be eliminated or even reduced significantly in the near future.
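If it helps to quantify the jitter described above before and after a driver or configuration change, a small timing harness along these lines may be enough. This is only a sketch: measure_latency and LatencyStats are hypothetical names, and the callable you pass in (for example a lambda wrapping a mapping call, or mmap() for comparison) is your own code.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type: per-call latency summary in nanoseconds.
struct LatencyStats {
    std::int64_t min_ns;
    std::int64_t median_ns;
    std::int64_t max_ns;
};

// Times `iterations` invocations of `call` individually with a monotonic
// clock and summarizes the spread of the samples.
template <typename Fn>
LatencyStats measure_latency(Fn&& call, std::size_t iterations) {
    assert(iterations > 0);
    std::vector<std::int64_t> samples;
    samples.reserve(iterations);
    for (std::size_t i = 0; i < iterations; ++i) {
        auto start = std::chrono::steady_clock::now();
        call();
        auto stop = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)
                .count());
    }
    std::sort(samples.begin(), samples.end());
    // Element at size/2 is the upper median for even sample counts.
    return {samples.front(), samples[samples.size() / 2], samples.back()};
}
```

A large gap between median_ns and max_ns is the kind of jitter being discussed. Since the cost scales with the size of the request, it is probably worth measuring each request size separately rather than mixing them in one run.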
Furthermore, we noticed that if we run nvidia-smi while our program runs, we instantly experience contention.
Yes, this is a known issue that is not specific to the CUDA Virtual Memory Management APIs. It isn’t just nvidia-smi: any application that accesses the kernel-mode driver (e.g. another CUDA application) will take some shared kernel-mode locks and be serialized with other processes. The number of places where these shared locks are taken is being reduced in later drivers, but there are a great many such places.
Hope this helps!