I am currently working with CUDA’s low-level virtual memory management APIs but am encountering performance issues on certain platforms. Specifically, I have observed two main issues:
Platform Variability: These APIs show inconsistent performance across cloud platforms.
Scalability Limitation: The APIs do not scale efficiently with an increased number of threads.
Here are further details on each issue:
1. Platform Variability
I benchmarked the performance of three APIs—cuMemMap, cuMemSetAccess, and cuMemUnmap—on a setup with an A100-80G GPU and CUDA 12.1 on both GCP and Runpod. The latency results are as follows:
Notably, the latencies for cuMemSetAccess and cuMemUnmap are significantly higher on Runpod (by 1.6x–4.7x), and this discrepancy persists on H100-80G SXM as well. The performance on GCP appears more consistent with expectations. I suspect the difference may be due to Runpod’s use of containerized GPU isolation, which itself could be a problem or might limit control over the CUDA and NVIDIA driver versions on the host. Any insights or suggestions on how to mitigate this performance gap would be highly appreciated.
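For anyone who wants to reproduce the measurement, a per-call timing loop for these three APIs can be set up roughly as follows (a minimal sketch, not my exact harness; the single granularity-sized mapping and the iteration count are illustrative):

```cpp
// Minimal sketch: per-call latency of cuMemMap / cuMemSetAccess / cuMemUnmap.
// Error handling is reduced to a simple macro for brevity.
#include <cuda.h>
#include <chrono>
#include <cstdio>

#define DRV(call) do { CUresult r = (call); if (r != CUDA_SUCCESS) { \
    printf("%s failed: %d\n", #call, (int)r); return 1; } } while (0)

int main() {
    DRV(cuInit(0));
    CUdevice dev;  DRV(cuDeviceGet(&dev, 0));
    CUcontext ctx; DRV(cuCtxCreate(&ctx, 0, dev));

    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = dev;

    size_t gran = 0;   // minimum physical allocation / mapping granularity (typically 2 MiB)
    DRV(cuMemGetAllocationGranularity(&gran, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM));

    CUdeviceptr va;    // virtual address range the physical chunk gets mapped into
    DRV(cuMemAddressReserve(&va, gran, 0, 0, 0));

    CUmemAccessDesc acc = {};
    acc.location = prop.location;
    acc.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;

    const int iters = 1000;
    double tMap = 0, tAccess = 0, tUnmap = 0;
    for (int i = 0; i < iters; ++i) {
        CUmemGenericAllocationHandle h;
        DRV(cuMemCreate(&h, gran, &prop, 0));

        auto t0 = std::chrono::steady_clock::now();
        DRV(cuMemMap(va, gran, 0, h, 0));
        auto t1 = std::chrono::steady_clock::now();
        DRV(cuMemSetAccess(va, gran, &acc, 1));
        auto t2 = std::chrono::steady_clock::now();
        DRV(cuMemUnmap(va, gran));
        auto t3 = std::chrono::steady_clock::now();

        DRV(cuMemRelease(h));
        tMap    += std::chrono::duration<double, std::micro>(t1 - t0).count();
        tAccess += std::chrono::duration<double, std::micro>(t2 - t1).count();
        tUnmap  += std::chrono::duration<double, std::micro>(t3 - t2).count();
    }
    printf("avg us: cuMemMap %.1f  cuMemSetAccess %.1f  cuMemUnmap %.1f\n",
           tMap / iters, tAccess / iters, tUnmap / iters);

    DRV(cuMemAddressFree(va, gran));
    return 0;
}
```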
2. Scalability Limitation
To focus on scalability, I ran a parallel benchmark of the same APIs on GCP to avoid the platform variability issue. Here are the results across varying thread counts:
Interestingly, using four threads significantly increases the latency of each API call, resulting in an overall throughput lower than that achieved with a single thread. Is there any parameter adjustment or approach you would recommend to optimize API throughput in multi-threaded contexts?
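For reference, the parallel version is essentially the same per-call loop replicated across threads that share one context, something along these lines (again a minimal sketch rather than my exact benchmark; error checking omitted for brevity):

```cpp
// Minimal sketch of the parallel throughput test: N threads share one context
// and each maps/unmaps its own VA range. Error checking omitted for brevity.
#include <cuda.h>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <vector>

static void worker(CUcontext ctx, CUmemAllocationProp prop, size_t gran, int iters) {
    cuCtxSetCurrent(ctx);                       // shared context, made current in this thread
    CUdeviceptr va;
    cuMemAddressReserve(&va, gran, 0, 0, 0);    // VA range private to this thread
    CUmemAccessDesc acc = {};
    acc.location = prop.location;
    acc.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    for (int i = 0; i < iters; ++i) {
        CUmemGenericAllocationHandle h;
        cuMemCreate(&h, gran, &prop, 0);
        cuMemMap(va, gran, 0, h, 0);
        cuMemSetAccess(va, gran, &acc, 1);
        cuMemUnmap(va, gran);
        cuMemRelease(h);
    }
    cuMemAddressFree(va, gran);
}

int main(int argc, char** argv) {
    int nthreads = (argc > 1) ? atoi(argv[1]) : 4;
    const int iters = 1000;

    cuInit(0);
    CUdevice dev;  cuDeviceGet(&dev, 0);
    CUcontext ctx; cuCtxCreate(&ctx, 0, dev);

    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = dev;
    size_t gran = 0;
    cuMemGetAllocationGranularity(&gran, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);

    auto t0 = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    for (int t = 0; t < nthreads; ++t)
        threads.emplace_back(worker, ctx, prop, gran, iters);
    for (auto& th : threads) th.join();
    double sec = std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();

    printf("%d thread(s): %.0f map/set-access/unmap cycles per second in aggregate\n",
           nthreads, nthreads * iters / sec);
    return 0;
}
```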
These APIs are critical to my project, and I would be very grateful for any guidance or suggestions you can offer.
Many CUDA APIs can suffer in terms of latency in a multi-threaded scenario. This topic is covered in numerous forum posts, as well as in a brief mention in the documentation. To avoid multi-threading lock overhead in the API calls, one approach would be to handle all of your memory allocation work in a single thread.
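For example, one way to arrange that is a dedicated allocator thread that services requests from the other threads through a queue, so that only one thread ever enters the driver's allocation paths. A minimal sketch of that idea (the request/queue structure shown here is just illustrative, not anything provided by CUDA):

```cpp
// Minimal sketch: funnel all VMM/allocation calls through one dedicated thread
// so that worker threads never contend inside the driver's allocation paths.
#include <cuda.h>
#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>

class AllocatorThread {
public:
    explicit AllocatorThread(CUcontext ctx) : ctx_(ctx), worker_([this] { run(); }) {}
    ~AllocatorThread() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_one();
        worker_.join();
    }
    // Submit a piece of allocation/mapping work; blocks until it has executed.
    void submit(std::function<void()> job) {
        std::packaged_task<void()> task(std::move(job));
        auto finished = task.get_future();
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(task)); }
        cv_.notify_one();
        finished.get();
    }
private:
    void run() {
        cuCtxSetCurrent(ctx_);          // all allocation work happens in this thread only
        std::unique_lock<std::mutex> lk(m_);
        while (!done_ || !q_.empty()) {
            cv_.wait(lk, [this] { return done_ || !q_.empty(); });
            while (!q_.empty()) {
                auto task = std::move(q_.front());
                q_.pop();
                lk.unlock();
                task();                 // e.g. cuMemCreate + cuMemMap + cuMemSetAccess
                lk.lock();
            }
        }
    }
    CUcontext ctx_;
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::packaged_task<void()>> q_;
    bool done_ = false;
    std::thread worker_;
};
```

Worker threads would then call something like alloc.submit([&]{ /* cuMemCreate + cuMemMap + cuMemSetAccess */ }); instead of touching those driver APIs directly.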
I’ve not heard of Runpod; however, the hypervisor that may be in use at various cloud platforms can certainly affect the latency of various calls, especially any that involve non-userspace code. It’s possible that differences in hypervisor behavior could show up here. Obviously CUDA and NVIDIA have no direct control over that. You could possibly ask Runpod about it.
The general advice I would offer is to make limited use of memory allocation APIs (so that these issues don’t become a dominant factor in application behavior). Allocate what you need ahead of any performance- or concurrency-sensitive work-issuance loops, then reuse those allocations. This suggestion holds whether you are using a traditional API like cudaMalloc or the VMM APIs.
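In its simplest form the pattern is just this (sketch):

```cpp
// Sketch: hoist allocation out of the performance-sensitive loop and reuse it.
#include <cuda_runtime.h>

void run_steps(size_t maxBatchBytes, int numSteps) {
    void* d_buf = nullptr;
    cudaMalloc(&d_buf, maxBatchBytes);   // allocate once, sized for the worst case
    for (int step = 0; step < numSteps; ++step) {
        // launch the kernels that use d_buf here; no cudaMalloc/cudaFree in this loop
    }
    cudaFree(d_buf);
}
```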
As an alternative or “helper”, employ a memory manager (pool allocator), e.g. RMM, the RAPIDS Memory Manager. I won’t be able to give a tutorial on that here, but it can set up a pool allocator for you. As a lighter-weight option, the stream-ordered memory allocator in CUDA (cudaMallocAsync/cudaFreeAsync) may also provide benefit in certain reuse scenarios (RMM can manage that for you as well, if you wish).
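If the sizes you need tend to recur, use of the stream-ordered allocator looks roughly like this (a minimal sketch; the release-threshold setting is illustrative and can be tuned):

```cpp
// Sketch: stream-ordered allocation with a retained pool, so freed memory is
// reused by later cudaMallocAsync calls instead of being returned to the driver/OS.
#include <cuda_runtime.h>
#include <cstdint>

void example(size_t bytes, int numSteps) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Let the device's default memory pool retain freed memory for reuse.
    cudaMemPool_t pool;
    cudaDeviceGetDefaultMemPool(&pool, 0);
    uint64_t threshold = UINT64_MAX;     // retain everything; tune to taste
    cudaMemPoolSetAttribute(pool, cudaMemPoolAttrReleaseThreshold, &threshold);

    for (int step = 0; step < numSteps; ++step) {
        void* d_buf = nullptr;
        cudaMallocAsync(&d_buf, bytes, stream);  // typically serviced from the pool after the first iteration
        // ... kernels using d_buf on this stream ...
        cudaFreeAsync(d_buf, stream);            // returns the memory to the pool, not the driver
    }
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}
```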
No, they are not exactly the same. The Runpod instance uses an AMD EPYC 7713 64-core CPU with 1TB of memory, while GCP does not disclose the specific CPU model; I can only tell that the GCP instance uses an Intel Xeon CPU and 128GB of memory. But that's a good observation: different CPU models might indeed influence performance. Have you had any experience comparing the impact of different CPUs when used with NVIDIA GPUs?
Thank you Robert for the quick response and information! I reviewed the related document and understand that internal resources within the CUDA runtime/driver can sometimes contend with each other. Since these memory management APIs are important for achieving more flexible memory handling, I was wondering if there are any plans to improve their latency and throughput.
Apologies for not providing more details about Runpod earlier. It is a GPU cloud provider offering relatively cheaper GPU instances compared to GCP and AWS. As far as I know, they don’t use a virtual machine per instance but instead create Docker containers. Since this containerized setup involves no hypervisor or syscall interception, I wonder how it might differ from a VM environment in terms of API performance.
Thank you for suggesting I confirm this issue with them. I’ve already reached out to Runpod to check if this performance issue stems from their infrastructure. I’m awaiting their response and am happy to share my findings if it could benefit the NVIDIA community.
And thank you for the suggestions on using a memory allocator. However, one of the major goals of my project is to dynamically commit and free physical memory, so a memory allocator like RMM may not fit here. I’ve done my best to limit my use of CUDA’s low-level memory management APIs, but due to their low-level nature (e.g., mapping/unmapping pages, setting access permissions), it’s difficult to fundamentally reduce the number of calls. Therefore, it would be great to know if there are any plans to improve the performance of these APIs.
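For reference, the pattern I depend on is essentially the standard VMM grow/shrink idiom, which is why each commit or decommit necessarily pays for at least one map/set-access or unmap/release pair. A simplified sketch (not my actual code):

```cpp
// Sketch of the grow/shrink idiom: a fixed VA reservation whose tail is
// committed and decommitted one granularity-sized chunk at a time.
// Error checking omitted for brevity.
#include <cuda.h>
#include <cstddef>
#include <vector>

struct GrowableBuffer {
    CUdeviceptr base = 0;          // reserved virtual range (never moves)
    size_t reserved = 0;           // total VA reserved
    size_t committed = 0;          // bytes currently backed by physical memory
    size_t gran = 0;               // allocation granularity (chunk size)
    CUmemAllocationProp prop = {};
    CUmemAccessDesc acc = {};
    std::vector<CUmemGenericAllocationHandle> chunks;
};

void init(GrowableBuffer& b, CUdevice dev, size_t reserveBytes) {
    b.prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    b.prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    b.prop.location.id = dev;
    b.acc.location = b.prop.location;
    b.acc.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemGetAllocationGranularity(&b.gran, &b.prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    b.reserved = ((reserveBytes + b.gran - 1) / b.gran) * b.gran;
    cuMemAddressReserve(&b.base, b.reserved, 0, 0, 0);
}

// Commit one more physical chunk at the end of the mapped region:
// every grow pays for cuMemCreate + cuMemMap + cuMemSetAccess.
void grow(GrowableBuffer& b) {
    CUmemGenericAllocationHandle h;
    cuMemCreate(&h, b.gran, &b.prop, 0);
    cuMemMap(b.base + b.committed, b.gran, 0, h, 0);
    cuMemSetAccess(b.base + b.committed, b.gran, &b.acc, 1);
    b.chunks.push_back(h);
    b.committed += b.gran;
}

// Decommit the last chunk: every shrink pays for cuMemUnmap + cuMemRelease.
void shrink(GrowableBuffer& b) {
    b.committed -= b.gran;
    cuMemUnmap(b.base + b.committed, b.gran);
    cuMemRelease(b.chunks.back());
    b.chunks.pop_back();
}
```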
It has been a few years since I had to do careful measurements of various CUDA APIs. Generally speaking, memory allocation and memory mapping are activities that consist almost entirely of host-side work. They may involve calling into operating system APIs. If your use case involves dynamic code generation with JIT compilation, that is pretty much all host-side work as well. Other CUDA APIs also have a host-side component, for example context initialization (major) and kernel launches (minor).
So the speed of the host system can be a factor in the performance of a CUDA-accelerated application, especially when the speed of various CUDA APIs is critical to your use case. The main GPU-related performance factor for CUDA APIs (as opposed to kernel execution) is usually GPU memory size, in particular for CUDA context initialization, which requires GPU memory to be mapped into a unified virtual address space: the larger the GPU memory, the longer it takes to map it.
In addition, much of the NVIDIA driver and OS work that takes place is of a single-threaded nature. So if timing is critical (as implied by your OP), you may want to explore (e.g. via a temporary loaner machine) how much incremental speedup can be achieved by utilizing a host platform with high single-thread performance, which to first order means high operating frequency.
For EPYC-based platforms my recommendation would be to look at the “Performance Enterprise” line, that is, model numbers 9174F, 9274F, 9374F, 9474F (Genoa architecture). You see the pattern here: these parts can be recognized by the ‘F’ suffix for “frequency optimized”. AMD is gearing up to ship Turin architecture CPUs, where there will be an equivalent line of processors, model numbers 9175F, 9275F, 9375F, 9475F, 9575F. To my knowledge these parts are not shipping yet.
For estimating the performance of CUDA’s host-side workload, I would guess that a composite of the following constituent benchmarks of SPECspeed® 2017 Integer might provide a reasonable representation of this code (“base” performance): 602.gcc, 620.omnetpp, 623.xalancbmk. You could search the literature to see whether a proper workload characterization exists somewhere.
The advantage of using SPEC benchmarks is that the SPEC organization maintains a large publicly available database of published results that collectively cover a large percentage of relevant hardware.
The ultimate arbiter of relative performance is obviously running actual code on actual systems, but that is sometimes difficult to do when expensive HPC systems are in play.