Page fault profiling

I have a CUDA program that can access memory in two ways: UVM and UVA. I want to compare both methods for that particular program. The execution time with UVM is much higher than with UVA, and I want to understand and analyze why that is. I am thinking of looking at the number of page faults in each time interval and the page-fault time, but I am unsure how to do this.

@jasoncohen can you respond here?

Hi @sparx,

Starting at the high level since I’m not sure which tools you’ve tried here… You can use Nsight Systems to see a timeline of your program’s execution, which includes your CUDA kernel launches, manual cudaMemcpy operations, and UVM’s page faults and on-demand transfers. Note that enabling UVM (a.k.a. UM or Unified Memory) tracing in Nsight Systems does add significant overhead when lots of faults or transfers occur, so using this tool may make your UVM performance look worse than it normally would, but it does show you where the UVM activity is occurring. If you want more low-level information about a particular CUDA kernel’s performance, you can also try Nsight Compute, which focuses on deep-dive performance analysis of individual CUDA kernels. I’m not sure whether Nsight Compute can capture UVM fault counts, but the docs for that tool should say.

If you’re trying to understand why things are happening the way they are, I’m going to need a bit of clarification on what exactly these two methods are doing. UVM typically means using cudaMallocManaged and then accessing the memory on the CPU and GPU without memcpy calls, letting the system page the allocations back and forth between CPU and GPU as needed. UVA, on the other hand, is not really a “method of accessing” – it’s a feature enabled automatically on all 64-bit NVIDIA driver installations that reserves separate virtual address ranges for each GPU and the CPU, so the virtual address space is “flat” across all the physical memory spaces. So can you clarify what you are doing for your “UVA” method? Typically the alternatives are (there’s a short sketch of these patterns after the list):
- Using cudaMalloc to allocate video memory and malloc to allocate pageable system memory, then using cudaMemcpy to do transfers between them. This is generally slower than the other options because transfers involving pageable memory have to be staged through an internal pinned buffer and can’t be fully asynchronous.
- Using cudaMalloc to allocate video memory and cudaHostAlloc to allocate pinned system memory, then using cudaMemcpy to do transfers between them. This can be fully asynchronous and is usually the fastest.
- “Zero copy” - Only using cudaHostAlloc to allocate pinned system memory, and having both the CPU and all GPUs read/write directly to this system memory. This is fast on the CPU and slow on the GPU, but it skips the need to call cudaMemcpy at all, so if the GPU only touches small bits of the memory, or if each value is only read once over the bus (or re-reads are served from the GPU’s caches, e.g. because the whole working set fits in cache), then this method could be fastest.
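
To make the comparison concrete, here’s a minimal sketch of the UVM pattern next to the pinned-memory + async-copy pattern. The `scale` kernel, the element count, and the lack of error checking are all placeholders for illustration – it isn’t meant to match your program, just to show where the page faults vs. the explicit copies come from:

```cuda
#include <cuda_runtime.h>

// Placeholder kernel – stands in for whatever your program actually does.
__global__ void scale(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// UVM: one pointer, no explicit copies – pages migrate on demand (page faults).
void uvm_version(int n) {
    float* data;
    cudaMallocManaged((void**)&data, n * sizeof(float));
    for (int i = 0; i < n; ++i) data[i] = float(i);   // CPU touch: pages resident on the host
    scale<<<(n + 255) / 256, 256>>>(data, n);         // GPU touch: pages fault-migrate to the GPU
    cudaDeviceSynchronize();
    float result = data[0];                           // CPU touch again: pages fault back
    (void)result;
    cudaFree(data);
}

// Pinned memory + explicit async copies on a stream – no page faults, full-speed DMA.
void pinned_copy_version(int n) {
    float *h_data, *d_data;
    cudaHostAlloc((void**)&h_data, n * sizeof(float), cudaHostAllocDefault);
    cudaMalloc((void**)&d_data, n * sizeof(float));
    for (int i = 0; i < n; ++i) h_data[i] = float(i);

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemcpyAsync(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice, stream);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);
    cudaMemcpyAsync(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d_data);
    cudaFreeHost(h_data);
}

int main() {
    uvm_version(1 << 20);
    pinned_copy_version(1 << 20);
    return 0;
}
```

The zero-copy variant would drop the cudaMalloc and cudaMemcpy calls entirely and have the kernel read/write the pinned host allocation directly (allocated with the cudaHostAllocMapped flag, with cudaHostGetDevicePointer used to get the device-side pointer).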

In my experience, UVM performance tends to be similar to copying with pageable memory, and both are worse than copying with pinned memory. The main case where UVM tends to perform better is when a kernel only needs to access small sections of a large allocation, and it’s unknown before the kernel launches which sections those will be. In that case, the cudaMemcpy methods would require uploading the entire large allocation before starting the kernel, whereas UVM will only page-fault to transfer the pages it actually touches. But more commonly, a well-crafted manual cudaMemcpy approach that uses CUDA streams effectively will outperform a UVM approach. You can sometimes improve UVM performance by making prefetch calls (cudaMemPrefetchAsync) before the kernel launch to force-migrate the data to the GPU and avoid faulting during the kernel, but at that point there’s not much advantage to using UVM this way vs. manually doing the copies.
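
For reference, here’s roughly what that prefetching variant looks like, reusing the placeholder scale kernel from the sketch above. It assumes a Pascal-or-newer GPU on Linux where on-demand migration and prefetching are supported, and again has no error checking:

```cuda
// UVM with explicit prefetching: pages are migrated up front, so the kernel
// itself doesn't fault. Reuses the placeholder scale kernel from above.
void uvm_prefetch_version(int n) {
    float* data;
    cudaMallocManaged((void**)&data, n * sizeof(float));
    for (int i = 0; i < n; ++i) data[i] = float(i);

    int device = 0;
    cudaGetDevice(&device);
    cudaMemPrefetchAsync(data, n * sizeof(float), device, 0);          // migrate to the GPU before launch
    scale<<<(n + 255) / 256, 256>>>(data, n);
    cudaMemPrefetchAsync(data, n * sizeof(float), cudaCpuDeviceId, 0); // migrate back before the CPU reads
    cudaDeviceSynchronize();

    float result = data[0];   // no fault here: pages are already back on the host
    (void)result;
    cudaFree(data);
}
```

If you profile both UVM versions in Nsight Systems, the difference should show up as faults scattered through the kernel’s time range in the first case vs. a block of up-front migration traffic in the second.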

Hope that helps!