Hello,
I am conducting research in computer architecture, with a particular focus on the GPU memory subsystem. I have a few technical questions about NVIDIA's Unified Memory and its oversubscription support, specifically the hardware-level components and mechanisms involved since CUDA 8.0 introduced memory oversubscription.
To clarify, my questions are not about the very latest platforms such as Grace Hopper or Grace Blackwell, but about the more broadly studied architectures: Volta, Turing, Ampere, and Ada Lovelace. I am already familiar with higher-level techniques such as the Access Counter-based migration policy and the tree-based prefetching used by Unified Memory; my interest here is in the hardware-level memory management details, particularly during page migration and fault handling.
Question 1. Page fault handling when GPU accesses CPU-mapped memory under Unified Memory
Under the Unified Memory model, memory allocated via cudaMallocManaged()
may reside physically in either CPU or GPU memory. When the GPU attempts to access a page currently resident in CPU memory, my understanding is as follows:
The memory request first traverses the GPU's LD/ST unit and the coalescing unit, then the multi-level TLB hierarchy (potentially including a shared L3 TLB). On a miss, the request reaches the GMMU, which performs a page table walk. Upon finding a valid page table entry whose aperture bits indicate that the page resides in CPU memory, the GMMU logs a page fault (far-fault) into the fault buffer and records the pending request in an MSHR.
The CPU-side UVM driver then services these faults in batches. Within a batch, faults are processed in ascending virtual-address order, and migrations are typically handled at the granularity of a 2MB VA block.
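To make sure I am describing the same mechanism, here is a minimal sketch of my mental model of that batching step; the structure and every name in it are mine, not the actual UVM driver code:

```cpp
// Sketch (my assumption, not NVIDIA's code): drain the fault buffer into
// a batch, sort by faulting VA, then service the faults one 2MB VA block
// at a time.
#include <algorithm>
#include <cstdint>
#include <vector>

struct FaultEntry { uint64_t va; /* plus access type, client id, ... */ };

void service_fault_batch(std::vector<FaultEntry>& batch) {
    std::sort(batch.begin(), batch.end(),
              [](const FaultEntry& a, const FaultEntry& b) { return a.va < b.va; });
    const uint64_t kVaBlock = 2ull << 20;               // 2MB VA block
    for (size_t i = 0; i < batch.size();) {
        uint64_t base = batch[i].va & ~(kVaBlock - 1);  // block holding this fault
        // ...migrate/map the faulted pages of this VA block as one unit...
        while (i < batch.size() && (batch[i].va & ~(kVaBlock - 1)) == base)
            ++i;                                        // consume the block's faults
    }
}

int main() {
    std::vector<FaultEntry> batch{{0x200123000ull}, {0x200001000ull}, {0x400005000ull}};
    service_fault_batch(batch);   // two blocks: 0x200000000 and 0x400000000
    return 0;
}
```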
Could you please confirm whether this understanding is accurate, or if there are any misconceptions or missing steps in this flow?
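For reference, here is a minimal repro of the access pattern I have in mind; the kernel, sizes, and launch shape are mine, chosen only for illustration. After the managed range is prefetched to the CPU, the kernel's first touch of each page should take exactly the far-fault path described above:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Touch one byte per 4KB page so every page access can far-fault.
__global__ void touch_pages(char* buf, size_t n) {
    size_t i = (blockIdx.x * (size_t)blockDim.x + threadIdx.x) * 4096;
    if (i < n) buf[i] += 1;
}

int main() {
    const size_t n = 64ull << 20;               // 64 MiB of managed memory
    char* buf = nullptr;
    cudaMallocManaged(&buf, n);

    // Force residency on the CPU side so the kernel's first touches miss
    // in the GPU page tables and go through the fault buffer.
    cudaMemPrefetchAsync(buf, n, cudaCpuDeviceId, 0);
    cudaDeviceSynchronize();

    size_t pages = n / 4096;
    touch_pages<<<(unsigned)((pages + 255) / 256), 256>>>(buf, n);
    cudaDeviceSynchronize();
    std::printf("done: %d\n", (int)buf[0]);
    return 0;
}
```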
Question 2. CPU handling of page faults: polling vs interrupt
In the above process, I understand that page fault requests are first buffered in the GPU’s page fault buffer. I am curious about how the CPU-side UVM driver becomes aware of these faults.
Does the UVM driver use a worker thread that polls for page faults periodically, or does the GPU issue a PCIe-based interrupt to notify the CPU? If an interrupt is involved, what conditions or thresholds (e.g., buffer full, max batch size reached) trigger the GPU to send this interrupt?
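To pin down what I mean by the two alternatives, here is a toy host-side model; it is entirely my own construction, not NVIDIA's driver. A producer standing in for the GPU advances a PUT pointer on a fault ring buffer, and scheme (a) drains it by polling; scheme (b) would replace the poll loop with an interrupt handler:

```cpp
// Toy model of fault-buffer notification; every name here is mine.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

std::atomic<unsigned> put{0}, get{0};   // fault buffer PUT/GET pointers
std::atomic<bool> running{true};

// Scheme (a): a worker thread periodically polls PUT against GET.
void uvm_polling_worker() {
    while (running.load()) {
        unsigned p = put.load();
        if (p != get.load()) {
            std::printf("servicing fault batch [%u, %u)\n", get.load(), p);
            get.store(p);               // drain the pending entries as one batch
        }
        std::this_thread::sleep_for(std::chrono::microseconds(100));
    }
}

int main() {
    std::thread worker(uvm_polling_worker);
    for (int i = 0; i < 8; ++i) {       // the "GPU" logging far-faults
        put.fetch_add(1);
        std::this_thread::sleep_for(std::chrono::microseconds(50));
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
    running.store(false);
    worker.join();
    return 0;
}
// Scheme (b) would drop the sleep/poll loop: the GPU raises an MSI/MSI-X
// interrupt over PCIe (on non-empty? on a batch-size threshold? this is my
// question), and the ISR runs the same drain logic as a bottom half.
```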
Question 3. TLB shootdown and page table updates during page migration
Since Unified Memory dynamically migrates pages between CPU and GPU memory, the following transitions require both TLB shootdowns and page table updates (a pseudocode sketch of the GPU → CPU ordering I assume follows the list):
- CPU → GPU migration:
  - Update the CPU-side page table
  - Invalidate stale CPU TLB entries
  - Update the GPU page table
- GPU → CPU migration:
  - Update the GPU page table
  - Invalidate stale GPU TLB entries
  - Update the CPU page table
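To be explicit about the ordering I am assuming in the GPU → CPU case, here is a sketch of one page migration; every function is a hypothetical stub, used only to fix the sequence I want to ask about:

```cpp
#include <cstdint>
#include <cstdio>

// Stubs standing in for the real operations; all names are hypothetical.
static void gpu_pte_clear(uint64_t va)      { std::printf("GPU PTE clear %llx\n", (unsigned long long)va); }
static void gpu_tlb_invalidate(uint64_t va) { std::printf("GPU TLB invld %llx\n", (unsigned long long)va); }
static void copy_page_dtoh(uint64_t va)     { std::printf("DtoH copy     %llx\n", (unsigned long long)va); }
static void cpu_pte_map(uint64_t va)        { std::printf("CPU PTE map   %llx\n", (unsigned long long)va); }

// One GPU -> CPU page migration in the order I believe is required:
// unmap and shoot down on the source side first, so no stale translation
// can touch the page mid-copy; then copy the data; then map it on the
// destination side.
void migrate_page_gpu_to_cpu(uint64_t va) {
    gpu_pte_clear(va);
    gpu_tlb_invalidate(va);
    copy_page_dtoh(va);
    cpu_pte_map(va);
}

int main() { migrate_page_gpu_to_cpu(0x7f2345601000ull); return 0; }
```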
As far as I understand, the CPU-side TLB shootdown and page table update are typically software-managed and can vary by platform. I would like to know:
- How does NVIDIA handle TLB shootdown and page table updates inside the GPU, particularly during migration triggered by Unified Memory?
- Are there any publicly available performance characterizations or quantitative overhead metrics for these operations?
So far I have found only one published characterization of GPU TLB shootdown, but it is fairly old and targets AMD GPUs:
[ISPASS'16] Vesely et al., "Observations and Opportunities in Architecting Shared Virtual Memory for Heterogeneous Systems"
Any references to technical documentation or academic papers covering this topic would be greatly appreciated.
Thank you very much for your time and support.
Best regards,
[Boyeol Choi]