Hello,
I am conducting research in computer architecture, with a particular focus on the GPU memory subsystem. I have a few technical questions about NVIDIA's Unified Memory and its oversubscription support, specifically the hardware-level components and mechanisms involved since CUDA 8.0 introduced memory oversubscription.
To clarify, my questions are not about the very latest platforms such as Grace Hopper or Grace Blackwell, but about the more broadly studied architectures: Volta, Turing, Ampere, and Ada Lovelace. I am already familiar with higher-level techniques such as the Access Counter-based migration policy and the tree-based prefetching used by Unified Memory; my interest here is in the hardware-level memory management details, particularly during page migration and fault handling.
Question 1. Page fault handling when GPU accesses CPU-mapped memory under Unified Memory
Under the Unified Memory model, memory allocated via cudaMallocManaged()
may reside physically in either CPU or GPU memory. When the GPU attempts to access a page currently resident in CPU memory, my understanding is as follows:
The memory request first traverses the GPU's LD/ST unit and the coalescing unit, then the multi-level TLB hierarchy (potentially including a shared L3 TLB). On a miss, the request reaches the GMMU, which performs a page table walk. Upon finding a valid page table entry whose aperture bits indicate that the page resides in CPU memory, the GMMU logs a page fault (far-fault) into the fault buffer and records the pending request in an MSHR.
The CPU-side UVM driver then services these faults in batches. Within a batch, faults are processed in ascending virtual-address order, and migrations are typically handled at the granularity of a 2MB VA block.
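To make sure I am describing the same mechanism, here is a minimal sketch of my mental model of that batching step; the structure and every name in it are mine, not the actual UVM driver code:

```cpp
// Sketch (my assumption, not NVIDIA's code): drain the fault buffer into
// a batch, sort by faulting VA, then service the faults one 2MB VA block
// at a time.
#include <algorithm>
#include <cstdint>
#include <vector>

struct FaultEntry { uint64_t va; /* plus access type, client id, ... */ };

void service_fault_batch(std::vector<FaultEntry>& batch) {
    std::sort(batch.begin(), batch.end(),
              [](const FaultEntry& a, const FaultEntry& b) { return a.va < b.va; });
    const uint64_t kVaBlock = 2ull << 20;               // 2MB VA block
    for (size_t i = 0; i < batch.size();) {
        uint64_t base = batch[i].va & ~(kVaBlock - 1);  // block holding this fault
        // ...migrate/map the faulted pages of this VA block as one unit...
        while (i < batch.size() && (batch[i].va & ~(kVaBlock - 1)) == base)
            ++i;                                        // consume the block's faults
    }
}

int main() {
    std::vector<FaultEntry> batch{{0x200123000ull}, {0x200001000ull}, {0x400005000ull}};
    service_fault_batch(batch);   // two blocks: 0x200000000 and 0x400000000
    return 0;
}
```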
Could you please confirm whether this understanding is accurate, or if there are any misconceptions or missing steps in this flow?
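For reference, here is a minimal repro of the access pattern I have in mind; the kernel, sizes, and launch shape are mine, chosen only for illustration. After the managed range is prefetched to the CPU, the kernel's first touch of each page should take exactly the far-fault path described above:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Touch one byte per 4KB page so every page access can far-fault.
__global__ void touch_pages(char* buf, size_t n) {
    size_t i = (blockIdx.x * (size_t)blockDim.x + threadIdx.x) * 4096;
    if (i < n) buf[i] += 1;
}

int main() {
    const size_t n = 64ull << 20;               // 64 MiB of managed memory
    char* buf = nullptr;
    cudaMallocManaged(&buf, n);

    // Force residency on the CPU side so the kernel's first touches miss
    // in the GPU page tables and go through the fault buffer.
    cudaMemPrefetchAsync(buf, n, cudaCpuDeviceId, 0);
    cudaDeviceSynchronize();

    size_t pages = n / 4096;
    touch_pages<<<(unsigned)((pages + 255) / 256), 256>>>(buf, n);
    cudaDeviceSynchronize();
    std::printf("done: %d\n", (int)buf[0]);
    return 0;
}
```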
Question 2. CPU handling of page faults: polling vs interrupt
In the above process, I understand that page fault requests are first buffered in the GPU’s page fault buffer. I am curious about how the CPU-side UVM driver becomes aware of these faults.
Does the UVM driver use a worker thread that polls for page faults periodically, or does the GPU issue a PCIe-based interrupt to notify the CPU? If an interrupt is involved, what conditions or thresholds (e.g., buffer full, max batch size reached) trigger the GPU to send this interrupt?
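To pin down what I mean by the two alternatives, here is a toy host-side model; it is entirely my own construction, not NVIDIA's driver. A producer standing in for the GPU advances a PUT pointer on a fault ring buffer, and scheme (a) drains it by polling; scheme (b) would replace the poll loop with an interrupt handler:

```cpp
// Toy model of fault-buffer notification; every name here is mine.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

std::atomic<unsigned> put{0}, get{0};   // fault buffer PUT/GET pointers
std::atomic<bool> running{true};

// Scheme (a): a worker thread periodically polls PUT against GET.
void uvm_polling_worker() {
    while (running.load()) {
        unsigned p = put.load();
        if (p != get.load()) {
            std::printf("servicing fault batch [%u, %u)\n", get.load(), p);
            get.store(p);               // drain the pending entries as one batch
        }
        std::this_thread::sleep_for(std::chrono::microseconds(100));
    }
}

int main() {
    std::thread worker(uvm_polling_worker);
    for (int i = 0; i < 8; ++i) {       // the "GPU" logging far-faults
        put.fetch_add(1);
        std::this_thread::sleep_for(std::chrono::microseconds(50));
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
    running.store(false);
    worker.join();
    return 0;
}
// Scheme (b) would drop the sleep/poll loop: the GPU raises an MSI/MSI-X
// interrupt over PCIe (on non-empty? on a batch-size threshold? this is my
// question), and the ISR runs the same drain logic as a bottom half.
```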
Question 3. TLB shootdown and page table updates during page migration
Since Unified Memory dynamically migrates pages between CPU and GPU memory, the following transitions require both TLB shootdowns and page table updates (a pseudocode sketch of the GPU → CPU ordering I assume follows the list):
- CPU → GPU migration:
  - Update the CPU-side page table
  - Invalidate stale CPU TLB entries
  - Update the GPU page table
- GPU → CPU migration:
  - Update the GPU page table
  - Invalidate stale GPU TLB entries
  - Update the CPU page table
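To be explicit about the ordering I am assuming in the GPU → CPU case, here is a sketch of one page migration; every function is a hypothetical stub, used only to fix the sequence I want to ask about:

```cpp
#include <cstdint>
#include <cstdio>

// Stubs standing in for the real operations; all names are hypothetical.
static void gpu_pte_clear(uint64_t va)      { std::printf("GPU PTE clear %llx\n", (unsigned long long)va); }
static void gpu_tlb_invalidate(uint64_t va) { std::printf("GPU TLB invld %llx\n", (unsigned long long)va); }
static void copy_page_dtoh(uint64_t va)     { std::printf("DtoH copy     %llx\n", (unsigned long long)va); }
static void cpu_pte_map(uint64_t va)        { std::printf("CPU PTE map   %llx\n", (unsigned long long)va); }

// One GPU -> CPU page migration in the order I believe is required:
// unmap and shoot down on the source side first, so no stale translation
// can touch the page mid-copy; then copy the data; then map it on the
// destination side.
void migrate_page_gpu_to_cpu(uint64_t va) {
    gpu_pte_clear(va);
    gpu_tlb_invalidate(va);
    copy_page_dtoh(va);
    cpu_pte_map(va);
}

int main() { migrate_page_gpu_to_cpu(0x7f2345601000ull); return 0; }
```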
As far as I understand, the CPU-side TLB shootdown and page table update are typically software-managed and can vary by platform. I would like to know:
- How does NVIDIA handle TLB shootdown and page table updates inside the GPU, particularly during migration triggered by Unified Memory?
- Are there any publicly available performance characterizations or quantitative overhead metrics for these operations?
So far I have found only one published characterization of GPU TLB shootdown, but it is fairly old and targets AMD GPUs:
[ISPASS'16] Vesely et al., "Observations and Opportunities in Architecting Shared Virtual Memory for Heterogeneous Systems"
Any references to technical documentation or academic papers covering this topic would be greatly appreciated.
Thank you very much for your time and support.
Best regards,
[Boyeol Choi]