[GH200] Effect of ATS-enabled Unified Table on Explicit memcpy()

vitduck · August 5, 2024, 8:58am

Hello,
(I guess this question is hardware-related, so please move it to a more suitable forum at your discretion.)

The following diagram, taken from a paper published by NVIDIA, shows how ATS handles a TLB miss on the GPU side:

(DUCATI: High-performance Address Translation by Extending TLB Reach of GPU-accelerated Systems)

In the case of UVM/HMM, the page tables are local to each processor, so a far fault triggers page migration.
In the case of GH200 with ATS:

1. Is the ‘unified table’ located in the CPU’s memory, i.e. does the GMMU no longer maintain its own page table?
2. Does it follow from (1) that
   - the CPU will no longer page-fault when accessing GPU memory, and
   - ATS is only used by the GPU for translating CPU virtual addresses to physical addresses?
3. In the case of a pointer created with cudaMalloc():
   - Does its PTE also exist in the unified page table?
   - Consequently, can the CPU see and access it without memcpy()?
4. As I understand it, Hopper will load data from Grace’s L3 cache directly into its HBM. Does this imply that DDR5 is bypassed altogether during any memory transaction? I ask because, in the case of UVM, the page is first transferred to CPU main memory.
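
Part of this can at least be probed from software. The sketch below queries the CUDA device attributes that describe how the GPU reaches pageable host memory and whether the host can access managed memory on the device directly. The helper `query_attr` is a made-up name for illustration, and I am assuming device 0 is the Hopper GPU; the attributes themselves come from the CUDA runtime API. It only reports what the driver advertises and does not answer where the page table physically lives.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns the attribute value, or -1 if the query fails.
static int query_attr(cudaDeviceAttr attr, int device) {
    int value = 0;
    cudaError_t err = cudaDeviceGetAttribute(&value, attr, device);
    return (err == cudaSuccess) ? value : -1;
}

int main() {
    const int device = 0;  // assumption: device 0 is the GPU of interest

    // 1 if the GPU can coherently access pageable (malloc) memory.
    printf("PageableMemoryAccess:                   %d\n",
           query_attr(cudaDevAttrPageableMemoryAccess, device));

    // 1 if the GPU accesses pageable memory via the host's page tables.
    printf("PageableMemoryAccessUsesHostPageTables: %d\n",
           query_attr(cudaDevAttrPageableMemoryAccessUsesHostPageTables, device));

    // 1 if the host can directly access managed memory on the device
    // without migration.
    printf("DirectManagedMemAccessFromHost:         %d\n",
           query_attr(cudaDevAttrDirectManagedMemAccessFromHost, device));

    // 1 if CPU and GPU can access managed memory concurrently.
    printf("ConcurrentManagedAccess:                %d\n",
           query_attr(cudaDevAttrConcurrentManagedAccess, device));
    return 0;
}
```

Compiling with `nvcc` and running this on the node in question shows which of these capabilities the installed driver reports.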

We have many users whose in-house codes use explicit memcpy().
This can become ambiguous, because a pointer returned by malloc() can now be accessed by both the CPU and the GPU.
We would like to assure our users that ATS will not have any surprising effects on memcpy().
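
For concreteness, the pattern in question looks roughly like the sketch below: a host buffer from malloc(), a device buffer from cudaMalloc(), and explicit cudaMemcpy() calls in both directions. The kernel `add_one` and the helper `explicit_copy_roundtrip` are made-up names for illustration. This is ordinary CUDA and does not by itself show anything ATS-specific; it is simply the code path whose behavior we want to stay unchanged.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative kernel: adds 1 to every element.
__global__ void add_one(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Hypothetical helper: copies a malloc() buffer to the GPU, runs the kernel,
// copies the result back, and returns true if every element was incremented.
bool explicit_copy_roundtrip(int n) {
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);

    float *h = static_cast<float *>(malloc(bytes));  // pageable host memory
    if (!h) return false;
    for (int i = 0; i < n; ++i) h[i] = static_cast<float>(i);

    float *d = nullptr;
    if (cudaMalloc(&d, bytes) != cudaSuccess) { free(h); return false; }

    // Explicit host-to-device copy, kernel launch, explicit device-to-host copy.
    bool ok = cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        add_one<<<(n + 255) / 256, 256>>>(d, n);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    // Verify on the host (values stay exactly representable for n < 2^24).
    for (int i = 0; ok && i < n; ++i) {
        if (h[i] != static_cast<float>(i) + 1.0f) ok = false;
    }

    cudaFree(d);
    free(h);
    return ok;
}

int main() {
    const bool ok = explicit_copy_roundtrip(1 << 20);
    printf("explicit memcpy round trip: %s\n", ok ? "OK" : "FAILED");
    return ok ? 0 : 1;
}
```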

Regards.
