Hello,
(I guess this question is hardware related, so please move them to suitable forum at your discretion)
The following diagram taken from NVIDIA published paper shows how ATS handles the TLB miss on GPU side:
(DUCATI: High-performance Address Translation by Extending TLB Reach of GPU-accelerated Systems)
In case of UVM/HMM, the page table is localized, thus a far fault will trigger page migration.
In case of GH200 with ATS:
- Is the ‘unified table’ located in CPU’s memory, i.e. the GMMU no longer maintains its own page table ?
- Does it follow from (1) that
- CPU will no longer have page fault when accessing GPU memory
- ATS is only used by GPU for CPU’s virtual-to-physical address translation.
- In case of a pointer created with cudaMalloc()
- Does its PTE also exist in the unified page table ?
- Consequently, can the CPU see and access it without memcpy() ?
- As I understand, Hopper will load a Grace’s L3 cache directly into its HBM.
Does it imply that DDR5 is bypassed altogether during any memory transaction because in case of UVM, the page is first transferred to CPU main memory.
We have many users with in-house codes using explicit memcpy().
It can be ambiguous because the pointer from malloc() can now be accessed by both CPU and GPU.
And we would like to assure our users that there won’t be any ‘surprised’ effect of ATS on memcpy().
Regards.