Hardware coherence over NVLink

Hello,

I am trying to use the new features of NVLink, such as coherence, and I have a few questions:

  1. Is hardware coherence enabled between two GPUs connected with NVLink? If not, how do I turn it on? In a test program I wrote, coherence appeared to be supported.
  2. What is the relationship between unified virtual memory and NVLink coherence? I tested this with a small program. It seems that unified virtual memory takes precedence over NVLink coherence when the memory is allocated with cudaMallocManaged; in that case coherency is guaranteed by unified virtual memory.
  3. Do you have any suggestions on when I should use unified virtual memory and when NVLink coherence, in terms of performance? Do you have some examples?

Thank you so much!

GPUs that are connected via NVLink and have P2P enabled (“peer access”) between them (cudaDeviceEnablePeerAccess()) can access allocations in non-local memory as if it were local (from a programming perspective). These would be “ordinary” device allocations created with cudaMalloc.
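To make the peer-access case concrete, here is a minimal sketch, assuming a machine with at least two peer-capable GPUs numbered 0 and 1. The kernel name fill, the buffer name buf1 and the CHECK macro are made up for illustration. GPU 0 writes directly into an ordinary cudaMalloc allocation that lives on GPU 1:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative helper: bail out of main() if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s failed: %s\n", #call,            \
                         cudaGetErrorString(err_));                   \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Launched on GPU 0, but 'remote' points to memory that lives on GPU 1.
__global__ void fill(int *remote, int n, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) remote[i] = value;
}

int main() {
    int count = 0;
    CHECK(cudaGetDeviceCount(&count));
    if (count < 2) { std::printf("need at least two GPUs\n"); return 0; }

    int canAccess = 0;
    CHECK(cudaDeviceCanAccessPeer(&canAccess, 0, 1));
    if (!canAccess) { std::printf("GPU 0 cannot access GPU 1 as a peer\n"); return 0; }

    const int n = 1 << 20;
    int *buf1 = nullptr;
    CHECK(cudaSetDevice(1));
    CHECK(cudaMalloc(&buf1, n * sizeof(int)));   // ordinary allocation on GPU 1

    CHECK(cudaSetDevice(0));
    CHECK(cudaDeviceEnablePeerAccess(1, 0));     // allow GPU 0 to access GPU 1 memory
    fill<<<(n + 255) / 256, 256>>>(buf1, n, 42);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());              // wait for the remote writes to finish

    int first = 0;
    CHECK(cudaMemcpy(&first, buf1, sizeof(int), cudaMemcpyDeviceToHost));
    std::printf("buf1[0] = %d\n", first);

    CHECK(cudaSetDevice(1));
    CHECK(cudaFree(buf1));
    return 0;
}
```

Nothing in the kernel distinguishes the remote pointer from a local one; the only extra steps are the peer capability check and the cudaDeviceEnablePeerAccess() call.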

Managed allocations have coherency claims(1), but the programmer must still understand the variability of access order and may need to provide some synchronization mechanism, when multiple processors are accessing a single UM allocation, to avoid hazards.

The same synchronization notion applies to peer access.
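One possible way to provide that synchronization for peer access is a CUDA event shared between streams on the two GPUs. The following is only a sketch under assumptions: two peer-capable GPUs numbered 0 and 1, and the kernel names produce and consume are hypothetical. Error checking is omitted for brevity. GPU 0 fills a buffer, and GPU 1 reads it over the peer mapping only after the event signals that the producer has finished:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void produce(int *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = i;
}

// Runs on GPU 1 and reads 'buf', which lives on GPU 0.
__global__ void consume(const int *buf, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2 * buf[i];
}

int main() {
    int count = 0, canAccess = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count < 2) return 0;
    cudaDeviceCanAccessPeer(&canAccess, 1, 0);
    if (!canAccess) { std::printf("no peer access from GPU 1 to GPU 0\n"); return 0; }

    const int n = 1 << 20;
    const int blocks = (n + 255) / 256;
    int *buf0 = nullptr, *out1 = nullptr;
    cudaStream_t s0, s1;
    cudaEvent_t ready;

    cudaSetDevice(0);
    cudaMalloc(&buf0, n * sizeof(int));
    cudaStreamCreate(&s0);
    cudaEventCreateWithFlags(&ready, cudaEventDisableTiming);

    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);   // GPU 1 may now dereference buf0
    cudaMalloc(&out1, n * sizeof(int));
    cudaStreamCreate(&s1);

    cudaSetDevice(0);
    produce<<<blocks, 256, 0, s0>>>(buf0, n);
    cudaEventRecord(ready, s0);         // marks "producer finished" in s0

    cudaSetDevice(1);
    cudaStreamWaitEvent(s1, ready, 0);  // s1 stalls until the producer is done
    consume<<<blocks, 256, 0, s1>>>(buf0, out1, n);
    cudaStreamSynchronize(s1);

    int last = 0;
    cudaMemcpy(&last, out1 + (n - 1), sizeof(int), cudaMemcpyDeviceToHost);
    std::printf("out1[n-1] = %d\n", last);

    cudaFree(out1);
    cudaStreamDestroy(s1);
    cudaSetDevice(0);
    cudaFree(buf0);
    cudaEventDestroy(ready);
    cudaStreamDestroy(s0);
    return 0;
}
```

Without the cudaStreamWaitEvent() call, the two kernels could overlap and the consumer might read elements the producer has not written yet, regardless of how the GPUs are connected.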

In the case of managed memory, NVLink acts as a fast transport path for migration of data. The coherency is supported via data migration. Effectively, only one processor in the system can access a given page at any point in time, and data is moved processor-to-processor, page-wise, on demand. I’m mostly ignoring the idea of “ReadMostly” type managed allocations (although these don’t negate any coherency claims). The assumption here is a typical managed allocation, without memory hints, that is migratable in a Pascal-or-later (demand-paged) UM regime.
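As an illustration of the managed-memory case, here is a minimal sketch, assuming at least two GPUs and a demand-paged UM platform. The kernel name increment is hypothetical. The host and two GPUs take turns touching the same cudaMallocManaged allocation, with a synchronization between each hand-off:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void increment(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count < 2) {
        std::printf("need at least two GPUs\n");
        return 0;
    }

    const int n = 1 << 20;
    int *data = nullptr;
    if (cudaMallocManaged(&data, n * sizeof(int)) != cudaSuccess) return 1;

    // First touch happens on the host.
    for (int i = 0; i < n; ++i) data[i] = 0;

    for (int dev = 0; dev < 2; ++dev) {
        cudaSetDevice(dev);
        // On a demand-paged system, the pages migrate to this GPU as the
        // kernel faults on them (over NVLink when the GPUs are so connected).
        increment<<<(n + 255) / 256, 256>>>(data, n);
        // Finish before the next processor touches the same allocation.
        cudaDeviceSynchronize();
    }

    // Host access after the final synchronization.
    std::printf("data[0] = %d\n", data[0]);
    cudaFree(data);
    return 0;
}
```

The same pointer is valid on every processor, but the cudaDeviceSynchronize() calls are what keep the accesses ordered; the migration machinery alone does not do that.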

The NCCL library code is open source and shows how to do synchronized movement of data between GPUs over NVLink.

(1) https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf

Hi Robert,

Thanks for your quick response.

If synchronization has to be performed in every case, what is the purpose of supporting hardware coherence on NVLink? How can I leverage this new feature to boost performance?

Best,

So, in other words, NVLink does NOT provide hardware cache coherency, meaning that MESI cache coherency protocol messages do not propagate through NVLink. Put another way, an NVIDIA DGX-H100 system does not provide true hardware cache-coherent shared memory across all of its GPUs.
I imagine NVLink-C2C will actually provide hardware cache-coherent shared memory?