Hardware coherence over NVLink

liuhy · September 12, 2019, 10:13pm

Hello,

I am trying to use the new features of NVLink, such as coherence, but I have some questions:

Is hardware coherence enabled between two GPUs connected with NVLink? If not, how can I turn it on? I tried a test program, and coherence appears to be supported.
What is the relationship between unified virtual memory and NVLink coherence? I tested this using a small program. It seems unified virtual memory takes precedence over NVLink coherence if the memory is allocated with cudaMallocManaged; in that case the coherency is guaranteed by unified virtual memory.
Do you have any suggestions on when I should use unified virtual memory versus NVLink coherence, in terms of performance? Do you have any examples?

Thank you so much!

Robert_Crovella · September 12, 2019, 11:23pm

GPUs that are connected via NVLink and have P2P enabled (“peer access”) between them (cudaDeviceEnablePeerAccess()) can access allocations in non-local memory as if it were local (from a programming perspective). These would be “ordinary” device allocations created with cudaMalloc.
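As a rough illustration of the peer-access case, here is a minimal sketch in which a kernel running on device 0 writes directly into a cudaMalloc allocation that lives on device 1. The function name run_peer_demo, the kernel fill, and the device indices 0 and 1 are assumptions made for this example, not something from the posts above. The code only checks that peer access is possible; it does not check whether the path between the two GPUs is NVLink or PCIe.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: writes `value` into every element of `buf`.
__global__ void fill(int *buf, int n, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = value;
}

// Returns 0 on success, or when the machine has no pair of peer-capable GPUs
// (the demo is skipped in that case). Nonzero means a CUDA error or bad result.
int run_peer_demo() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count < 2) {
        std::printf("fewer than two GPUs, skipping peer demo\n");
        return 0;
    }

    // Can device 0 directly address memory that lives on device 1?
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, 0, 1);
    if (!can_access) {
        std::printf("device 0 cannot access device 1, skipping peer demo\n");
        return 0;
    }

    const int n = 1 << 20;
    int *remote = nullptr;

    // Ordinary device allocation on device 1.
    cudaSetDevice(1);
    if (cudaMalloc(&remote, n * sizeof(int)) != cudaSuccess) return 1;

    // Enable peer access from device 0 to device 1 (the flags argument must be 0).
    cudaSetDevice(0);
    cudaError_t e = cudaDeviceEnablePeerAccess(1, 0);
    if (e != cudaSuccess && e != cudaErrorPeerAccessAlreadyEnabled) return 2;
    cudaGetLastError();  // clear the "already enabled" error, if any

    // Kernel runs on device 0 but dereferences a pointer into device 1's memory.
    fill<<<(n + 255) / 256, 256>>>(remote, n, 42);

    // Synchronize before anything else reads the buffer.
    if (cudaDeviceSynchronize() != cudaSuccess) return 3;

    int first = 0, last = 0;
    cudaMemcpy(&first, remote, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&last, remote + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(remote);

    return (first == 42 && last == 42) ? 0 : 4;
}
```

Calling run_peer_demo() from main and checking for a zero return is enough to try it. The cudaDeviceSynchronize() call stands in for whatever ordering the application needs before another processor touches the buffer.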

Managed allocations have coherency claims(1), but the programmer must still understand the variability of access order and may need to provide some synchronization mechanism, when multiple processors are accessing a single UM allocation, to avoid hazards.

The same synchronization notion applies to peer access.

In the case of managed memory, NVLink acts as a fast transport path for migration of data. The coherency is supported via data migration. Effectively, only one processor in the system can access a given page at any point in time, and data is moved processor-to-processor, page-wise, on demand. I’m mostly ignoring the idea of “ReadMostly” type managed allocations (although these don’t negate any coherency claims). The assumption here is a typical managed allocation, without memory hints, that is migratable in a Pascal-or-later (demand-paged) UM regime.
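A minimal sketch of the managed-memory case described above might look like the following: one cudaMallocManaged allocation is touched by the CPU, then by device 0, then by device 1, with an explicit synchronization between each step. The names run_managed_demo and add_one and the choice of devices 0 and 1 are assumptions for illustration. The sketch skips itself unless both devices report cudaDevAttrConcurrentManagedAccess, which is taken here as the indicator of the demand-paged regime. Whether the pages actually travel over NVLink depends on the system topology and is not checked.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: increments every element of a managed buffer.
__global__ void add_one(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Returns 0 on success or when the demo is skipped; nonzero on error.
int run_managed_demo() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count < 2) {
        std::printf("fewer than two GPUs, skipping managed demo\n");
        return 0;
    }

    // Assumption: demand-paged migration is available only when this
    // attribute is nonzero on every device that will touch the allocation.
    for (int d = 0; d < 2; ++d) {
        int concurrent = 0;
        cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, d);
        if (!concurrent) {
            std::printf("device %d lacks concurrent managed access, skipping\n", d);
            return 0;
        }
    }

    const int n = 1 << 20;
    int *data = nullptr;
    if (cudaMallocManaged(&data, n * sizeof(int)) != cudaSuccess) return 1;

    // First touch on the CPU: pages start out resident in host memory.
    for (int i = 0; i < n; ++i) data[i] = 0;

    // Each GPU takes a turn. Pages migrate to whichever processor faults on them.
    for (int d = 0; d < 2; ++d) {
        cudaSetDevice(d);
        add_one<<<(n + 255) / 256, 256>>>(data, n);
        // Explicit ordering: the next processor must not touch the data
        // until this kernel has finished.
        if (cudaDeviceSynchronize() != cudaSuccess) {
            cudaFree(data);
            return 2;
        }
    }

    // CPU reads the result; pages migrate back on demand.
    bool ok = (data[0] == 2 && data[n - 1] == 2);
    cudaFree(data);
    return ok ? 0 : 3;
}
```

The migration keeps each page consistent, but the order in which the two GPUs run is still set by the cudaDeviceSynchronize() calls. Removing them would leave the two kernels racing on the same elements.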

The NCCL library code is open source and shows how to do synchronized movement of data between GPUs over NVLink.

(1) https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf

liuhy · September 13, 2019, 4:51pm

Hi Robert,

Thanks for your quick response.

If synchronization must be performed in any case, what is the purpose of supporting hardware coherence on NVLink? How can I leverage this new feature to boost performance?

Best,

lstevens · May 1, 2023, 2:23am

So, in other words, NVLink does NOT provide hardware cache coherency, meaning that MESI cache coherency protocol messages do not propagate through NVLink. In yet other words, an NVIDIA DGX-H100 system does not provide true hardware cache-coherent shared memory across all of its GPUs.
I imagine NVLink-C2C will actually provide hardware cache-coherent shared memory?
