NVlink memory

wanderine · December 11, 2018, 9:21am

I’m considering buying a DGX system to train a memory-intensive 3D CNN, and I want to understand how shared memory over NVLink works. If I implement the 3D CNN in Keras or some other framework, will it see all GPUs and all GPU memory in the DGX system as a single GPU with a very large memory, or do I need to modify my code for multi-GPU?

Robert_Crovella · December 11, 2018, 2:32pm

Both scenarios are possible.

In a DGX-1V, the GPUs are connected in a particular pattern called a hybrid cube mesh. As a result, each GPU sees 4 of the other 7 GPUs as neighbors. It’s possible to write code that directly accesses the memory on these 4 neighbor GPUs. However, it’s not possible to do this in the general case with all 7 other GPUs.
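To make the neighbor limitation concrete, here is a minimal sketch of the hybrid cube mesh as a graph. It assumes the commonly described layout (two fully connected groups of 4 GPUs, plus one connection from each GPU to its counterpart in the other group). The GPU numbering and the helper names `nvlink_neighbors` and `can_access_directly` are illustrative, and the numbering may not match what a real system reports.

```python
# Assumed DGX-1V hybrid cube mesh: two fully connected quads (GPUs 0-3 and 4-7),
# plus one connection from each GPU to its counterpart in the other quad.
# The numbering is illustrative only.
NUM_GPUS = 8


def nvlink_neighbors(gpu):
    """Return the set of GPUs assumed to be directly NVLink-connected to `gpu`."""
    quad_start = (gpu // 4) * 4
    same_quad = {g for g in range(quad_start, quad_start + 4) if g != gpu}
    counterpart = (gpu + 4) % NUM_GPUS
    return same_quad | {counterpart}


def can_access_directly(src, dst):
    """True if `dst` is an NVLink neighbor of `src` in this model."""
    return dst in nvlink_neighbors(src)


for gpu in range(NUM_GPUS):
    neighbors = sorted(nvlink_neighbors(gpu))
    others = sorted(set(range(NUM_GPUS)) - set(neighbors) - {gpu})
    print(f"GPU{gpu}: neighbors {neighbors}, no direct NVLink to {others}")
```

In this model every GPU has exactly 4 neighbors and 3 GPUs it cannot reach directly over NVLink, which is why treating all 8 GPU memories as one directly addressable pool does not work in the general case on DGX-1.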

On the other hand, the DGX-1V hybrid cube mesh is designed so that there is a ring connecting all 8 GPUs with double NVLink connections. Software libraries like NCCL provide mechanisms to transfer data among all 8 GPUs, typically using this double-link ring as a “fast path” for common collective algorithms like all-reduce or broadcast. Frameworks running on DGX-1V are typically designed to use NCCL to transfer data, rather than accessing peer memory directly, because of the neighbor limitation on direct GPU-GPU memory traffic in DGX-1. Keras sits on top of one or more of these frameworks (e.g. TensorFlow). This type of communication system is how CNN training on a DGX-1 would typically be carried out; it is all engineered into the framework and the libraries (like NCCL) that it uses.
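As a rough illustration of why a ring is enough for all-reduce, here is a pure-Python simulation of the textbook ring all-reduce pattern (reduce-scatter followed by all-gather). It is only a sketch of the communication pattern, not NCCL's actual implementation, and it runs on plain lists rather than GPU buffers. The function name `ring_all_reduce` and the sample gradients are made up for the example.

```python
def ring_all_reduce(buffers):
    """Simulate ring all-reduce (sum) over equal-length per-rank lists.

    Each rank only ever sends to its ring successor, (rank + 1) % n.
    """
    n = len(buffers)
    length = len(buffers[0])
    assert length % n == 0, "sketch assumes the buffer splits evenly into n chunks"
    size = length // n
    # chunks[r][c] is rank r's local copy of chunk c.
    chunks = [[list(buf[c * size:(c + 1) * size]) for c in range(n)] for buf in buffers]

    # Reduce-scatter: after n - 1 steps, rank r holds the full sum of chunk (r + 1) % n.
    for step in range(n - 1):
        # Snapshot all sends first so every rank "sends" simultaneously.
        sends = [(r, (r - step) % n, list(chunks[r][(r - step) % n])) for r in range(n)]
        for r, c, payload in sends:
            dst = (r + 1) % n
            chunks[dst][c] = [a + b for a, b in zip(chunks[dst][c], payload)]

    # All-gather: circulate the finished chunks until every rank has all of them.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, list(chunks[r][(r + 1 - step) % n])) for r in range(n)]
        for r, c, payload in sends:
            chunks[(r + 1) % n][c] = payload

    return [[x for chunk in rank_chunks for x in chunk] for rank_chunks in chunks]


# Eight simulated "GPUs", each holding a gradient filled with (rank + 1).
gradients = [[rank + 1] * 8 for rank in range(8)]
reduced = ring_all_reduce(gradients)
print(reduced[0])
```

Every rank ends up with the element-wise sum while only ever talking to one ring neighbor, so the pattern never needs a direct connection between non-neighbor GPUs.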

Therefore, multi-GPU model training is well handled by frameworks like TensorFlow (and by extension, Keras), and you wouldn’t need to do anything “custom” on your own. You would typically use the facilities already provided by TensorFlow for multi-GPU work, and be done with it.

You could write your own code to transfer data directly, as if it were one big memory array, subject to the 4-neighbor limitation in DGX-1. In practice I don’t think many people do that for the purpose of distributed model training.

In DGX-2, the architecture is different. Every GPU is connected not directly to other GPUs over NVLink, but instead to an NVSwitch switch tree. This switch tree can be seen as a massive crossbar switch, providing a logical connection from every GPU to every other GPU (amongst all 16 GPUs in a DGX-2). This has several benefits:

We can now write programs where all GPU memory (at least among 8 GPUs) is mapped into the address space of a single GPU, and the single GPU can work with that memory from a logical perspective as if it were local.
Each GPU can connect to every other GPU in the system directly.
The peak theoretical bandwidth between any 2 GPUs in the system is the aggregate bandwidth of the 6 Gen2 NVLinks, in this case 300GB/s aggregate peak theoretical bandwidth between any 2 GPUs (150GB/s peak, per direction). This 300GB/s number is on the order of the main memory bandwidth of, for example, a Titan X card (480GB/s, peak). Therefore, although a DGX-2 GPU has ~900GB/s bandwidth to its local memory, it still has a possibly large amount of bandwidth (300GB/s) to any of its neighbors.
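The bandwidth figures above can be checked with a little arithmetic. This sketch assumes 25 GB/s per NVLink Gen2 link per direction, which is the per-link value implied by 6 links giving 150 GB/s per direction. The ~900 GB/s and 480 GB/s figures are copied from the post, and the variable names are just for illustration.

```python
# Per-link figure implied by the post: 6 links -> 150 GB/s per direction.
GBPS_PER_LINK_PER_DIRECTION = 25
LINKS_PER_GPU = 6

per_direction = LINKS_PER_GPU * GBPS_PER_LINK_PER_DIRECTION  # GB/s, one way
aggregate = 2 * per_direction                                # GB/s, both directions

# Figures quoted in the post (peak, approximate).
LOCAL_MEMORY_GBPS = 900   # DGX-2 GPU to its own memory
TITAN_X_GBPS = 480        # Titan X main memory

print(f"Per direction: {per_direction} GB/s, aggregate: {aggregate} GB/s")
print(f"Aggregate NVLink vs local memory: {aggregate / LOCAL_MEMORY_GBPS:.2f}x")
print(f"Aggregate NVLink vs Titan X memory: {aggregate / TITAN_X_GBPS:.2f}x")
```

So remote GPU memory on DGX-2 is reachable at roughly a third of local memory bandwidth (peak, theoretical), which is slower than local access but far from negligible.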

Again, typically, in either architecture, you would simply use the features built into the framework you are using to do distributed model training. The framework takes care of choosing the best communication methods between those GPUs.

wanderine · December 11, 2018, 2:39pm

Thank you, and how does it work on the cheaper DGX Station?

Robert_Crovella · December 11, 2018, 2:54pm

DGX Station has 4 GPUs that are connected via NVLink (“Fully-connected 4-way”). Each GPU has 3 neighbors with a double-link connection to each. Once again, GPU memory can be mapped into neighbor address space, but frameworks would typically use NCCL to optimize the communication pattern between the 4 GPUs for distributed model training.
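As a quick sanity check of the “fully-connected 4-way” description, this sketch enumerates the GPU pairs and counts links per GPU. The 2 links per pair come from the post. The 25 GB/s per link per direction is the value implied by the 6-link, 150 GB/s per direction figure quoted earlier in the thread, and is treated here as an assumption.

```python
from itertools import combinations

GPUS = range(4)
LINKS_PER_PAIR = 2                 # "double-link connection" from the post
GBPS_PER_LINK_PER_DIRECTION = 25   # assumed NVLink Gen2 per-link figure

# Fully connected: every unordered pair of GPUs has its own connection.
pairs = list(combinations(GPUS, 2))

links_used = {gpu: 0 for gpu in GPUS}
for a, b in pairs:
    links_used[a] += LINKS_PER_PAIR
    links_used[b] += LINKS_PER_PAIR

pair_bandwidth_per_direction = LINKS_PER_PAIR * GBPS_PER_LINK_PER_DIRECTION

print(f"GPU pairs: {pairs}")
print(f"Links used per GPU: {links_used}")
print(f"Peak per pair, per direction: {pair_bandwidth_per_direction} GB/s")
```

Unlike the DGX-1 case, no pair of GPUs is left without a direct NVLink connection here, so peer mapping works between any two of the 4 GPUs.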

If you’d like a deeper dive on DGX Station, I suggest reading the architecture whitepaper:

https://www.nvidia.com/object/dgx-station-system-architecture-whitepaper.html
