I’m considering buying a DGX system to train a memory-hungry 3D CNN, and I want to understand how shared memory over NVLink works. If I implement the 3D CNN in Keras or some other framework, will it see all GPUs and all GPU memory in the DGX system as a single GPU with a very large memory, or do I need to modify my code for multi-GPU?
Both scenarios are possible.
In a DGX-1V, the GPUs are connected in a particular pattern called a hybrid cube mesh. As a result, each GPU sees 4 of the other 7 GPUs as neighbors. It’s possible to write code that directly accesses the memory on these 4 neighbor GPUs; however, it’s not possible to do this with all 7 in the general case.
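If you want to see this for yourself, the CUDA runtime will report which device pairs it considers peer-capable. Here is a minimal sketch (error checking omitted); on a DGX-1V you’d expect each GPU’s 4 NVLink neighbors to show up, though the exact matrix also depends on the PCIe topology of the system:

```
#include <cstdio>
#include <cuda_runtime.h>

// Print which GPU pairs the CUDA runtime reports as peer-capable.
int main() {
  int ndev = 0;
  cudaGetDeviceCount(&ndev);
  printf("Peer-access matrix for %d GPUs (1 = can access):\n", ndev);
  for (int i = 0; i < ndev; ++i) {
    for (int j = 0; j < ndev; ++j) {
      int can = (i == j);                       // a GPU trivially accesses itself
      if (i != j) cudaDeviceCanAccessPeer(&can, i, j);
      printf("%d ", can);
    }
    printf("\n");
  }
  return 0;
}
```

The p2pBandwidthLatencyTest sample that ships with the CUDA toolkit does a more thorough version of this, including measured bandwidth between each pair of GPUs.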
On the other hand, the DGX-1V hybrid cube mesh is designed so that there is a ring connecting all 8 GPUs with double NVLink connections. Software libraries like NCCL provide mechanisms to transfer data among all 8 GPUs, typically using this double-link ring as a “fast path” for common collective algorithms like all-reduce or broadcast. Frameworks running on DGX-1V are typically designed to use NCCL to transfer data (rather than accessing peer memory directly, because of the neighbor limitation on direct GPU-to-GPU memory traffic in DGX-1). Keras sits on top of one of these frameworks (it could sit on top of TensorFlow, for example). This type of communication system is how CNN training on a DGX-1 would typically be carried out; it is all engineered into the framework and the libraries (like NCCL) that it uses.
Therefore multi-GPU model training is well handled by frameworks like TensorFlow (and, by extension, Keras), and you wouldn’t need to do anything “custom” on your own. You would typically use the multi-GPU facilities already provided by TensorFlow and be done with it.
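To give a sense of what a framework is doing under the hood, here is a rough single-process NCCL sketch (error checking omitted, dummy buffers standing in for gradients) that sum-reduces a buffer across all visible GPUs. NCCL works out the ring path over NVLink on its own:

```
#include <cstdio>
#include <cuda_runtime.h>
#include <nccl.h>

#define MAX_GPUS 16

int main() {
  int ndev = 0;
  cudaGetDeviceCount(&ndev);                    // 8 on a DGX-1V, 16 on a DGX-2
  if (ndev > MAX_GPUS) ndev = MAX_GPUS;

  ncclComm_t comms[MAX_GPUS];
  cudaStream_t streams[MAX_GPUS];
  float *sendbuf[MAX_GPUS], *recvbuf[MAX_GPUS];
  int devs[MAX_GPUS];
  const size_t count = 1 << 20;                 // 1M floats per GPU

  // Per-GPU buffers and streams
  for (int i = 0; i < ndev; ++i) {
    devs[i] = i;
    cudaSetDevice(i);
    cudaMalloc(&sendbuf[i], count * sizeof(float));
    cudaMalloc(&recvbuf[i], count * sizeof(float));
    cudaMemset(sendbuf[i], 0, count * sizeof(float));
    cudaStreamCreate(&streams[i]);
  }

  // One communicator per GPU, all within this single process
  ncclCommInitAll(comms, ndev, devs);

  // Sum-reduce across all GPUs; NCCL chooses the NVLink path itself
  ncclGroupStart();
  for (int i = 0; i < ndev; ++i)
    ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                  comms[i], streams[i]);
  ncclGroupEnd();

  for (int i = 0; i < ndev; ++i) {
    cudaSetDevice(i);
    cudaStreamSynchronize(streams[i]);
  }

  // Cleanup
  for (int i = 0; i < ndev; ++i) {
    ncclCommDestroy(comms[i]);
    cudaSetDevice(i);
    cudaFree(sendbuf[i]);
    cudaFree(recvbuf[i]);
    cudaStreamDestroy(streams[i]);
  }
  printf("all-reduce across %d GPUs done\n", ndev);
  return 0;
}
```

Build with something like `nvcc allreduce.cu -lnccl` (the file name is of course up to you). A framework’s data-parallel training loop is essentially doing this to the gradients after every batch.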
You could write your own code to transfer data directly, as if it were one big memory array, subject to the 4-neighbor limitation in DGX-1. In practice I don’t think many people do that for the purpose of distributed model training.
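For completeness, an explicit GPU-to-GPU transfer of that kind is basically a one-liner at the CUDA level. A rough sketch (error checking omitted, assumes at least 2 GPUs):

```
#include <cuda_runtime.h>

int main() {
  const size_t bytes = 1 << 26;                 // 64 MB
  float *on_gpu0, *on_gpu1;

  cudaSetDevice(0); cudaMalloc(&on_gpu0, bytes);
  cudaSetDevice(1); cudaMalloc(&on_gpu1, bytes);

  // Device-to-device copy; where a direct peer path exists the copy can
  // travel over NVLink, otherwise the runtime stages it through host memory.
  cudaMemcpyPeer(on_gpu1, 1, on_gpu0, 0, bytes);

  cudaSetDevice(0); cudaFree(on_gpu0);
  cudaSetDevice(1); cudaFree(on_gpu1);
  return 0;
}
```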
In DGX-2, the architecture is different. Each GPU is connected not directly to the other GPUs over NVLink, but instead to an NVSwitch switch tree. This switch tree can be seen as a massive crossbar, providing a logical connection from every GPU to every other GPU (among all 16 GPUs in a DGX-2). This has several benefits:
- We can now write programs where all GPU memory (at least among 8 GPUs) is mapped into the address space of a single GPU, and that single GPU can work with the memory, from a logical perspective, as if it were local (see the sketch after this list).
- Each GPU can connect to every other GPU in the system directly.
- The peak theoretical bandwidth between any 2 GPUs in the system is the aggregate bandwidth of the 6 Gen2 NVLink links per GPU: 300GB/s bidirectional (150GB/s per direction) between any 2 GPUs. That 300GB/s figure is on the order of the main memory bandwidth of, for example, a Titan X card (480GB/s peak). Therefore, although a DGX-2 GPU has ~900GB/s of bandwidth to its local memory, it still has a substantial amount of bandwidth (300GB/s) to any other GPU in the system.
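As referenced in the first bullet above, here is a rough sketch of what that mapping looks like at the CUDA level (hypothetical kernel, error checking mostly omitted). Once peer access is enabled, a kernel running on GPU 0 can simply dereference a pointer whose backing memory was allocated on GPU 1:

```
#include <cstdio>
#include <cuda_runtime.h>

// Kernel running on GPU 0 that dereferences a pointer whose
// backing memory physically lives on GPU 1 (peer memory).
__global__ void scale_remote(float *remote, int n, float factor) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) remote[i] *= factor;   // loads/stores travel over NVLink/NVSwitch
}

int main() {
  const int n = 1 << 20;

  // Allocate a buffer on GPU 1
  float *buf1 = NULL;
  cudaSetDevice(1);
  cudaMalloc(&buf1, n * sizeof(float));
  cudaMemset(buf1, 0, n * sizeof(float));

  // From GPU 0, map GPU 1's memory into GPU 0's address space
  cudaSetDevice(0);
  int can = 0;
  cudaDeviceCanAccessPeer(&can, 0, 1);
  if (!can) { printf("GPU 0 cannot peer-access GPU 1 on this system\n"); return 1; }
  cudaDeviceEnablePeerAccess(1, 0);   // second argument (flags) must be 0

  // GPU 0 now works on GPU 1's memory as if it were local
  scale_remote<<<(n + 255) / 256, 256>>>(buf1, n, 2.0f);
  cudaDeviceSynchronize();
  printf("kernel on GPU 0 touched memory on GPU 1: %s\n",
         cudaGetErrorString(cudaGetLastError()));

  cudaSetDevice(1);
  cudaFree(buf1);
  return 0;
}
```

The same mechanism is what the 4-neighbor discussion above is about on DGX-1V: cudaDeviceEnablePeerAccess only succeeds for pairs the runtime reports as peer-capable.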
Again, in either architecture, you would typically just use the features built into the framework you are using for distributed model training. The framework takes care of choosing the best communication methods between the GPUs.
Thank you, and how does it work on the cheaper DGX Station?
The DGX Station has 4 GPUs connected via NVLink in a fully-connected 4-way topology. Each GPU has 3 neighbors, with a double NVLink connection to each. Once again, GPU memory can be mapped into a neighbor’s address space, but frameworks will typically use NCCL to optimize the communication pattern among the 4 GPUs for distributed model training.
If you’d like a deeper dive on the DGX Station, I suggest reading the architecture whitepaper:
[url]https://www.nvidia.com/object/dgx-station-system-architecture-whitepaper.html[/url]