This is my idea about GPU multi-DIE design

user23000 · November 13, 2021, 11:09am

I have a question: can I use multiple GPUs at the same time? If the answer is yes, can anyone explain how?

pofice · January 29, 2022, 4:22am

GPU core architecture diagram: the core is code-named Creeper 200 and uses a new interconnection method. An L1.5 cache is introduced between the SMs and the L2 cache to act as a buffer. ZRLink can now access the L2 cache directly, and each GPM has 8 ZRLinks for high-speed interconnection with other GPMs. The design is also redundant: even if one ZRLink is damaged and the direct connection between two GPMs in the full mesh is lost, traffic can pass through other GPMs to reach the GPM that needs to be accessed.
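To make the redundancy idea concrete, here is a minimal Python sketch of rerouting around one damaged link in a fully connected group of 8 GPMs. Only the GPM count comes from the post; the function names, the breadth-first search, and the choice of which link fails are my own assumptions for illustration, not how ZRLink would actually route.

```python
from collections import deque
from itertools import combinations

NUM_GPMS = 8  # from the post; everything else below is illustrative


def full_mesh(n):
    """Return the undirected links of a fully connected group of n GPMs."""
    return {frozenset(pair) for pair in combinations(range(n), 2)}


def route(links, src, dst):
    """Breadth-first search for a shortest path over the healthy links."""
    parents = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in range(NUM_GPMS):
            if nxt not in parents and frozenset((node, nxt)) in links:
                parents[nxt] = node
                queue.append(nxt)
    return None  # dst is unreachable


links = full_mesh(NUM_GPMS)
direct = route(links, 0, 5)
links.discard(frozenset((0, 5)))  # simulate one damaged link (hypothetical)
detour = route(links, 0, 5)

print(direct)  # [0, 5]
print(detour)  # [0, 1, 5]
```

With one link removed, the path grows from one hop to two, relayed through a neighbouring GPM.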

pofice · January 29, 2022, 6:44am

The interconnection method of 8 GPU modules: the 8 GPMs are fully connected to one another, and each GPM also has a link to the switch, forming a fully connected + star topology. When two GPU modules exchange data, the exchange preferentially goes over the fully connected links; when accessing another GPU, they connect through the switch. The topology between switches is likewise fully connected + star. When data is exchanged between switches (GPUs), a fully connected link is preferred to ensure lower latency and greater bandwidth. In a system with 8 GPUs, the CPU can be connected directly to each switch, forming a star topology that reduces latency while changing the existing CPU as little as possible.
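The link budget and the path preference described above can be sketched in a few lines of Python. Reading the 8 ZRLinks per GPM mentioned earlier in the thread as 7 peer links plus 1 switch link is my assumption, as are the function names and the `(gpu_id, gpm_id)` addressing.

```python
from itertools import combinations

GPMS_PER_GPU = 8  # from the post


def link_counts(n):
    """Count the links in a fully connected + star topology of n modules."""
    mesh_links = len(list(combinations(range(n), 2)))  # one link per pair
    star_links = n                                     # one uplink per module
    ports_per_module = (n - 1) + 1                     # peers + switch (assumed)
    return mesh_links, star_links, ports_per_module


def choose_path(src, dst):
    """Pick a path between two GPMs addressed as (gpu_id, gpm_id).

    Same GPU: use the direct mesh link. Different GPU: go through each
    GPU's switch, which are themselves assumed to be directly linked.
    """
    if src == dst:
        return [src]
    if src[0] == dst[0]:
        return [src, dst]
    return [src, ("switch", src[0]), ("switch", dst[0]), dst]


print(link_counts(GPMS_PER_GPU))    # (28, 8, 8)
print(choose_path((0, 1), (0, 6)))  # [(0, 1), (0, 6)]
print(choose_path((0, 1), (3, 6)))  # [(0, 1), ('switch', 0), ('switch', 3), (3, 6)]
```

Under that reading, 7 peer links plus 1 switch link gives exactly 8 ports per GPM.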

pofice · September 12, 2022, 7:03am

As shown in the figure, this is a GPU consisting of 8 GPMs. The middle die contains the switch and the PCIe host interfaces. It handles the high-speed interconnections between each GPM and other GPUs, and it provides a PCIe host interface for each GPM so that the GPU can be used in ordinary computers, which improves versatility.

pofice · November 20, 2022, 12:29pm

GPU core architecture diagram: the core is code-named Block 100. The ALU cluster is based on a unified computing architecture, and each SM has 128 FP32 units and 8 tensor cores. Each GPC contains 10 SMs, and each GPU die contains two GPCs, for a total of 2560 FP32 CUDA cores. The die supports multi-instance GPU (MIG) technology and can be divided into up to 20 independent GPU instances. One GPC, one TPC, or one SM can form a GPU instance, so a GPU die can be divided into 2, 10, or 20 GPU instances, three modes in total. Instances of each mode can run simultaneously, and the modes can be combined, so instances of different modes can also run at the same time. Each instance has its own memory, cache, and streaming multiprocessors.
With multiple GPU instances, GPU cloud acceleration can be provided to a large number of clients, for example one per mobile phone account, with different resources allocated to different accounts. Clients with lower-end hardware can use GPU cloud acceleration to make games stutter less, and the same approach can be applied to school servers to provide GPU computing power to each client, and more…
Doubling the number of ZRLinks on this GPU core also doubles the bandwidth of the connection. An L1.5 cache is again introduced between the SMs and the L2 cache as a buffer.
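The numbers in this post can be checked with a short Python sketch. The per-SM, per-GPC, and per-die figures come from the post. The value of 2 SMs per TPC is my assumption, inferred from 20 SMs yielding 10 TPC-sized instances, and the mixed-mode check only counts SMs and ignores any alignment rules real hardware would impose.

```python
FP32_PER_SM = 128   # from the post
SMS_PER_GPC = 10    # from the post
GPCS_PER_DIE = 2    # from the post
SMS_PER_TPC = 2     # assumption: 20 SMs / 10 TPC-sized instances

SMS_PER_DIE = SMS_PER_GPC * GPCS_PER_DIE
FP32_PER_DIE = FP32_PER_SM * SMS_PER_DIE

# How many SMs one instance of each granularity consumes.
SM_COST = {"GPC": SMS_PER_GPC, "TPC": SMS_PER_TPC, "SM": 1}


def max_instances(unit):
    """Largest number of instances when the whole die uses one granularity."""
    return SMS_PER_DIE // SM_COST[unit]


def fits(partition):
    """Check whether a mixed partition, e.g. {'GPC': 1, 'SM': 10}, fits on one die."""
    used = sum(SM_COST[unit] * count for unit, count in partition.items())
    return used <= SMS_PER_DIE


print(FP32_PER_DIE)                                    # 2560
print([max_instances(u) for u in ("GPC", "TPC", "SM")])  # [2, 10, 20]
print(fits({"GPC": 1, "TPC": 3, "SM": 4}))             # True  (10 + 6 + 4 SMs)
print(fits({"GPC": 2, "SM": 1}))                       # False (21 SMs needed)
```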
The four memory controllers on the GPU serve HBM3 high-bandwidth memory. They sit close to the memory, which makes wiring easy and saves production cost. The memory controllers at the left and right ends serve GDDR6X memory. For deep learning training, the GPU prefers HBM3, because the models being trained are generally not very large and mixed precision (FP16+FP8) is supported, so video memory capacity is usually not very tight during training.
For AI inference and for training some large models, the size of the video memory is particularly important, because it directly determines whether the GPU can run training and inference at all. This calls for the GDDR6X memory controllers at the left and right ends. As deep learning algorithms iterate, GPUs need more and more memory. For example, the recently leaked diffusion-model-based AI painting model, which relies on single-precision floating-point performance, cannot even run inference at a 720p image size with 24 GB of memory. This made me think hard, and I believe one core can support HBM3 and GDDR6X memory at the same time. GDDR6X is cheaper than HBM3 and has much higher bandwidth than memory shared with the CPU, so it is a cost-effective option, and we can fall back to GDDR6X when HBM3 is not enough. Both are packaged in the same SXM module, and the SXM module's power supply is moved to the motherboard to improve compute density.
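The "use GDDR6X when HBM3 is not enough" policy can be sketched as a two-tier allocator. This is only a toy model in Python: the class name, the capacities, and the first-fit rule are all hypothetical, and a real driver would also have to deal with fragmentation, migration, and bandwidth.

```python
class TieredMemory:
    """Toy allocator that prefers HBM3 and falls back to GDDR6X."""

    TIERS = ("HBM3", "GDDR6X")  # fastest tier first

    def __init__(self, hbm_gb, gddr_gb):
        # Capacities are hypothetical; the post does not give any.
        self.free = {"HBM3": hbm_gb, "GDDR6X": gddr_gb}

    def allocate(self, size_gb):
        """Return the tier that received the allocation, fastest tier first."""
        for tier in self.TIERS:
            if self.free[tier] >= size_gb:
                self.free[tier] -= size_gb
                return tier
        raise MemoryError(f"no tier has {size_gb} GB free")


mem = TieredMemory(hbm_gb=32, gddr_gb=48)
first = mem.allocate(24)   # fits in HBM3
second = mem.allocate(24)  # only 8 GB of HBM3 left, so it goes to GDDR6X
third = mem.allocate(8)    # the remaining HBM3 is still used first

print(first, second, third)  # HBM3 GDDR6X HBM3
```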
