This is my idea about GPU multi-DIE design



This is a multi-DIE design scheme intended to improve GPU computing density.
ZRLink is roughly equivalent to NVLink.
The big DIE in the middle acts as a switch (similar in function to NVSwitch) and integrates a RISC processor, so every pair of GPUs can maintain low-latency communication and computing efficiency improves. It also has its own VRAM controller so that the internal RISC processor can work normally. Each GPU can independently access VRAM over a high-speed bus, and the GPUs can interact with each other. Because of the internal RISC processor, work can be scheduled across the GPUs sensibly, which raises the efficiency of the GPU interconnect.
The PCB looks like this: there are 3200 pin contacts (ZRLink) on the back for card-to-card interconnection, which can provide higher bandwidth than PCIe or NVLink. The power-delivery components for the graphics card sit on the motherboard, much like a CPU socket, which frees up more PCB area for the GPUs. The large DIE in the middle of each compute card connects directly to the large DIEs of the other cards to reduce access latency. Each GPU and the large DIE can independently access VRAM over a high-speed bus and interact with each other (HBM2 is preferred; GDDR6 is used once the HBM2 capacity is exhausted).
This GPU multi-DIE design helps avoid wasted fab capacity, reduces wafer production cost, and improves yield.
The GPU is deliberately designed as a small die with a small area, so the wafer can be used effectively, lowering cost and defect rate. Different performance tiers can then be built for different users simply by stacking more or fewer dies. High-performance computing and lower cost are achieved at the same time.
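The cost argument above can be sketched with the classic Poisson defect-density yield model. The defect density and die areas below are illustrative assumptions of mine, not figures from the proposal:

```python
import math

def die_yield(area_mm2: float, defects_per_mm2: float) -> float:
    """Poisson yield model: fraction of dies with zero defects."""
    return math.exp(-area_mm2 * defects_per_mm2)

# Assumed defect density for illustration only.
D0 = 0.001  # defects per mm^2

# Compare one large 800 mm^2 monolithic die with a 200 mm^2 chiplet.
big = die_yield(800, D0)
small = die_yield(200, D0)

# Each small die yields better, and a defective small die wastes far
# less silicon, so cost per good die drops.
print(f"800 mm^2 monolithic yield: {big:.1%}")
print(f"200 mm^2 chiplet yield:    {small:.1%}")
```

Under these assumptions the small die yields roughly 82% versus roughly 45% for the monolithic die, which is the effect the design relies on.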
This is the internal topology of the GPU system (for reference only)
Inside that big DIE are two half-ring buses (the gray part is the bus; it looks like one ring but is actually two half rings). The upper and lower halves can each interconnect the GPUs independently, so if one half is damaged the cores can still access each other, though the bandwidth is halved.
The left side of the arrow shows the earlier interconnection scheme. Because it needed too many ZRLink nodes and made PCB routing difficult, it was abandoned and simplified into the design on the right side of the arrow. In the 4-core GPU system, each GPU has one ZRLink node for internal interconnection, and there is a high-speed bus inside each core, so cross-DIE access latency is not serious (see the GPU core architecture diagram for details). In the 8-core GPU system, each GPU has two ZRLink nodes. One of them is used for the internal interconnection of a 4-core sub-system, so that at low load the high-speed hub in the middle of the 8-core system can be bypassed and the GPUs interconnect bridgelessly, just like the 4-core system; some hub buses and interfaces can then be shut down, reducing power consumption and latency. At high load, both nodes of each GPU exchange data with the nodes they connect to, reducing access latency. There is also a RISC processor inside the big DIE in the middle, which can assist computation and schedule the GPUs sensibly to improve computing efficiency.
This is the core architecture diagram of the GPU (the core codename is RedStone 100, which I named).
The four memory controllers at the top control the HBM2 (HBM2E) video memory; they are placed close to the memory for easy routing.
The two memory controllers at the upper left and upper right control GDDR6, so more video memory can be added.
There is a PCIe 5.0 host interface, and the two ZRLinks above it form the high-speed bus & high-speed hub, which makes data exchange inside and outside the core, as well as GPU interconnection, convenient.
As for the SM: it is similar to the Volta SM, and all SMs retain the FP64 double-precision units. Each SM has 32 FP64 units, 64 FP32 units, 64 INT32 units, 32 tensor cores, and one RT core. Each GPC has 8 SMs, and each GPU DIE has two GPCs, for a total of 1024 CUDA cores.
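The per-die totals stated above can be checked directly from the per-SM figures:

```python
# Per-SM resources as described for the RedStone 100 die (Volta-like SM).
FP64_PER_SM = 32
FP32_PER_SM = 64   # CUDA cores
SM_PER_GPC = 8
GPC_PER_DIE = 2

sms = SM_PER_GPC * GPC_PER_DIE      # 16 SMs per die
cuda_cores = sms * FP32_PER_SM      # 16 * 64 = 1024 CUDA cores
fp64_units = sms * FP64_PER_SM      # 16 * 32 = 512 FP64 units

print(f"{sms} SMs, {cuda_cores} CUDA cores, {fp64_units} FP64 units per die")
```

This confirms the 1024 CUDA figure for a single RedStone 100 die.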
Then this is the memory card
It carries 16 HBM2 (HBM2E) memory stacks and 32 GDDR6 chips.
In the middle is the video memory controller, which connects to the GPUs through ZRLink and the high-speed hub, so both the memory and the GPUs maintain low-latency communication.



This is the topology of the multi-ZRHub system (I named the big DIE in the middle ZRHub; the internal structure is for reference only).
This topology looks a bit like the internal topology of the earlier GPU system. Indeed, because the earlier GPU DIE was not large, it did not need many nodes for interconnection; the ZRHub, however, must connect more GPUs, and a GPU interconnect hub has relatively high bandwidth requirements, so more nodes are needed.
A ZRHub has 32 ZRLinks in total. In a system of 8 graphics cards (each with a ZRHub), every ZRHub has 4 ZRLink nodes for hub-to-hub connections (for convenience I have drawn those nodes together) and 16 ZRLink nodes for GPU-to-GPU interconnection. In the 8-ZRHub system, each ZRHub has 8 ZRLinks left as redundancy, which makes later expansion convenient. The sky-blue block in the middle of the ZRHub is the cache, connected to the two semicircular high-speed buses; below it are 8 RISC processor cores. (Although CISC instruction sets are rich, the instruction, register, and pipeline characteristics of RISC make it very suitable for parallel computing, so I personally prefer RISC.) Because the ZRHub has an integrated processor, it can work normally without an additional CISC host processor; for better compatibility, however, it is best to pair it with a CISC processor when building high-performance computing systems.


This is the topology of 16 ZRHub systems
Compared with the previous picture, I made some changes. I moved the idle ZRLinks on both sides into the middle to increase the hub-to-hub bandwidth, so 16 ZRLinks in each ZRHub now participate in the hub interconnection. The drawing shows 8 line segments because links sharing the same path are drawn together; in fact there are 16 links. Likewise, in each ZRHub GPU system, 16 ZRLinks handle the interconnection between GPUs. Of course, it is best to leave some redundant ZRLinks for later expansion and unified addressing (in the picture I drew only 16, all used to interconnect the GPUs, with no redundancy; in practice there should be more). If ZRLink is also used to interconnect the CPU and the GPU, the GPUs become fully connected to the CPU as well: when the CPU connects directly to the ZRHub, PCIe is no longer responsible for linking CPU and GPU, which increases bandwidth and reduces access latency (but this requires opening the protocol to other manufacturers). I changed the ZRLinks on both sides to PCIe switches to make it easy to attach the CPU, NIC, and expansion cards. Because the ZRHub integrates a PCIe switch, no separate PCIe switch chip is needed, so the CPU and expansion cards can connect directly to the ZRHub, reducing latency and improving efficiency.
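The ZRLink budget of the 16-ZRHub topology can be tallied as a quick sanity check (the grouping of drawn segments follows the description above):

```python
# Per-ZRHub link budget in the 16-ZRHub system.
ZRLINKS_PER_HUB = 32
hub_to_hub = 16   # links toward other ZRHubs
hub_to_gpu = 16   # links toward the GPUs on the same card

# Every ZRLink is used; no redundancy remains in this configuration.
assert hub_to_hub + hub_to_gpu == ZRLINKS_PER_HUB

# Links sharing the same path are drawn together in the diagram:
drawn_segments = 8
links_per_segment = hub_to_hub // drawn_segments  # 2 physical links per segment
print(links_per_segment)
```

So the 8 drawn segments correspond to 16 physical hub-to-hub links, matching the description.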

The above is my main idea; of course there is more, and if I have time I will continue to add to it. I am a high school student and school is about to start, so I may not have time recently. I am also a fan of NVIDIA and have a strong interest in hardware. If you have any views on these ideas, you are welcome to discuss them. I look forward to your early reply, thank you!



To add: because of the high bandwidth required between GPUs, a silicon interposer on the package substrate is needed to guarantee sufficient interconnect bandwidth between the GPU DIEs, since it is difficult to reach such bandwidths with PCB routing alone.

To add: the ZRHub is actually a switch responsible for data exchange between ZRLinks, not a hub in the sense of a simple repeater (集线器). Since it is an SoC, the "Hub" in the name means a central junction (枢纽), hence the name ZRHub.


This is the core architecture diagram of the new GPU (the core codename is Diamond 100, which I named).
The four memory controllers at the top control the HBM2 (HBM2E) video memory; they are placed close to the memory for easy routing.
The two memory controllers at the upper left and upper right control GDDR6, so more video memory can be added.
There is a PCIe 5.0 host interface, and the 4 ZRLinks above it form the high-speed bus & high-speed hub, which makes data exchange inside and outside the core, as well as GPU interconnection, convenient.
The SM is based on the Ampere SM: each SM has 32 FP64 units, 64 FP32 units, 64 INT32 units, and 4 tensor cores. Each GPC has 10 SMs and each GPU DIE has two GPCs, for a total of 1280 CUDA cores. The die supports Multi-Instance GPU (MIG) technology: one GPU DIE can be divided into up to 20 independent GPU instances. A GPC, a TPC, or an SM can each become one GPU instance, so a die can be split into 2, 10, or 20 instances, three modes in total, and instances of different modes can run simultaneously and be mixed with each other. Each instance has its own memory, cache, and streaming multiprocessors. With many GPU instances, GPU cloud acceleration can be provided to a large number of clients, for example one instance per mobile phone account, with resources allocated per account. On lower-spec clients, GPU cloud acceleration can make the game picture less laggy; a school server could likewise provide GPU computing power to each client, and of course there are more uses. The number of ZRLinks on this GPU core also doubles, which means the interconnect bandwidth doubles.
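The three MIG partition modes described above follow directly from the die layout. This sketch assumes a TPC groups two SMs, as in recent NVIDIA architectures; that grouping is my assumption, not stated in the proposal:

```python
# Diamond 100 die layout as described in the post.
SM_PER_GPC = 10
GPC_PER_DIE = 2
FP32_PER_SM = 64

total_sms = SM_PER_GPC * GPC_PER_DIE   # 20 SMs per die
cuda = total_sms * FP32_PER_SM         # 20 * 64 = 1280 CUDA cores

# The three MIG granularities: one instance per GPC, per TPC, or per SM.
modes = {
    "per-GPC": GPC_PER_DIE,        # 2 instances
    "per-TPC": total_sms // 2,     # 10 instances (assuming 2 SMs per TPC)
    "per-SM":  total_sms,          # 20 instances
}
print(cuda, modes)
```

This reproduces the 2 / 10 / 20 instance counts and the 1280-CUDA total.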


This is another GPU core architecture diagram, codename Slime 100.
It is based on the third-generation tensor core, with double the tensor-core count compared with the design above, 20 SMs, and 1280 FP32 CUDA cores.
The FP64 and INT32 units were removed to make room for tensor cores for AI training.


The number of ZRLinks has doubled and so has the bandwidth, using the new ZRLink 3.0 interconnection technology.
The above is the bridgeless interconnection version of ZRLink 3.0. In the 4-core GPU system, two ZRLinks are added per GPU, and 12 ZRLink nodes in total participate in the interconnection, enabling peer-to-peer communication among the 4 GPUs; compared with the previous generation's 8 ZRLink nodes this is a qualitative leap, and each GPU keeps one redundant ZRLink for future expansion. In the 6-core GPU system, 24 ZRLink nodes and 6 PCIe 5.0 host interfaces participate in the interconnection, making 12 ZRLink lines and 3 PCIe 5.0 lines, which together achieve full peer-to-peer communication. To reach high-speed internal communication in the 6-core system, all ZRLink nodes on the same substrate take part in the internal interconnection, leaving no redundancy, but each GPU DIE still has a PCIe 5.0 host interface for external interconnection, and the interconnect bandwidth is sufficient. In the 8-core GPU system, of course, the switch participates in the interconnection; there are enough ZRLink nodes for external interconnection, and performance is higher than in the 6-core system.
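The node counts for the bridgeless full-mesh configurations can be verified by counting GPU pairs, since peer-to-peer communication needs one point-to-point line (two endpoint nodes) per pair:

```python
from math import comb

# 4-die system: one ZRLink line per GPU pair.
pairs_4 = comb(4, 2)       # 6 lines in a full mesh of 4 GPUs
nodes_4 = pairs_4 * 2      # 12 ZRLink nodes, as stated in the post

# 6-die system: 15 pairs in a full mesh; 12 pairs are covered by
# ZRLink lines and the remaining 3 by PCIe 5.0 host interfaces.
pairs_6 = comb(6, 2)                      # 15 pairs
zrlink_lines_6 = 24 // 2                  # 12 lines from 24 ZRLink nodes
pcie_lines_6 = pairs_6 - zrlink_lines_6   # 3 PCIe 5.0 lines

print(nodes_4, zrlink_lines_6, pcie_lines_6)
```

So 24 ZRLink nodes plus 3 PCIe 5.0 lines are exactly what a 6-GPU full mesh requires, consistent with the figures above.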