This is my idea for a GPU multi-DIE design: a multi-die scheme intended to improve GPU compute density.
ZRLink is roughly equivalent to NVLink.
The big DIE in the middle acts like an NVSwitch and integrates a RISC processor, so every pair of GPU dies can communicate with low latency and overall compute efficiency improves.
The big DIE in the middle works as a switch, ensuring low-latency communication between each GPU and the GPUs connected to it. It also has its own VRAM controller so that its internal RISC processor can operate normally. Each GPU can independently access VRAM over a high-speed bus and exchange data with the others, and the internal RISC processor can schedule the GPUs sensibly to make the interconnect more efficient.
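To make the switch die's role more concrete, here is a minimal sketch of how it might route GPU-to-GPU and GPU-to-VRAM requests. The class names, hop latency, and link bandwidth are illustrative assumptions, not figures from the design.

```python
from dataclasses import dataclass

@dataclass
class Request:
    src: str        # requesting GPU die, e.g. "GPU0"
    dst: str        # target: another GPU die or a VRAM channel
    size_bytes: int

class SwitchDie:
    """Toy model of the central die: routes traffic between GPU dies and VRAM channels."""
    def __init__(self, gpus, vram_channels, hop_latency_ns=40, link_gbps=400):
        self.gpus = set(gpus)
        self.vram = set(vram_channels)
        self.hop_latency_ns = hop_latency_ns   # assumed per-hop cost
        self.link_gbps = link_gbps             # assumed per-link bandwidth (GB/s)

    def route(self, req: Request) -> float:
        """Estimated latency (ns) for one request crossing the switch."""
        if req.dst not in self.gpus and req.dst not in self.vram:
            raise ValueError(f"unknown destination {req.dst}")
        transfer_ns = req.size_bytes / self.link_gbps  # bytes / (GB/s) == ns
        return 2 * self.hop_latency_ns + transfer_ns   # hop in + hop out

switch = SwitchDie(gpus=[f"GPU{i}" for i in range(4)],
                   vram_channels=[f"HBM{i}" for i in range(4)])
print(switch.route(Request("GPU0", "GPU2", 4096)), "ns (GPU-to-GPU)")
print(switch.route(Request("GPU1", "HBM3", 4096)), "ns (GPU-to-VRAM)")
```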



The PCB looks like this: on the back there are 3200 pin contacts (ZRLink) for card-to-card interconnection, which can provide higher bandwidth than PCIe or NVLink. The card's power delivery sits on the motherboard, similar to a CPU, which frees up more PCB area for the GPUs. The large DIE in the middle of each compute card connects directly to the large DIE of the next card to reduce access latency. Each GPU and the large DIE can independently access VRAM over a high-speed bus and exchange data with each other (HBM2 is preferred; GDDR6 is used as overflow once HBM2 is full).
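Here is a tiny sketch of the "HBM2 first, spill to GDDR6 when full" placement policy mentioned above. The pool capacities are illustrative assumptions, not part of the design.

```python
def place_allocation(size_gib, hbm2_free_gib, gddr6_free_gib):
    """Place an allocation in HBM2 if it fits, otherwise spill to GDDR6."""
    if size_gib <= hbm2_free_gib:
        return "HBM2", hbm2_free_gib - size_gib, gddr6_free_gib
    if size_gib <= gddr6_free_gib:
        return "GDDR6", hbm2_free_gib, gddr6_free_gib - size_gib
    raise MemoryError("allocation does not fit in either pool")

hbm2, gddr6 = 16.0, 32.0   # assumed free capacity in GiB
for size in (6.0, 8.0, 5.0):
    pool, hbm2, gddr6 = place_allocation(size, hbm2, gddr6)
    print(f"{size} GiB -> {pool} (HBM2 free {hbm2}, GDDR6 free {gddr6})")
```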
This GPU multi-DIE design helps avoid wasted production capacity while reducing wafer cost and improving yield.
The GPU die is deliberately designed as a small core with a small area, so the wafer is used effectively and both cost and defect rate drop. Different performance tiers can then be offered to different users simply by stacking more or fewer dies, so high-performance computing is achieved while costs stay down.
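A back-of-the-envelope yield comparison supports the small-die argument. This uses the simple Poisson yield model Y = exp(-D·A); the defect density and die areas are assumed example values, not figures from the design.

```python
import math

def poisson_yield(defect_density_per_cm2, die_area_cm2):
    """Fraction of dies with zero defects under a Poisson defect model."""
    return math.exp(-defect_density_per_cm2 * die_area_cm2)

D = 0.1            # assumed defects per cm^2
big_die = 6.0      # one monolithic 600 mm^2 die
small_die = 1.5    # a 150 mm^2 die; four of them cover the same total area

print(f"monolithic die yield:   {poisson_yield(D, big_die):.1%}")
print(f"single small-die yield: {poisson_yield(D, small_die):.1%}")
# Good small dies can be binned and combined into packages, so usable
# silicon per wafer rises even though four dies are needed per product.
```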

This is the internal topology of the GPU system (for reference only)
Inside that big DIE are two half-ring buses (the gray trace is the bus; it looks like a single ring but is actually two half-rings). The upper and lower halves can each interconnect the GPUs independently, so if one half is damaged, core-to-core access still works, but the bandwidth is halved.
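A minimal sketch of that redundancy behavior, assuming a placeholder per-half-ring bandwidth:

```python
def available_bandwidth(per_half_ring_gbps, upper_ok=True, lower_ok=True):
    """Aggregate bandwidth of the switch die's two independent half-rings."""
    rings = int(upper_ok) + int(lower_ok)
    if rings == 0:
        raise RuntimeError("both half-rings down: no core-to-core path")
    return rings * per_half_ring_gbps

print(available_bandwidth(400))                  # both halves up
print(available_bandwidth(400, lower_ok=False))  # one half damaged: halved
```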
The left side of the arrow shows the earlier interconnection scheme. Because it required too many ZRLink nodes and made PCB routing difficult, it was dropped and simplified into the design on the right side of the arrow. In the 4-core GPU system, each ZRLink has one node for internal interconnection, and there is a high-speed bus inside each core, so cross-DIE access latency is not severe (see the GPU core architecture diagram for details). In the 8-core GPU system, each GPU's ZRLink has two nodes. One of them handles the internal interconnection of a 4-core sub-system, so at low load the 8-core system can bypass the high-speed hub in the middle and interconnect bridgelessly like the 4-core system; some hub buses and interfaces can then be shut down, reducing power consumption and latency. At high load, both nodes of each GPU exchange data with the nodes they connect to, reducing access latency. There is also a RISC processor inside the big DIE in the middle, which can assist with computation and schedule the GPUs sensibly to improve compute efficiency.
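Here is a rough sketch of that load-dependent routing for the 8-core system: below some threshold, traffic stays on the direct (bridgeless) 4-core links and the central hub can be power-gated; above it, the second ZRLink node and the hub are brought up. The threshold value is an assumption.

```python
def choose_path(load_fraction, threshold=0.6):
    """Return (path, hub_power_gated) for the current load level."""
    if load_fraction < threshold:
        # stay on direct inter-die links, keep hub and spare node off
        return "direct 4-core ZRLink", True
    # high load: wake the hub and use both ZRLink nodes per GPU
    return "dual-node via central hub", False

for load in (0.2, 0.5, 0.8):
    path, gated = choose_path(load)
    print(f"load {load:.0%}: {path}, hub power-gated={gated}")
```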
This is the architecture diagram of the GPU core (codenamed RedStone 100, a name I picked).
The four memory controllers at the top drive the HBM2 (HBM2E) video memory; they sit closer to the memory to simplify routing.
The two memory controllers at the upper left and upper right drive GDDR6, so extra video memory can be added.
There is a PCIe 5.0 host interface, and the two ZRLinks above it form the high-speed bus and high-speed hub, handling data exchange inside and outside the core as well as GPU interconnection.
As for the SM unit: it is similar to a Volta SM, and all SMs retain FP64 double-precision units. Each SM has 32 FP64 units, 64 FP32 units, 64 INT32 units, 32 Tensor cores, and one RT core. Each GPC has 8 SMs, and each GPU DIE has two GPCs, for a total of 1024 CUDA cores per die.
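A quick arithmetic check of the per-die shader count, using only the figures stated above:

```python
FP32_PER_SM = 64   # CUDA cores per SM
SM_PER_GPC = 8
GPC_PER_DIE = 2

cuda_cores_per_die = FP32_PER_SM * SM_PER_GPC * GPC_PER_DIE
fp64_units_per_die = 32 * SM_PER_GPC * GPC_PER_DIE
print(cuda_cores_per_die)   # 1024 CUDA cores per GPU die
print(fp64_units_per_die)   # 512 FP64 units per GPU die
```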


Then this is the memory card
It carries 16 HBM2 (HBM2E) stacks and 32 GDDR6 chips.
In the middle is the video memory controller, connected to the GPUs through ZRLink and the high-speed hub, so the memory and the GPUs can communicate with low latency.
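For a rough sense of scale, here is an estimate of the memory card's total capacity and peak bandwidth. The per-device figures (8 GB / ~307 GB/s per HBM2E stack, 2 GB / ~64 GB/s per GDDR6 chip) are assumed typical values, not part of the original design.

```python
HBM2E_STACKS, GB_PER_STACK, GBPS_PER_STACK = 16, 8, 307
GDDR6_CHIPS, GB_PER_CHIP, GBPS_PER_CHIP = 32, 2, 64

capacity = HBM2E_STACKS * GB_PER_STACK + GDDR6_CHIPS * GB_PER_CHIP
bandwidth = HBM2E_STACKS * GBPS_PER_STACK + GDDR6_CHIPS * GBPS_PER_CHIP
print(f"total capacity : {capacity} GB")
print(f"peak bandwidth : {bandwidth} GB/s (before ZRLink/hub overhead)")
```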