NVIDIA GB200 NVL72 Delivers Trillion-Parameter LLM Training and Real-Time Inference

Originally published at: https://developer.nvidia.com/blog/nvidia-gb200-nvl72-delivers-trillion-parameter-llm-training-and-real-time-inference/

What is the interest in trillion-parameter models? We know many of the use cases today, and interest is growing due to the promise of an increased capacity for:
- Natural language processing tasks like translation, question answering, abstraction, and fluency.
- Holding longer-term context and conversational ability.
- Multimodal applications combining language, vision, and speech.
- Creative applications like…

Can you clarify whether the bandwidth is 900 GB/s or 1.8 TB/s (see the quote below)? Also, is liquid immersion cooling required for the 200 Gbps SerDes and the 1.8 TB/s NVLink Gen 5 to function? Thanks.
“The heart of the GB200 NVL72 is the NVIDIA GB200 Grace Blackwell Superchip. It connects two high-performance NVIDIA Blackwell Tensor Core GPUs and the NVIDIA Grace CPU with the NVLink-Chip-to-Chip (C2C) interface that delivers 900 GB/s of bidirectional bandwidth.”

I believe the GPU-to-GPU NVLink bandwidth is 1.8 TB/s, but since the Grace CPU is reused from GH200, the CPU-to-GPU NVLink bandwidth is still 900 GB/s.

B100 and B200 seem to have 180GB of GPU memory as per this article: NVIDIA HGX AI Supercomputing Platform,
but GB200 with two GPUs seems to have 384GB (192GB per GPU) of GPU memory. Wondering if there is a typo somewhere.

Hi Kiranpalli,

Thanks for the clarification. Are the 5th gen NVLink 1.8 TB/s Cu cables the same as those used for NVLink gen 4 (0.9 TB/s)? Thanks.

Best,
Peter

pldchange, NVLink-C2C is a 900 GB/s bidirectional interconnect between CPU and GPU, used to connect Grace to Hopper or Blackwell GPU(s). In the case of GB200, it is shared between the two Blackwell GPUs and Grace. 5th generation NVLink is a 1.8 TB/s interconnect between GPUs. Liquid cooling is used to dissipate the heat from the Superchip, but it is not required for the SerDes to function. The 5th generation NVLink cables are spec’d for the fast 1.8 TB/s data rate.
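As a rough back-of-the-envelope check of these figures (a sketch assuming 18 fifth-generation NVLink links per Blackwell GPU at 100 GB/s bidirectional each, numbers recalled from NVIDIA's public Blackwell specs rather than stated in this thread):

```python
# Back-of-the-envelope NVLink bandwidth check.
# Link count and per-link rate are assumptions from NVIDIA's public
# Blackwell specs; the 900 GB/s NVLink-C2C figure is quoted in the post above.

NVLINK5_LINKS_PER_GPU = 18       # assumed: fifth-gen NVLink links per Blackwell GPU
NVLINK5_BW_PER_LINK_GBS = 100    # assumed: GB/s bidirectional per link

per_gpu_nvlink_bw = NVLINK5_LINKS_PER_GPU * NVLINK5_BW_PER_LINK_GBS
print(f"Per-GPU NVLink 5 bandwidth: {per_gpu_nvlink_bw} GB/s")  # 1800 GB/s = 1.8 TB/s

NVLINK_C2C_BW_GBS = 900          # Grace <-> Blackwell NVLink-C2C, bidirectional
print(f"NVLink-C2C (CPU-GPU) bandwidth: {NVLINK_C2C_BW_GBS} GB/s")
```

So the 900 GB/s and 1.8 TB/s figures describe two different interconnects rather than contradicting each other.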

Regarding “PCIe gen 6 support for high-speed networking”: can you clarify whether this is Gen 5 or Gen 6?

Congratulations! That’s indeed an impressive improvement.
May I know more about the inference environment, such as the parallelism mode used for inference, the batch size, and so on?

PCIe Gen 6 is supported.

Regarding “The NVIDIA GB200 NVL72 introduces fifth-generation NVLink, which connects up to 576 GPUs in a single NVLink domain with over 1 PB/s total bandwidth and 240 TB of fast memory.”: Is this done with L2/L3 NVLink switches? The one-rack NVL72 seems to saturate the ports of all 9 NVLink switch trays with 72 GPUs. For a 576-GPU SuperPOD, does the GPU count per rack need to be reduced to 36 to leave half of the NVLink Switch ports for uplink connections to L2/L3 fabrics in order to scale up to 576 GPUs?
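As a rough sketch of the port counting behind this question (assuming 18 NVLink 5 ports per Blackwell GPU and 144 NVLink ports per switch tray, figures recalled from NVIDIA's public NVL72 materials rather than from this thread):

```python
# Port accounting for a single NVL72 rack.
# All figures are assumptions based on NVIDIA's published NVL72 description.

GPUS_PER_RACK = 72
NVLINK_PORTS_PER_GPU = 18        # assumed: NVLink 5 ports per Blackwell GPU
SWITCH_TRAYS_PER_RACK = 9
PORTS_PER_SWITCH_TRAY = 144      # assumed: 2 NVLink Switch chips x 72 ports per tray

gpu_side_links = GPUS_PER_RACK * NVLINK_PORTS_PER_GPU               # 1296
switch_side_ports = SWITCH_TRAYS_PER_RACK * PORTS_PER_SWITCH_TRAY   # 1296

print(f"GPU NVLink links:       {gpu_side_links}")
print(f"Switch tray ports:      {switch_side_ports}")
print(f"Ports left for uplinks: {switch_side_ports - gpu_side_links}")  # 0 -> saturated
```

Under these assumptions a single NVL72 rack indeed has no spare switch ports left for uplinks, which is what motivates the question.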

Welcome rangchen.yu. There are multiple network topology options for NVL576, which we review with customers to match their workloads and all-to-all performance needs. One topology option may use L2 switches, where a number of ports are used as uplinks while the compute trays still fully connect every GPU to NVLink. Closer to shipping, we will publish a reference design document detailing NVL576. In the interim, if you are an interested customer, please work with your NVIDIA account team to match options to your unique needs.
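Purely as an illustration of the uplink trade-off mentioned in the reply above (a generic two-level port-accounting sketch, not the NVL576 reference design, which the reply says will be published later):

```python
# Illustrative two-level port split: NOT the NVL576 reference design,
# just generic accounting to show how reserving uplinks reduces the
# GPU-facing port budget per rack.

SWITCH_TRAYS_PER_RACK = 9
PORTS_PER_SWITCH_TRAY = 144      # assumed, as in the NVL72 sketch above
NVLINK_PORTS_PER_GPU = 18        # assumed

total_l1_ports = SWITCH_TRAYS_PER_RACK * PORTS_PER_SWITCH_TRAY   # 1296

# If half of the L1 ports were reserved as uplinks to an L2 tier,
# only the remaining half could face GPUs:
gpu_facing_ports = total_l1_ports // 2                            # 648
gpus_supported = gpu_facing_ports // NVLINK_PORTS_PER_GPU         # 36

print(f"GPUs per rack with half the L1 ports used as uplinks: {gpus_supported}")
```

Whether an actual NVL576 build reduces GPUs per rack or adds switch capacity is exactly the kind of detail the forthcoming reference design should settle.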

Thanks for the reply. It looks like the GPU-to-L1-switch-tray ratio will need to be 72:18 instead of the 72:9 used in NVL72. Do you do this by reducing GPUs per rack to 36 and keeping 9 switch trays, or do GPUs per rack stay at 72 with 9 more NVLink Switch trays added to make uplink ports available? Would space and power be available for the full 72 GPUs with 18 switch trays?

Will the fully NVLink-connected NVL576 be based on NVL72 or NVL36?

Can you provide a link to the diagnostic tooling documentation for GB200 NVL72?