Nvidia GB200 NVL72

MattMcLoughlin · November 27, 2024, 7:54pm

Hi,

How many GPUs can be accessed from a single process on the Nvidia GB200 NVL72? Can a single process access all 72? I’ve written some test code for this - can someone from NVIDIA please run this on one of these machines?

curl -sSL https://raw.githubusercontent.com/MattMcL4475/gpu/refs/heads/main/test_gpus.py | python3 -

Thank you!
Matt

MarkusHoHo · November 28, 2024, 9:12am

Hi @MattMcLoughlin and welcome to the NVIDIA developer forums.

I am afraid we cannot download this kind of content and run it on one of our internal GB200s.

And since GB200s are currently only supplied through cloud and service providers, I suggest you contact their support directly.

In terms of implementation, you should certainly be able to access all GPUs from one process. That is the whole idea of NVLink. Sorry to quote marketing, but:

The GB200 NVL72 is a liquid-cooled, rack-scale solution that boasts a 72-GPU NVLink domain that acts as a single massive GPU.

How you parallelize or distribute workloads is completely up to the implementation.

MattMcLoughlin · November 30, 2024, 5:33am

Hi Markus, thanks for the reply. NVLink will ensure memory access, but more specifically, can a single process launch a kernel on all 72? Are there any technical documents besides the marketing collateral that might confirm this?

I wrote the simplest possible code to test this using PyTorch: here is the GitHub link.

Can you confirm that this same code (or equivalent) would successfully execute matmul on all 72? That would definitively answer the question.

Thank you,
Matt
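
A check of the kind described above can be sketched as follows. This is a minimal illustration, not the linked script: the function name matmul_on_visible_gpus and the matrix size are assumptions made for the example. It only reports the CUDA devices visible to the one process that runs it, so it says nothing about GPUs attached to other nodes in the rack.

```python
try:
    import torch  # optional: without PyTorch the sketch reports zero GPUs
except ImportError:
    torch = None


def matmul_on_visible_gpus(n=1024):
    """Run one n x n matmul on every CUDA device visible to this process.

    Returns a list of (device_index, device_name, ok) tuples. The list is
    empty when PyTorch or CUDA is unavailable.
    """
    if torch is None or not torch.cuda.is_available():
        return []
    results = []
    for idx in range(torch.cuda.device_count()):
        device = torch.device(f"cuda:{idx}")
        a = torch.randn(n, n, device=device)
        b = torch.randn(n, n, device=device)
        c = a @ b  # the kernel is launched on the device that holds a and b
        torch.cuda.synchronize(device)  # wait until the kernel has finished
        ok = c.shape == (n, n) and bool(torch.isfinite(c).all())
        results.append((idx, torch.cuda.get_device_name(idx), ok))
    return results


if __name__ == "__main__":
    results = matmul_on_visible_gpus()
    print(f"GPUs visible to this process: {len(results)}")
    for idx, name, ok in results:
        print(f"  cuda:{idx} {name}: {'ok' if ok else 'FAILED'}")
```

The number printed is whatever torch.cuda.device_count() returns inside that process, which is the quantity the question is really about.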

avnf · April 13, 2025, 6:16am

A GB200 NVL72 rack consists of 18 servers with 4 GPUs each (4 × 18 = 72):

github.com/mlcommons/inference_results_v5.0

closed/NVIDIA/systems/GB200-NVL72_GB200-186GB_aarch64x72_TRT_Triton.json

          {
              "accelerator_frequency": "",
              "accelerator_host_interconnect": "NVLink-C2C",
              "accelerator_interconnect": "5th-gen NVLink andNVLink Switches",
              "accelerator_interconnect_topology": "",
              "accelerator_memory_capacity": "186 GB",
              "accelerator_memory_configuration": "HBM3e",
              "accelerator_model_name": "NVIDIA GB200",
              "accelerator_on-chip_memories": "",
              "accelerators_per_node": 4,
              "boot_firmware_version": "",
              "cooling": "Mix of Liquid-cooled (for CPU, GPU, CX7, NVLink Swithces) and Air-cooled (for the rest of the components)",
              "disk_controllers": "NVMe",
              "disk_drives": "SSD",
              "division": "closed",
              "filesystem": "",
              "framework": "TensorRT 10.8, CUDA 12.8",
              "host_memory_capacity": "960 GB",
              "host_memory_configuration": "for each 4-GPU node: 1.3TB LPDDR5x",
              "host_network_card_count": "for each 4-GPU node: 4x Mellanox MT43244 BlueField-3 [ConnectX-7 Lx], 1x Intel I210 Gigabit Ethernet",

"accelerators_per_node": 4,
"number_of_nodes": 18,

The parallelism used is PP2TP2 (2 × 2 = 4 GPUs, matching the 4 GPUs per node). See closed/NVIDIA/configs/llama3_1-405b/Server/__init__.py in mlcommons/inference_results_v5.0 (main branch) on GitHub:
class GB200_NVL_186GB_ARM_TP2PP2x18_Triton(GB200_NVL_186GB_ARM_TP2PP2x1_Triton):
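
The arithmetic behind these numbers can be sketched in a few lines. The snippet uses only the two JSON fields and the class-name tag quoted above. The helper names are hypothetical, and reading the "x18" suffix as 18 replicas of a TP2/PP2 instance is an assumption based on the naming, not something the repository is quoted as stating.

```python
import json
import re

# Only the two fields quoted in the thread; every other field is omitted.
system_desc = json.loads('{"accelerators_per_node": 4, "number_of_nodes": 18}')


def total_accelerators(desc):
    """GPUs in the whole system: GPUs per node times number of nodes."""
    return desc["accelerators_per_node"] * desc["number_of_nodes"]


def parse_parallelism(tag):
    """Split a tag such as 'TP2PP2x18' into (tp, pp, replicas).

    Assumption: the 'x18' suffix counts identical model instances.
    """
    match = re.fullmatch(r"TP(\d+)PP(\d+)x(\d+)", tag)
    if match is None:
        raise ValueError(f"unrecognized parallelism tag: {tag!r}")
    return tuple(int(group) for group in match.groups())


tp, pp, replicas = parse_parallelism("TP2PP2x18")
gpus_per_instance = tp * pp  # tensor parallel degree times pipeline parallel degree

print(total_accelerators(system_desc))  # 72
print(gpus_per_instance)                # 4
print(gpus_per_instance * replicas)     # 72
```

Under that reading, each model instance spans the 4 GPUs of one node, and 18 such instances cover the rack.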

Multinode instructions: closed/NVIDIA/code/harness/harness_triton_llm/README.md in mlcommons/inference_results_v5.0 (commit 42a6085696e9a0a3c339dea4786aacf70678bd61) on GitHub.

Azure used the Ring algorithm in the NCCL tests of their GB200 systems, and set Ring explicitly for 4 nodes:
AI-benchmarking-guide/Azure_Results/ND_GB200_v6_results.md in Azure/AI-benchmarking-guide (main branch) on GitHub
AI-benchmarking-guide/Benchmarks/NVIDIA/NCCLBandwidth.py in Azure/AI-benchmarking-guide (main branch) on GitHub
