Boosting Llama 3.1 405B Throughput by Another 1.5x on NVIDIA H200 Tensor Core GPUs and NVLink Switch

Originally published at: https://developer.nvidia.com/blog/boosting-llama-3-1-405b-throughput-by-another-1-5x-on-nvidia-h200-tensor-core-gpus-and-nvlink-switch/
The continued growth of LLM capabilities, fueled by increasing parameter counts and support for longer contexts, has led to their use in a wide variety of applications, each with diverse deployment requirements. For example, a chatbot supports a small number of users at very low latencies for good interactivity. Meanwhile, synthetic data generation requires high…
I believe Table 1 has a typo: the minimum latency values for TP and PP appear to be swapped. Could you please check and fix it?