Originally published at: https://developer.nvidia.com/blog/low-latency-inference-chapter-1-up-to-1-9x-higher-llama-3-1-performance-with-medusa-on-nvidia-hgx-h200-with-nvlink-switch/
As large language models (LLMs) continue to grow in size and complexity, multi-GPU compute is a must-have to deliver the low latency and high throughput that real-time generative AI applications demand. Performance depends both on the ability of the combined GPUs to process requests as "one mighty GPU" with ultra-fast GPU-to-GPU communication, and on advanced software…