Originally published at: https://developer.nvidia.com/blog/nvidia-nvlink-and-nvidia-nvswitch-supercharge-large-language-model-inference/
Large language models (LLMs) are getting larger, increasing the amount of compute required to process inference requests. To meet real-time latency requirements for serving today’s LLMs, and to do so for as many users as possible, multi-GPU compute is a must. Low latency improves the user experience. High throughput reduces the cost of service. Both are simultaneously important.
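To make the "multi-GPU is a must" point concrete, here is a rough back-of-the-envelope sketch (not from the original post). It assumes token generation (decode) is memory-bandwidth bound, so the ideal per-token time is roughly the model's weight bytes divided by the aggregate memory bandwidth reading them. The model size (70B parameters) and per-GPU bandwidth figure are illustrative assumptions, not values from the article:

```python
# Back-of-the-envelope decode-latency estimate.
# Assumption: decode is memory-bandwidth bound, so per-token time
# ~= weight bytes read / aggregate HBM bandwidth across GPUs.
# All constants below are illustrative, not from the original post.

PARAMS = 70e9              # assumed model size: 70B parameters
BYTES_PER_PARAM = 2        # FP16/BF16 weights
GPU_BANDWIDTH = 3.35e12    # assumed per-GPU HBM bandwidth, bytes/s (H100-class)

def ideal_tokens_per_second(num_gpus: int) -> float:
    """Upper-bound decode rate with weights sharded across num_gpus
    (tensor parallel), ignoring all communication overhead."""
    weight_bytes = PARAMS * BYTES_PER_PARAM
    return (GPU_BANDWIDTH * num_gpus) / weight_bytes

for n in (1, 2, 4, 8):
    rate = ideal_tokens_per_second(n)
    print(f"{n} GPU(s): ~{rate:.0f} tokens/s per request (upper bound)")
```

Two things stand out even in this idealized model: at FP16, a 70B model's weights (~140 GB) do not fit in a single GPU's memory in the first place, and the linear scaling shown here ignores the inter-GPU communication required by tensor parallelism, which is precisely the traffic that a fast interconnect must carry.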