Low Latency Inference Chapter 2: Blackwell is Coming. NVIDIA GH200 NVL32 with NVLink Switch Gives Signs of Big Leap in Time to First Token Performance

Originally published at: https://developer.nvidia.com/blog/low-latency-inference-chapter-2-blackwell-is-coming-nvidia-gh200-nvl32-with-nvlink-switch-gives-signs-of-big-leap-in-time-to-first-token-performance/

Many of the most exciting applications of large language models (LLMs), such as interactive speech bots, coding co-pilots, and search, need to begin responding to user queries quickly to deliver positive user experiences. The time that it takes for an LLM to ingest a user prompt (and context, which can be sizable) and begin outputting…
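To make the metric concrete, here is a minimal sketch of how time to first token (TTFT) can be measured against a streaming endpoint. The `measure_ttft` helper and the `fake_llm_stream` stand-in are hypothetical illustrations, not part of any NVIDIA tooling; in practice you would swap the stand-in for your actual streaming client.

```python
import time
from typing import Iterable, Tuple

def measure_ttft(stream: Iterable[str]) -> Tuple[float, str]:
    """Return (seconds until the first token arrives, the first token itself)."""
    start = time.perf_counter()
    first_token = next(iter(stream))   # blocks until the endpoint emits its first token
    return time.perf_counter() - start, first_token

# Hypothetical stand-in for a real streaming LLM endpoint (replace with your client).
def fake_llm_stream(prompt: str) -> Iterable[str]:
    time.sleep(0.25)                   # simulated prefill of the prompt and context
    yield "Hello"                      # first token: this ends the TTFT window
    for tok in [",", " world", "!"]:
        time.sleep(0.02)               # simulated per-token decode latency
        yield tok

if __name__ == "__main__":
    ttft, tok = measure_ttft(fake_llm_stream("Summarize this document"))
    print(f"TTFT: {ttft * 1000:.1f} ms (first token: {tok!r})")
```

The key point the sketch illustrates is that TTFT is dominated by prompt ingestion (prefill), which is why it grows with context length and benefits from the larger, NVLink-connected GPU domains discussed in this post.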