NVIDIA TensorRT-LLM Supercharges Large Language Model Inference on NVIDIA H100 GPUs

Originally published at: NVIDIA TensorRT-LLM Supercharges Large Language Model Inference on NVIDIA H100 GPUs | NVIDIA Technical Blog

Large language models offer incredible new capabilities, expanding the frontier of what is possible with AI. But their large size and unique execution characteristics can make them difficult to use in cost-effective ways. NVIDIA has been working closely with leading companies, including Meta, Anyscale, Cohere, Deci, Grammarly, Mistral AI, MosaicML, now a part of Databricks,…


Beyond the broad range of supported LLMs, this is an interesting read on inference performance optimization techniques such as tensor parallelism, in-flight batching, the new FP8 quantization format, and Hopper Transformer Engine support.
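For readers new to in-flight batching (also called continuous batching), here is a minimal conceptual sketch in plain Python. It is not TensorRT-LLM code; the `Request` class, the batch-size limit, and the simulated decode step are illustrative assumptions, intended only to show how finished sequences leave the batch and waiting requests are admitted between decode steps instead of the whole batch draining first.

```python
from collections import deque
from dataclasses import dataclass
import random

# Conceptual sketch of in-flight (continuous) batching -- NOT TensorRT-LLM code.
# After each decode step, finished requests are evicted and queued requests are
# admitted immediately, so batch slots never sit idle waiting for the longest
# sequence in a static batch to finish.

@dataclass
class Request:
    rid: int
    max_new_tokens: int
    generated: int = 0

MAX_BATCH = 4  # illustrative batch-size limit
waiting = deque(Request(i, random.randint(2, 8)) for i in range(10))
running: list[Request] = []

step = 0
while waiting or running:
    # Admit new requests into free slots before every decode step.
    while waiting and len(running) < MAX_BATCH:
        running.append(waiting.popleft())

    # One "decode step": every running request produces one token.
    for req in running:
        req.generated += 1

    # Evict requests that hit their stop condition; their slots free up now.
    finished = [r for r in running if r.generated >= r.max_new_tokens]
    running = [r for r in running if r.generated < r.max_new_tokens]

    step += 1
    if finished:
        print(f"step {step}: finished {[r.rid for r in finished]}, "
              f"running {[r.rid for r in running]}, waiting {len(waiting)}")
```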


Very interesting! I can't wait to see the performance improvements from FlashAttention and masked multi-head attention in real-world cases.

Exciting stuff, can't wait to try this out with the Bloomreach team!

Will this offer any increase in inference speed on consumer cards like the 4090 or 3090?

Hi John, yes, TensorRT-LLM will increase performance on consumer cards. We're working to publish final numbers for GeForce. Many of the capabilities, like optimized kernels, pre- and post-processing steps, and caching algorithms, will increase performance on all NVIDIA GPUs. However, consumer cards and data center cards have different capabilities, such as the Transformer Engine, the amount and type of memory, and multi-GPU connectivity. Therefore, the specific performance gain will differ between the various cards and depend on many factors.
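As a rough way to see which of these capabilities a given card exposes, the sketch below uses PyTorch to report compute capability, memory, and SM count. The compute-capability threshold used for the FP8 remark (8.9 for Ada, 9.0 for Hopper) is an assumption added for illustration, not something from the post, and it does not capture multi-GPU connectivity (NVLink vs. PCIe), which also matters.

```python
import torch

# Quick check of per-GPU properties that affect TensorRT-LLM speedups.
# The 8.9 compute-capability threshold below is an illustrative assumption.
for idx in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(idx)
    cc = (props.major, props.minor)
    print(f"GPU {idx}: {props.name}")
    print(f"  compute capability : {props.major}.{props.minor}")
    print(f"  total memory       : {props.total_memory / 1024**3:.1f} GiB")
    print(f"  multiprocessors    : {props.multi_processor_count}")
    # FP8 kernels generally require Ada (8.9) or Hopper (9.0) class hardware.
    print(f"  FP8-capable (assumed): {cc >= (8, 9)}")
```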