NVIDIA TensorRT-LLM Supercharges Large Language Model Inference on NVIDIA H100 GPUs

Originally published at: NVIDIA TensorRT-LLM Supercharges Large Language Model Inference on NVIDIA H100 GPUs | NVIDIA Technical Blog

Large language models offer incredible new capabilities, expanding the frontier of what is possible with AI. But their large size and unique execution characteristics can make them difficult to use in cost-effective ways. NVIDIA has been working closely with leading companies, including Meta, Anyscale, Cohere, Deci, Grammarly, Mistral AI, MosaicML, now a part of Databricks,…


Beyond the wide range of supported LLMs, this is an interesting read on inference performance optimization techniques like tensor parallelism, in-flight batching, FP8 quantization, and Hopper Transformer Engine support.
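To make the in-flight batching idea concrete, here is a minimal, hypothetical Python sketch (not the TensorRT-LLM API): finished sequences leave the batch between decode steps and queued requests immediately take their slots, instead of the whole batch waiting for its slowest member.

```python
from collections import deque

def inflight_batching(requests, max_batch, steps_needed):
    """Toy scheduler illustrating in-flight (continuous) batching.

    requests:     ordered request ids
    max_batch:    max sequences decoded together per step
    steps_needed: id -> number of decode steps that request takes
    Returns the batch composition at each decode step.
    (Hypothetical helper for illustration, not the TensorRT-LLM API.)
    """
    queue = deque(requests)   # pending request ids
    active = {}               # id -> decode steps remaining
    timeline = []             # which ids ran at each step
    while queue or active:
        # Admit new requests into any free slots between decode steps.
        while queue and len(active) < max_batch:
            rid = queue.popleft()
            active[rid] = steps_needed[rid]
        timeline.append(sorted(active))
        # One decode step for every active sequence.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]   # slot freed mid-flight
    return timeline
```

For example, with a batch size of 2 and requests needing 1, 3, and 2 decode steps, in-flight batching finishes in 3 steps, whereas static batching would take 5 (the first batch is held for 3 steps by its longest request, then the third request runs alone for 2 more).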


Very interesting! I can’t wait to see the performance improvements in FlashAttention and masked multi-head attention in real cases.

Exciting stuff, can’t wait to try this out with the Bloomreach team!

Will this offer any increase to inference speeds on consumer cards like 4090 or 3090?

Hi John, yes, TensorRT-LLM will increase performance on consumer cards, and we're working to publish final numbers for GeForce. Many of the capabilities, like optimized kernels, pre- and post-processing steps, and caching algorithms, will increase performance on all NVIDIA GPUs. However, consumer cards and data center cards differ in capabilities such as the Transformer Engine, the amount and type of memory, and multi-GPU connectivity. The specific performance gain will therefore vary between cards and depend on many factors.