Originally published at: NVIDIA TensorRT-LLM Enhancements Deliver Massive Large Language Model Speedups on NVIDIA H200 | NVIDIA Technical Blog
Large language models (LLMs) have seen dramatic growth over the last year, and delivering a great user experience depends on both high compute throughput and large amounts of high-bandwidth memory. NVIDIA TensorRT-LLM provides optimizations for both peak throughput and memory usage, delivering massive improvements in LLM inference performance. The latest TensorRT-LLM enhancements…