Boosting Llama 3.1 405B Performance by up to 1.44x with NVIDIA TensorRT Model Optimizer on NVIDIA H200 GPUs

Originally published at: https://developer.nvidia.com/blog/boosting-llama-3-1-405b-performance-by-up-to-44-with-nvidia-tensorrt-model-optimizer-on-nvidia-h200-gpus/

The Llama 3.1 405B large language model (LLM), developed by Meta, is an open-source community model that delivers state-of-the-art performance and supports a variety of use cases. With 405 billion parameters and support for context lengths of up to 128K tokens, Llama 3.1 405B is also one of the most demanding LLMs to run. To…
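To get a feel for why a 405B-parameter model is so demanding, here is a rough back-of-the-envelope estimate of weight memory at different precisions. The bytes-per-parameter figures are standard for these formats; everything else (ignoring KV cache, activations, and runtime overhead) is only an illustration, not a statement of the article's measured numbers:

```python
# Rough weight-memory estimate for Llama 3.1 405B at different precisions.
# Ignores KV cache, activations, and runtime overhead -- this is only a
# sketch of why lower-precision formats matter for a model this large.
PARAMS = 405e9  # 405 billion parameters

BYTES_PER_PARAM = {
    "FP16/BF16": 2.0,  # 16-bit floating point
    "FP8": 1.0,        # 8-bit floating point
    "INT4": 0.5,       # 4-bit integer (two params per byte)
}

def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Return approximate weight memory in gigabytes."""
    return params * bytes_per_param / 1e9

for fmt, nbytes in BYTES_PER_PARAM.items():
    print(f"{fmt}: ~{weight_memory_gb(PARAMS, nbytes):,.0f} GB of weights")
```

At FP16/BF16 the weights alone come to roughly 810 GB, which already exceeds a single H200's memory and is one reason quantization tooling like TensorRT Model Optimizer matters here.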

Accessibility:

Speed is not the only thing that matters for LLMs. The other thing I hope to see from NVIDIA is how to make this technology more accessible, especially since it requires a lot of RAM; currently, some of the best laptops from Lenovo or HP with a 4090 max out at 96 GB of RAM. Maybe M.2 modules that act like RAM over PCIe Gen 5 could provide 256 or 512 GB and fit in the second M.2 slot.

The other thing I want to see is prebuilt Kubernetes setups: if you want to train a model, don't care about how long it takes, and are just learning, you could have multiple devices on the network work as one.
That would help anyone who has multiple machines, such as several gaming computers.
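The "multiple machines working as one" idea is essentially data-parallel training: each machine computes a gradient on its own shard of the data, and the gradients are averaged before every update. A minimal single-process sketch of that core step, using a toy linear model and made-up data (a real multi-node setup would use something like PyTorch DDP or Horovod on Kubernetes, not this):

```python
# Toy sketch of data-parallel gradient averaging. Each "worker" holds a
# shard of the data and computes a local gradient; averaging the gradients
# plays the role of the all-reduce that real frameworks run over the network.
# Model: y = w * x with mean-squared-error loss; all data is made up.

def local_gradient(w, shard):
    """Gradient of MSE loss, d/dw mean((w*x - y)^2), on one worker's shard."""
    n = len(shard)
    return sum(2 * (w * x - y) * x for x, y in shard) / n

def allreduce_mean(grads):
    """Stand-in for an all-reduce across machines: average per-worker grads."""
    return sum(grads) / len(grads)

# Data following y = 3x, split across two "machines" on the network.
shards = [
    [(1.0, 3.0), (2.0, 6.0)],   # worker 0's shard
    [(3.0, 9.0), (4.0, 12.0)],  # worker 1's shard
]

w = 0.0   # initial weight
lr = 0.02 # learning rate
for _ in range(200):
    grads = [local_gradient(w, s) for s in shards]
    w -= lr * allreduce_mean(grads)

print(f"learned w = {w:.3f}")  # converges toward the true slope, 3.0
```

Because the averaged gradient equals the gradient over the combined data (when shards are equal-sized), the cluster behaves like one big machine, which is exactly the appeal of the prebuilt Kubernetes setup described above.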