Post-Training Quantization of LLMs with NVIDIA NeMo and NVIDIA TensorRT Model Optimizer

Originally published at: https://developer.nvidia.com/blog/post-training-quantization-of-llms-with-nvidia-nemo-and-nvidia-tensorrt-model-optimizer/

As large language models (LLMs) grow ever larger, the cost of serving them rises, making easy-to-use and efficient deployment paths increasingly important. One way to reduce this cost is to apply post-training quantization (PTQ), a set of techniques that reduce the computational and memory requirements of serving trained…
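
To make the idea concrete, here is a minimal sketch of the kind of precision reduction PTQ performs. It is plain PyTorch, not the NeMo or TensorRT Model Optimizer API: symmetric per-tensor INT8 quantization of a weight matrix, shrinking storage 4x relative to FP32 at the cost of a small rounding error.

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric per-tensor INT8 quantization: w ~= scale * w_q."""
    scale = w.abs().max() / 127.0  # map the largest magnitude onto the INT8 range
    w_q = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)
    return w_q, scale

def dequantize(w_q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return w_q.float() * scale

w = torch.randn(4096, 4096)  # a weight matrix of the size found in LLM layers
w_q, scale = quantize_int8(w)

print(f"FP32 size: {w.numel() * 4 / 2**20:.1f} MiB")    # 64.0 MiB
print(f"INT8 size: {w_q.numel() * 1 / 2**20:.1f} MiB")  # 16.0 MiB
print(f"max abs rounding error: {(w - dequantize(w_q, scale)).abs().max():.4f}")
```

Production PTQ frameworks refine this basic recipe, for example with per-channel scales, calibration data to set ranges, and lower-precision formats such as FP8 or INT4, but the core trade of precision for memory and compute is the same.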