Sparsity in INT8: Training Workflow and Best Practices for NVIDIA TensorRT Acceleration

jwitsoe · May 26, 2023, 4:58pm

Originally published at: https://developer.nvidia.com/blog/sparsity-in-int8-training-workflow-and-best-practices-for-tensorrt-acceleration/

The training stage of deep learning (DL) models consists of learning numerous dense floating-point weight matrices, which results in a massive amount of floating-point computations during inference. Research has shown that many of those computations can be skipped by forcing some weights to be zero, with little impact on the final accuracy. In parallel to…

Related topics

Topic	Category (tags)	Replies	Views	Activity
Achieving FP32 Accuracy for INT8 Inference Using Quantization Aware Training with NVIDIA TensorRT	Technical Blog	1	892	December 3, 2023
Accelerating Inference with Sparsity Using the NVIDIA Ampere Architecture and NVIDIA TensorRT	Technical Blog	13	2982	June 2, 2023
Improving INT8 Accuracy Using Quantization Aware Training and the NVIDIA Transfer Learning Toolkit	Technical Blog	0	504	August 25, 2020
Structured sparsity not working with explicit quantization	TensorRT (tensorrt)	5	1053	March 31, 2022
Accelerating Quantized Networks with the NVIDIA QAT Toolkit for TensorFlow and NVIDIA TensorRT	Technical Blog	0	407	June 16, 2022
TensorRT the inference is slow for the QAT model comparing to the PTQ case	Jetson AGX Xavier (tensorrt, nvbugs)	19	1764	January 16, 2023
Sparsity does not provide any speedup for TensorRT on DLA	Jetson AGX Orin (cudnn)	6	1093	January 22, 2024
Post-Training Quantization (PTQ) for semantic segmentation model running on Jetson Orin NX	Jetson Orin NX (tensorrt)	24	517	March 26, 2025
Problem with structured sparsity and explicit quantization (PTQ) on Tiny-Yolov7	TensorRT	5	847	May 26, 2023
Int8 TensorCores for Jetson	Jetson AGX Xavier (tensorrt)	7	1388	April 26, 2023