[Hugging Face transformer models + pytorch_quantization] PTQ int8 quantization is slower than fp16

Thank you. I fixed the issue and packaged it as a library: ELS-RD/transformer-deploy (https://github.com/ELS-RD/transformer-deploy), an efficient, scalable and enterprise-grade CPU/GPU inference server for 🤗 Hugging Face transformer models 🚀.
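
For readers landing on this thread: below is a minimal sketch of the pytorch_quantization PTQ calibration flow the title refers to (calibrate activation ranges, then enable int8 fake-quant). The model name and calibration texts are placeholders for illustration, not the thread's actual setup.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from pytorch_quantization import nn as quant_nn
from pytorch_quantization import quant_modules

# Monkey-patch torch.nn layers (Linear, etc.) with quantized equivalents.
# This must run BEFORE the model is instantiated.
quant_modules.initialize()

model_name = "bert-base-uncased"  # placeholder model, not the thread's setup
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).cuda().eval()

# Step 1: put every TensorQuantizer into calibration mode
# (collect activation statistics, no fake-quant yet).
for module in model.modules():
    if isinstance(module, quant_nn.TensorQuantizer):
        if module._calibrator is not None:
            module.disable_quant()
            module.enable_calib()
        else:
            module.disable()

# Step 2: run representative data through the model to gather statistics.
calibration_texts = ["placeholder calibration sentence"] * 8  # use real data
with torch.no_grad():
    for text in calibration_texts:
        inputs = tokenizer(text, return_tensors="pt").to("cuda")
        model(**inputs)

# Step 3: compute scales (amax) from the statistics and enable int8 fake-quant.
for module in model.modules():
    if isinstance(module, quant_nn.TensorQuantizer):
        if module._calibrator is not None:
            module.load_calib_amax()
            module.enable_quant()
            module.disable_calib()
        else:
            module.enable()
```

The quantized model then still needs to be exported (e.g. to ONNX/TensorRT) before int8 actually pays off at inference time; the library linked above wraps that end-to-end flow.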