YOLOv5 TensorRT inference speed differs little between INT8 and FP16


I am using the YOLOv5 model from GitHub - ultralytics/yolov5: YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite,
and the code from tensorrtx/yolov5 at master · wang-xinyu/tensorrtx · GitHub to convert the PyTorch model to .wts and then convert it to an FP16 or INT8 TensorRT model.
However, the inference speed and memory consumption of the FP16 and INT8 models are nearly the same. Unlike the obvious gap between the FP32 and FP16 models, going from FP16 to INT8 gives only about a 10% improvement.
I tried different cards (2080 Ti and T4) and different models (YOLOv5 and a simple classification model), and the results are the same.


TensorRT Version: 7.0
GPU Type: 2080 Ti / T4
Nvidia Driver Version:
CUDA Version: 10.0
CUDNN Version: 7.6
Operating System + Version: Ubuntu 16.04
Python Version (if applicable): 3.7
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 1.7
Baremetal or Container (if container which image + tag):

Relevant Files


convert code:

Steps To Reproduce

  1. Download a YOLOv5 model (any version is OK) from Releases · ultralytics/yolov5 · GitHub

  2. Follow the sections "How to Run, yolov5s as example" and "INT8 Quantization" in tensorrtx/yolov5 at master · wang-xinyu/tensorrtx · GitHub

  3. Serialize the INT8 and FP16 models, run them, and compare their speed
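
For the speed comparison in the last step, it may help to time both engines with the same harness: a few untimed warm-up runs, then many timed runs, reporting the median. The following is only a minimal sketch under assumptions, not the code used for the numbers above. `run_fp16` and `run_int8` are hypothetical stand-ins for whatever function executes one inference; for GPU work that function has to block until the result is ready, otherwise only the launch time is measured.

```python
import statistics
import time


def median_latency_ms(infer, warmup=10, runs=100):
    """Return the median wall-clock latency of infer() in milliseconds."""
    for _ in range(warmup):          # warm-up runs are not timed
        infer()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        infer()                      # must block until the result is ready
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def speedup_percent(baseline_ms, candidate_ms):
    """Relative latency reduction of candidate versus baseline, in percent."""
    return (baseline_ms - candidate_ms) / baseline_ms * 100.0


if __name__ == "__main__":
    # Hypothetical placeholders: replace with calls into the FP16 / INT8 engines.
    def run_fp16():
        time.sleep(0.002)

    def run_int8():
        time.sleep(0.002)

    fp16_ms = median_latency_ms(run_fp16)
    int8_ms = median_latency_ms(run_int8)
    print(f"FP16 {fp16_ms:.2f} ms, INT8 {int8_ms:.2f} ms, "
          f"gain {speedup_percent(fp16_ms, int8_ms):.1f}%")
```

Using the median rather than the mean keeps a few slow outlier runs from skewing the comparison between the two precisions.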


Could you please try the latest TensorRT version, 8.4.3?
Please also share the ONNX model and the trtexec --verbose logs so that we can debug this better.
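
As a rough sketch of how the two verbose logs could be produced, the small helper below assembles one trtexec command line per precision mode and redirects the output to a log file. The flags `--onnx`, `--fp16`, `--int8` and `--verbose` are standard trtexec options; the file names `yolov5s.onnx` and `trtexec_<mode>.log` are hypothetical and only for illustration.

```python
import shlex


def trtexec_command(onnx_path, precision, log_path):
    """Build a trtexec command line for one precision mode ('fp16' or 'int8')."""
    if precision not in ("fp16", "int8"):
        raise ValueError("precision must be 'fp16' or 'int8'")
    args = [
        "trtexec",
        f"--onnx={onnx_path}",   # ONNX model to build the engine from
        f"--{precision}",        # allow FP16 or INT8 kernels
        "--verbose",             # detailed build and per-layer logs
    ]
    # Redirect stdout and stderr into a log file that can be attached to the thread.
    return " ".join(shlex.quote(a) for a in args) + f" > {shlex.quote(log_path)} 2>&1"


if __name__ == "__main__":
    # "yolov5s.onnx" is a hypothetical file name for the exported model.
    for mode in ("fp16", "int8"):
        print(trtexec_command("yolov5s.onnx", mode, f"trtexec_{mode}.log"))
```

As far as I know, an INT8 build without a calibration cache (`--calib`) uses placeholder dynamic ranges, so such a run is suitable for comparing speed but not accuracy.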

Thank you.