Why does INT8 quantization occupy more GPU memory than FP16? (TensorRT quantization)


Description

When I quantize the same model with TensorRT, the INT8 engine is smaller on disk than the FP16 engine (sizes below), but during inference the INT8 engine occupies more GPU memory than the FP16 one. Why does this happen?


TensorRT Version:
GPU Type: NVIDIA GeForce RTX 3090
Nvidia Driver Version: 515.65.01
CUDA Version: 11.3
CUDNN Version: 8.4
Operating System + Version: CentOS 7
Python Version (if applicable): 3.8
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 1.11
Baremetal or Container (if container which image + tag):

Relevant Files

Please attach or include links to any models, data, files, or scripts necessary to reproduce your issue. (Github repo, Google Drive, Dropbox, etc.)

Steps To Reproduce

Please include:

  • Exact steps/commands to build your repro
  • Exact steps/commands to run your repro
  • Full traceback of errors encountered

Model size

int8      8.1 MB
float16  15.4 MB
float32  29.9 MB
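The on-disk sizes above roughly match the expected per-weight storage ratios (about 4x and 2x smaller than FP32). A small sketch of that arithmetic, using only the numbers reported above:

```python
# On-disk engine sizes reported above, in MB.
sizes_mb = {"int8": 8.1, "float16": 15.4, "float32": 29.9}

# Compression ratio of each engine relative to float32.
ratios = {p: round(sizes_mb["float32"] / s, 2) for p, s in sizes_mb.items()}
print(ratios)  # {'int8': 3.69, 'float16': 1.94, 'float32': 1.0}
```

Note that file size only tracks weight precision; runtime GPU memory additionally holds activations, workspace, and any higher-precision copies of layers that cannot run in INT8, so the smallest file does not necessarily mean the smallest GPU footprint.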

Hi, please refer to the links below to perform inference in INT8.
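One common reason an "INT8" engine can out-consume an FP16 one: layers without INT8 kernel support fall back to higher precision, so their weights are stored at the fallback precision and extra reformat layers are inserted. The toy accounting below is purely illustrative (`weight_footprint` and the layer sizes are hypothetical, not TensorRT's actual allocator), but it shows how a mostly-INT8 engine with an FP32 fallback can end up heavier than a uniform FP16 engine:

```python
BYTES_PER_PARAM = {"int8": 1, "float16": 2, "float32": 4}

def weight_footprint(layers):
    """Toy per-layer weight accounting (hypothetical helper).

    `layers` is a list of (num_params, precision) pairs. Layers that
    fall back from INT8 are charged at their fallback precision.
    """
    return sum(n * BYTES_PER_PARAM[p] for n, p in layers)

# 10M-parameter model: uniform FP16 vs INT8 with a 4M-param FP32 fallback.
fp16_engine = weight_footprint([(10_000_000, "float16")])
int8_engine = weight_footprint([(6_000_000, "int8"),
                                (4_000_000, "float32")])
print(fp16_engine, int8_engine, int8_engine > fp16_engine)
# 20000000 22000000 True
```

Real engine memory also includes activation buffers and the builder workspace, which are independent of weight precision, so comparing `nvidia-smi` numbers alone does not isolate the quantization effect.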