Description
I am trying to quantize a ConvNeXt model to INT8, but when I run inference the quantized engine is slower than my non-quantized model.
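For reference, a minimal sketch of the kind of build/comparison being done (file names are placeholders, and this assumes the model has already been exported to ONNX via torch.onnx.export):

```shell
# FP16 baseline engine for comparison
trtexec --onnx=convnext.onnx --fp16 --saveEngine=convnext_fp16.engine

# INT8 build; --fp16 is also enabled so layers without INT8 kernels
# can fall back to FP16 instead of FP32
trtexec --onnx=convnext.onnx --int8 --fp16 --saveEngine=convnext_int8.engine

# Per-layer timing, to check which layers actually ran in INT8
trtexec --loadEngine=convnext_int8.engine --dumpProfile --separateProfileRun
```

The `--dumpProfile` output is useful here: if many layers show Q/DQ reformat overhead or did not fuse into INT8 kernels, the INT8 engine can end up slower than the FP16 baseline.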
Environment
TensorRT Version: 10.5.0
GPU Type: RTX 4090
Nvidia Driver Version: 556.12
CUDA Version: 12.0 (per nvcc output below)
CUDNN Version:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Jan__6_16:45:21_PST_2023
Cuda compilation tools, release 12.0, V12.0.140
Build cuda_12.0.r12.0/compiler.32267302_0
Operating System + Version: Ubuntu 24.04.1 LTS
Python Version (if applicable): 3.11
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 2.4.1+cu118 (onnxruntime 1.16.1 is used when calling torch.onnx.export)
Baremetal or Container (if container which image + tag):
Relevant Files
quantization - Google Drive (Drive folder with the notebook, requirements file, and ONNX files to reproduce)