Quick question: when running inference on a quantized network, should the input dtype be INT8 or FP32?
In all of the example code I’ve run across, the input is always FP32, but in the NVIDIA MLPerf GitHub repo the input is INT8.
I tried both, but I haven’t seen any significant speedup either way, which is odd, since you’d expect the overhead of copying an FP32 tensor to be noticeably larger than that of an INT8 tensor.
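For reference, here’s the back-of-the-envelope size comparison behind that expectation — a minimal NumPy sketch, where the 1×3×224×224 shape is just an illustrative assumption, not the actual model’s input:

```python
# Sanity check of the host->device copy-size argument: an FP32 tensor
# occupies 4x the bytes of an INT8 tensor of the same shape.
# (1, 3, 224, 224) is a hypothetical image-input shape for illustration.
import numpy as np

shape = (1, 3, 224, 224)
fp32_input = np.zeros(shape, dtype=np.float32)  # 4 bytes per element
int8_input = np.zeros(shape, dtype=np.int8)     # 1 byte per element

print(fp32_input.nbytes)  # 602112
print(int8_input.nbytes)  # 150528
print(fp32_input.nbytes / int8_input.nbytes)  # 4.0
```

So the host-to-device copy should move 4× less data with INT8 input, which is why the lack of any measurable speedup surprises me.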
TensorRT Version: 7.2
GPU Type: T4
Nvidia Driver Version: 455.38
CUDA Version: 11.1
Operating System + Version:
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 1.6
Baremetal or Container (if container which image + tag):