The results I get on a Tesla V100-DGXS-16GB (each list is the per-run inference latency in seconds):
FP32:
Loading and serializing ONNX model...
Build engine...
Loading engine and infer...
Allocating buffers ...
[0.00536393909715116, 0.005346858990378678, 0.0053372320253401995, 0.005339278024621308, 0.005336352973245084, 0.005338195012882352, 0.005339709925465286, 0.005349859944544733, 0.005346192046999931, 0.005334021989256144]
FP16:
Loading and serializing ONNX model...
Build engine...
Loading engine and infer...
Allocating buffers ...
[0.002218535984866321, 0.002207414945587516, 0.0021951169474050403, 0.00219749310053885, 0.002194416942074895, 0.002198559930548072, 0.0021935830591246486, 0.0021958909928798676, 0.002192566986195743, 0.0021922640735283494]
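
For context, here is roughly how numbers like these are produced. This is a minimal sketch, not my exact script: it assumes the TensorRT 7/8-era Python API (builder config flags, `execute_v2`, per-binding buffers) together with PyCUDA, and the names `build_engine`, `time_inference`, `n_runs`, and `warmup` are mine, not from the benchmark.

```python
import time
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_path, fp16=False, int8=False, calibrator=None):
    """Build a TensorRT engine from an ONNX file at the requested precision."""
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError(parser.get_error(0))
    config = builder.create_builder_config()
    config.max_workspace_size = 1 << 30
    if fp16:
        config.set_flag(trt.BuilderFlag.FP16)
    if int8:
        config.set_flag(trt.BuilderFlag.INT8)
        config.int8_calibrator = calibrator  # e.g. an IInt8EntropyCalibrator2
    return builder.build_engine(network, config)

def time_inference(engine, n_runs=10, warmup=5):
    """Return per-run latencies (seconds) for synchronous inference."""
    ctx = engine.create_execution_context()
    # Allocate one device buffer per binding; assumes static input shapes.
    bindings = []
    for i in range(engine.num_bindings):
        shape = tuple(engine.get_binding_shape(i))
        dtype = trt.nptype(engine.get_binding_dtype(i))
        host = np.random.random(shape).astype(dtype)
        dev = cuda.mem_alloc(host.nbytes)
        cuda.memcpy_htod(dev, host)
        bindings.append(int(dev))
    for _ in range(warmup):           # warm-up runs are excluded from timing
        ctx.execute_v2(bindings)
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        ctx.execute_v2(bindings)      # execute_v2 is synchronous
        timings.append(time.perf_counter() - start)
    return timings
```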
INT8:
Loading and serializing ONNX model...
Build engine...
[ImageBatchStream] Processing /tmp/src/../data/binoculars.jpeg
[ImageBatchStream] Processing /tmp/src/../data/mug-cc0.jpeg
[ImageBatchStream] Processing /tmp/src/../data/canon-cc0.jpeg
[ImageBatchStream] Processing /tmp/src/../data/tabby_tiger_cat.jpg
Loading engine and infer...
Allocating buffers ...
[0.0022265249863266945, 0.002199487993493676, 0.002194601926021278, 0.002190993051044643, 0.0021940100705251098, 0.0021892209770157933, 0.0021959079895168543, 0.0022004260681569576, 0.0021929129725322127, 0.002190544968470931]
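
The [ImageBatchStream] lines above are the INT8 calibration pass, which only runs while the engine is being built. As a rough sketch of what that stage involves (the batch-stream interface with `next_batch()` and `batch_size` is a placeholder for whatever preprocessing the benchmark actually uses), an entropy calibrator looks something like this:

```python
import os
import numpy as np
import pycuda.driver as cuda
import tensorrt as trt

class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    """Feeds preprocessed calibration batches to the INT8 builder."""

    def __init__(self, batch_stream, cache_file="calibration.cache"):
        super().__init__()
        self.stream = batch_stream        # yields arrays of shape (N, C, H, W)
        self.cache_file = cache_file
        self.device_input = None

    def get_batch_size(self):
        return self.stream.batch_size

    def get_batch(self, names):
        batch = self.stream.next_batch()  # placeholder: None when exhausted
        if batch is None:
            return None                   # signals end of calibration data
        if self.device_input is None:
            self.device_input = cuda.mem_alloc(batch.nbytes)
        cuda.memcpy_htod(self.device_input, np.ascontiguousarray(batch))
        return [int(self.device_input)]

    def read_calibration_cache(self):
        # Reuse a previous calibration run if a cache file exists.
        if os.path.exists(self.cache_file):
            with open(self.cache_file, "rb") as f:
                return f.read()
        return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)
```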
As you can see, FP16 and INT8 are essentially identical: FP32 averages about 5.34 ms, while FP16 and INT8 both sit around 2.20 ms, so INT8 gives no measurable gain over FP16. The difference in your benchmark appears to be quite small as well. Is this expected? Some people have reported a significant speed-up from INT8 (e.g. the post "Accelerating Large-Scale Object Detection with TensorRT" on the NVIDIA Technical Blog).