Same inference speed for INT8 and FP16

I am currently benchmarking ResNet50 in FP32, FP16 and INT8 using the Python API of TensorRT 5 on a V100 GPU. FP32 is twice as slow as FP16, as expected. But FP16 runs at the same speed as INT8. Any idea why that would be?

I profiled my code both with timeit.default_timer and with nvprof, using synchronous execution. The nvprof profile shows that the kernels used for INT8 inference are indeed INT8 kernels. Also, the serialized engine file for INT8 is half the size of the FP16 one. So I assume the engine is properly configured for INT8 computation.
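
For reference, the timing loop is essentially the following (a simplified sketch rather than the exact script; it assumes context is the execution context of the deserialized engine, bindings is the list of pre-allocated device buffer addresses, and the batch size of 4 is illustrative):

import timeit

def time_inference(context, bindings, batch_size=4, n_runs=10):
    # context.execute() is the synchronous API: it returns only once the
    # kernels have finished, so the wall-clock delta is meaningful.
    timings = []
    for _ in range(n_runs):
        start = timeit.default_timer()
        context.execute(batch_size, bindings)
        timings.append(timeit.default_timer() - start)
    return timings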

Apart from the engine builder settings, is there a flag that needs to be set somewhere to enable INT8 computation on the GPU?

Hello,

To help us debug, could you provide a small repro package that demonstrates the performance issue you are seeing?

regards,
NVIDIA Enterprise Support

Hello,

You can find minimal source code, the ONNX model I used, and a few test images here: https://drive.google.com/open?id=1BYZmFwyIASw2ErGyUKM4egwF5B0E9Uv7

Most of it is actually copied/pasted from the official samples and from https://devblogs.nvidia.com/int8-inference-autonomous-vehicles-tensorrt/ . The ONNX model and the test images also come from the official samples.

Hello,

On a DGX-1V (32GB), I’m getting the following results:

root@13434b216637:/mnt/test/test/src# python run.py
Loading and serializing ONNX model...
Build engine...
Loading engine and infer...
Allocating buffers ...
[0.005344389006495476, 0.005318833980709314, 0.005320428987033665, 0.005316068010870367, 0.005325147008989006, 0.005304705002345145, 0.005319653020706028, 0.005300264980178326, 0.005298828997183591, 0.005295721988659352]
root@13434b216637:/mnt/test/test/src# python run.py
Loading and serializing ONNX model...
Build engine...
Loading engine and infer...
Allocating buffers ...
[0.0024232419673353434, 0.0023866459960117936, 0.0023892400204204023, 0.0023858729982748628, 0.0023826140095479786, 0.0023832329898141325, 0.002387335989624262, 0.0023838509805500507, 0.0023952049668878317, 0.002378040982875973]
root@13434b216637:/mnt/test/test/src# python run.py
Loading and serializing ONNX model...
Build engine...
[ImageBatchStream] Processing  /mnt/test/test/src/../data/binoculars.jpeg
[ImageBatchStream] Processing  /mnt/test/test/src/../data/mug-cc0.jpeg
[ImageBatchStream] Processing  /mnt/test/test/src/../data/canon-cc0.jpeg
[ImageBatchStream] Processing  /mnt/test/test/src/../data/tabby_tiger_cat.jpg
Loading engine and infer...
Allocating buffers ...
[0.002109810011461377, 0.0020720039610750973, 0.002073671028483659, 0.002074009971693158, 0.002069698995910585, 0.0020685330382548273, 0.0020681119640357792, 0.002071729046292603, 0.0020731260301545262, 0.00207544700242579]

Is this similar to what you are seeing? INT8 gives a slight performance improvement over FP16, but not as dramatic a gain as going from FP32 to FP16.

The results I get on a Tesla V100-DGXS-16GB:

FP32:

Loading and serializing ONNX model...
Build engine...
Loading engine and infer...
Allocating buffers ...
[0.00536393909715116, 0.005346858990378678, 0.0053372320253401995, 0.005339278024621308, 0.005336352973245084, 0.005338195012882352, 0.005339709925465286, 0.005349859944544733, 0.005346192046999931, 0.005334021989256144]

FP16:

Loading and serializing ONNX model...
Build engine...
Loading engine and infer...
Allocating buffers ...
[0.002218535984866321, 0.002207414945587516, 0.0021951169474050403, 0.00219749310053885, 0.002194416942074895, 0.002198559930548072, 0.0021935830591246486, 0.0021958909928798676, 0.002192566986195743, 0.0021922640735283494]

INT8:

Loading and serializing ONNX model...
Build engine...
[ImageBatchStream] Processing  /tmp/src/../data/binoculars.jpeg
[ImageBatchStream] Processing  /tmp/src/../data/mug-cc0.jpeg
[ImageBatchStream] Processing  /tmp/src/../data/canon-cc0.jpeg
[ImageBatchStream] Processing  /tmp/src/../data/tabby_tiger_cat.jpg
Loading engine and infer...
Allocating buffers ...
[0.0022265249863266945, 0.002199487993493676, 0.002194601926021278, 0.002190993051044643, 0.0021940100705251098, 0.0021892209770157933, 0.0021959079895168543, 0.0022004260681569576, 0.0021929129725322127, 0.002190544968470931]

As you can see, FP16 and INT8 are very similar. The difference in your benchmark appears to be quite small as well. Is this expected? Some people have reported a significant speed-up using INT8 (e.g. https://devblogs.nvidia.com/large-scale-object-detection-tensorrt/ )

We are triaging now and will keep you updated.

Hello,

Per engineering:

You can try setting builder.strict_type_constraints = True to force INT8 kernels. (Note: this will not improve performance and should only be used for debugging purposes.)
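
In the Python API, that would look something like this (a minimal sketch; the logger severity and the rest of the builder setup are assumptions that should be adapted to the linked sample):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(TRT_LOGGER)
builder.int8_mode = True
# Debugging aid only: require layers that have an INT8 implementation to
# actually run in INT8 instead of silently falling back to another precision.
builder.strict_type_constraints = True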

It is possible that int8 kernels are chosen with an fp32 fallback path (the performance hit is considerable for batch_size < 8).
One possible solution is to allow an fp16 fallback path as well.

Changes:
if builder.int8_mode:
    builder.fp16_mode = True

This should improve performance.
You can also turn on the logger to print the layer execution precision for all the layers. In int8 mode, it is possible that TensorRT chooses an fp32 kernel for a few layers (due to unavailable int8 kernels, or because the int8 path is slower, i.e. int8 kernel + reformat > fp32 kernel).

The above change to enable fp16_mode makes the fp16+reformat path available; TensorRT will then choose the fastest possible path.
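
Putting these suggestions together, a configuration sketch in the TensorRT 5 Python API could look like the following (onnx_path and calibrator are placeholders for the model file and the INT8 calibrator from your sample; a more verbose logger severity such as INFO, or VERBOSE where available, should make the build log report the per-layer choices):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.INFO)

def build_int8_engine(onnx_path, calibrator, batch_size=4):
    # Build an INT8 engine while also allowing FP16 as the fallback path
    # for layers where no suitable INT8 kernel is available.
    with trt.Builder(TRT_LOGGER) as builder, \
         builder.create_network() as network, \
         trt.OnnxParser(network, TRT_LOGGER) as parser:
        with open(onnx_path, 'rb') as f:
            parser.parse(f.read())
        builder.max_batch_size = batch_size
        builder.max_workspace_size = 1 << 30
        builder.int8_mode = True
        builder.int8_calibrator = calibrator
        if builder.int8_mode:
            builder.fp16_mode = True  # FP16 fallback instead of FP32
        return builder.build_cuda_engine(network)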

Thanks for your answer.

For batch 4, most of the computation time is spent in int8 kernels, whether the fallback is fp32 or fp16. And there is no speed difference between int8 and fp16 (even when int8 falls back to fp16).

For batch 32, I notice two strange things:

  1. Instructing TensorRT to fall back to fp16 yields different kernels than when falling back to fp16 for batch 4. Is that expected?
  2. Most of the computation time is spent on:
  • int8 kernels when the fallback is fp32 (the fp32 kernels' computation time is comparatively very small)
  • int8 and some fp16 kernels when the fallback is fp16

From these observations, it seems that int8 with fp32 fallback should be faster, because it mainly uses int8 kernels. But in practice, int8 with fp16 fallback is faster. Is this expected?

You can find the nvprof profiles for fp32, fp16, int8 with fp32 fallback (trt8.nvprof) and int8 with fp16 fallback (trt8-fallback.nvprof) here: https://drive.google.com/open?id=1YknajwNMbNNsI2DyPg0BUkEcP4szBT0U https://drive.google.com/open?id=1g50LK1BT7c7AvNnUiDD3fD6iC3rZz1QV

Hello,

Per engineering:

  1. Instructing TensorRT to fall back to fp16 yields different kernels than when falling back to fp16 for batch 4. Is that expected?

Yes, this is expected. Auto-tuning will choose the fastest available kernel.

  2. From these observations, it seems that int8 with fp32 fallback should be faster, because it mainly uses int8 kernels. But in practice, int8 with fp16 fallback is faster. Is this expected?

Again, TensorRT will choose kernels based on a heuristic, i.e. it picks the fastest path. One thing to consider here are the reformats, i.e. copy operations. It is possible that the int8 --> fp32 --> int8 reformat time is much larger than the int8 --> fp16 --> int8 reformat time.

When you say “fp32 kernels computation time is very small comparatively”, that implies most of the computation is int8 kernels plus reformats from int8 to fp32. It is possible that int8 + reformat from int8 to fp16 takes less time than int8 + reformat from int8 to fp32.

You mentioned “You can also turn on the logger to print the layer execution precision for all the layers.”

Could you elaborate on how to print the layer precision during inference?
I tried

builder->setDebugSync(true);
...
context->setDebugSync(true);
context->execute(...); // synchronized execution

But nothing happens. I also tried adding those lines to the sampleOnnxMNIST sample (ensuring synchronous execution), but still nothing.

Is it correct to expect these instructions to print layer execution precision? Or is there another way?