Same inference speed for INT8 and FP16

I am currently benchmarking ResNet50 in FP32, FP16 and INT8 using the Python API of TensorRT 5 on a V100 GPU. FP32 is twice as slow as FP16, as expected. But FP16 runs at the same speed as INT8. Any idea why that would be?

I profiled my code both with timeit.default_timer and with nvprof, using synchronous execution. The nvprof profile shows that the kernels used for INT8 inference are indeed INT8 kernels. Also, the serialized engine file for INT8 is half the size of the FP16 one. So I assume the engine is properly configured for INT8 computation.
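
For reference, the timing loop is essentially the following (a simplified sketch rather than the exact script; it assumes context is the execution context of the deserialized engine, bindings is the list of pre-allocated device buffer addresses, and the batch size of 4 is illustrative):

import timeit

def time_inference(context, bindings, batch_size=4, n_runs=10):
    # context.execute() is the synchronous API: it returns only once the
    # kernels have finished, so the wall-clock delta is meaningful.
    timings = []
    for _ in range(n_runs):
        start = timeit.default_timer()
        context.execute(batch_size, bindings)
        timings.append(timeit.default_timer() - start)
    return timings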

Apart from the engine builder settings, is there a flag that needs to be set somewhere to enable INT8 computation on the GPU?

Hello,

To help us debug, could you provide a small repro package that demonstrates the performance issue you are seeing?

regards,
NVIDIA Enterprise Support

Hello,

You can find minimal source code, the ONNX model I used, and a few test images here: https://drive.google.com/open?id=1BYZmFwyIASw2ErGyUKM4egwF5B0E9Uv7

Most of it is actually copied/pasted from the official samples and from https://devblogs.nvidia.com/int8-inference-autonomous-vehicles-tensorrt/ . The ONNX model and the test images also come from the official samples.

Hello,

On a DGX-1V (32GB), I’m getting the following results:

root@13434b216637:/mnt/test/test/src# python run.py
Loading and serializing ONNX model...
Build engine...
Loading engine and infer...
Allocating buffers ...
[0.005344389006495476, 0.005318833980709314, 0.005320428987033665, 0.005316068010870367, 0.005325147008989006, 0.005304705002345145, 0.005319653020706028, 0.005300264980178326, 0.005298828997183591, 0.005295721988659352]
root@13434b216637:/mnt/test/test/src# python run.py
Loading and serializing ONNX model...
Build engine...
Loading engine and infer...
Allocating buffers ...
[0.0024232419673353434, 0.0023866459960117936, 0.0023892400204204023, 0.0023858729982748628, 0.0023826140095479786, 0.0023832329898141325, 0.002387335989624262, 0.0023838509805500507, 0.0023952049668878317, 0.002378040982875973]
root@13434b216637:/mnt/test/test/src# python run.py
Loading and serializing ONNX model...
Build engine...
[ImageBatchStream] Processing  /mnt/test/test/src/../data/binoculars.jpeg
[ImageBatchStream] Processing  /mnt/test/test/src/../data/mug-cc0.jpeg
[ImageBatchStream] Processing  /mnt/test/test/src/../data/canon-cc0.jpeg
[ImageBatchStream] Processing  /mnt/test/test/src/../data/tabby_tiger_cat.jpg
Loading engine and infer...
Allocating buffers ...
[0.002109810011461377, 0.0020720039610750973, 0.002073671028483659, 0.002074009971693158, 0.002069698995910585, 0.0020685330382548273, 0.0020681119640357792, 0.002071729046292603, 0.0020731260301545262, 0.00207544700242579]

Is this similar to what you are seeing? INT8 gives a slight performance improvement over FP16, but not as dramatic a gain as going from FP32 to FP16.

The results I get on a Tesla V100-DGXS-16GB:

FP32:

Loading and serializing ONNX model...
Build engine...
Loading engine and infer...
Allocating buffers ...
[0.00536393909715116, 0.005346858990378678, 0.0053372320253401995, 0.005339278024621308, 0.005336352973245084, 0.005338195012882352, 0.005339709925465286, 0.005349859944544733, 0.005346192046999931, 0.005334021989256144]

FP16:

Loading and serializing ONNX model...
Build engine...
Loading engine and infer...
Allocating buffers ...
[0.002218535984866321, 0.002207414945587516, 0.0021951169474050403, 0.00219749310053885, 0.002194416942074895, 0.002198559930548072, 0.0021935830591246486, 0.0021958909928798676, 0.002192566986195743, 0.0021922640735283494]

INT8:

Loading and serializing ONNX model...
Build engine...
[ImageBatchStream] Processing  /tmp/src/../data/binoculars.jpeg
[ImageBatchStream] Processing  /tmp/src/../data/mug-cc0.jpeg
[ImageBatchStream] Processing  /tmp/src/../data/canon-cc0.jpeg
[ImageBatchStream] Processing  /tmp/src/../data/tabby_tiger_cat.jpg
Loading engine and infer...
Allocating buffers ...
[0.0022265249863266945, 0.002199487993493676, 0.002194601926021278, 0.002190993051044643, 0.0021940100705251098, 0.0021892209770157933, 0.0021959079895168543, 0.0022004260681569576, 0.0021929129725322127, 0.002190544968470931]

As you can see, FP16 and INT8 are very similar. The difference in your benchmark appears to be quite small as well. Is this expected? Some people have reported a significant speed-up using INT8 (e.g. https://devblogs.nvidia.com/large-scale-object-detection-tensorrt/ )

We are triaging now and will keep you updated.

Hello,

Per engineering:

You can try setting builder.strict_type_constraints = True to force INT8 kernels. (Note: this will not improve performance and should only be used for debugging purposes.)
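
In the Python API, that would look something like this (a minimal sketch; the logger severity and the rest of the builder setup are assumptions that should be adapted to the linked sample):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(TRT_LOGGER)
builder.int8_mode = True
# Debugging aid only: require layers that have an INT8 implementation to
# actually run in INT8 instead of silently falling back to another precision.
builder.strict_type_constraints = True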

It is possible that int8 kernels are chosen with an fp32 fallback path (the performance hit is considerable for batch_size < 8).
One possible solution is to allow an fp16 fallback path as well.

Changes:
if builder.int8_mode:
    builder.fp16_mode = True

This should improve performance.
You can also turn on the logger to print the layer execution precision for all the layers. In int8 mode, it is possible that TensorRT chooses an fp32 kernel for a few layers (due to unavailable int8 kernels, or because the int8 path is slower, i.e. int8 kernel + reformat > fp32 kernel).

The above change to enable fp16_mode makes the fp16+reformat path available; TensorRT will then choose the fastest possible path.
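
Putting these suggestions together, a configuration sketch in the TensorRT 5 Python API could look like the following (onnx_path and calibrator are placeholders for the model file and the INT8 calibrator from your sample; a more verbose logger severity such as INFO, or VERBOSE where available, should make the build log report the per-layer choices):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.INFO)

def build_int8_engine(onnx_path, calibrator, batch_size=4):
    # Build an INT8 engine while also allowing FP16 as the fallback path
    # for layers where no suitable INT8 kernel is available.
    with trt.Builder(TRT_LOGGER) as builder, \
         builder.create_network() as network, \
         trt.OnnxParser(network, TRT_LOGGER) as parser:
        with open(onnx_path, 'rb') as f:
            parser.parse(f.read())
        builder.max_batch_size = batch_size
        builder.max_workspace_size = 1 << 30
        builder.int8_mode = True
        builder.int8_calibrator = calibrator
        if builder.int8_mode:
            builder.fp16_mode = True  # FP16 fallback instead of FP32
        return builder.build_cuda_engine(network)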

Thanks for your answer.

For batch 4, most of the computation time is spent in int8 kernels, whether the fallback is fp32 or fp16. And there is no speed difference between int8 and fp16 (even when int8 falls back to fp16).

For batch 32, I notice two strange things:

  1. Instructing TensorRT to fall back to fp16 yields different kernels than when falling back to fp16 for batch 4. Is that expected?
  2. Most of the computation time is spent on:
  • int8 kernels when the fallback is fp32 (the fp32 kernels' computation time is comparatively very small)
  • int8 and some fp16 kernels when the fallback is fp16

From these observations, it seems that int8 with fp32 fallback should be faster, because it mainly uses int8 kernels. But in practice, int8 with fp16 fallback is faster. Is this expected?

You can find the nvprof profiles for fp32, fp16, int8 with fp32 fallback (trt8.nvprof) and int8 with fp16 fallback (trt8-fallback.nvprof) here: https://drive.google.com/open?id=1YknajwNMbNNsI2DyPg0BUkEcP4szBT0U https://drive.google.com/open?id=1g50LK1BT7c7AvNnUiDD3fD6iC3rZz1QV

Hello,

Per engineering:

  1. Instructing TensorRT to fall back to fp16 yields different kernels than when falling back to fp16 for batch 4. Is that expected?

Yes, this is expected. Auto-tuning will choose the fastest available kernel.

  2. From these observations, it seems that int8 with fp32 fallback should be faster, because it mainly uses int8 kernels. But in practice, int8 with fp16 fallback is faster. Is this expected?

Again, TensorRT will choose kernels based on a heuristic, i.e. it picks the fastest path. One thing to consider here are the reformats, i.e. copy operations. It is possible that the int8 --> fp32 --> int8 reformat time is much larger than the int8 --> fp16 --> int8 reformat time.

When you say “fp32 kernels computation time is very small comparatively”, that implies most of the computation is int8 kernels plus reformats from int8 to fp32. It is possible that int8 + reformat from int8 to fp16 takes less time than int8 + reformat from int8 to fp32.

You mentioned “You can also turn on the logger to print the layer execution precision for all the layers.”

Could you elaborate on how to print the layer precision during inference?
I tried

builder->setDebugSync(true);
...
context->setDebugSync(true);
context->execute(...); // synchronized execution

But nothing happens. I also tried adding those lines to the sampleOnnxMNIST sample (ensuring synchronous execution), but still nothing.

Is it correct to expect these instructions to print layer execution precision? Or is there another way?