Failed to use INT8 precision mode when using tf-trt on Xavier

Hi,

I failed to use the INT8 precision mode with TF-TRT on Jetson AGX Xavier. Could anyone give me some advice?

I wrote the inference code for my own TensorFlow model following this guide:
https://github.com/tensorflow/tensorrt/blob/r1.14%2B/tftrt/examples/image-classification/TF-TRT-inference-from-saved-model.ipynb

It works well with the FP32 and FP16 precision modes, but fails with the INT8 precision mode.

The error shows:
2019-12-04 16:57:59.996310: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:733] Number of TensorRT candidate segments: 2
2019-12-04 16:58:00.038247: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-12-04 16:58:00.310446: F ./tensorflow/compiler/tf2tensorrt/convert/convert_nodes.h:296] Check failed: is_weights()
Aborted (core dumped)

The error is raised by the call to trt.create_inference_graph:

trt_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph,
    outputs=output_names,
    max_batch_size=1,
    max_workspace_size_bytes=1 * (10 ** 9),  # 1 GB workspace
    precision_mode='INT8',  # 'FP32', 'FP16' or 'INT8'
    minimum_segment_size=7)
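
For reference, the rest of my inference code roughly follows the notebook: import the converted graph and run a session. This is a simplified sketch; the tensor names and input shape are placeholders, not my actual model's values.

import numpy as np
import tensorflow as tf

# Import the TF-TRT optimized GraphDef and look up the input/output tensors.
graph = tf.Graph()
with graph.as_default():
    tf.import_graph_def(trt_graph, name='')
    input_tensor = graph.get_tensor_by_name('input:0')                 # placeholder name
    output_tensor = graph.get_tensor_by_name(output_names[0] + ':0')

# Run one inference with dummy data.
sess = tf.Session(graph=graph)
dummy_input = np.random.random_sample((1, 224, 224, 3)).astype(np.float32)  # placeholder shape
result = sess.run(output_tensor, feed_dict={input_tensor: dummy_input})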

So does the TF-TRT package in JetPack 4.2.2 support the INT8 precision mode? If yes, please help me fix the problem above.

By the way, my model includes a BatchMatMul layer, which is not supported by pure TensorRT, and writing a plugin is difficult for me, so I probably won't use pure TensorRT.

SDK: JetPack 4.2.2
CUDA version: 10.0.326
Python version: 3.6.8
TensorFlow version: 1.14.0
TensorRT version: 5.1.6.1

Thanks.

Hi,

Which TensorFlow package are you using?
If you are not using our prebuilt package, would you mind giving it a try?
https://docs.nvidia.com/deeplearning/frameworks/install-tf-jetson-platform/index.html

Thanks.

Hi,

Thanks for your prompt reply!
Following your guide, I reinstalled TensorFlow 1.13.1 using the command below:

sudo pip3 install --extra-index-url https://developer.download.nvidia.com/compute/redist/jp/v42 tensorflow-gpu==1.13.1+nv19.3

And now the INT8 precision mode raises no errors! Thank you very much!

But I have run into another problem, this time with the inference time.
Preparations:
(1) I maximized the device performance with “sudo nvpmodel -m 0” and “sudo jetson_clocks”.
(2) I implemented the additional calibration step for INT8 mode following the same guide: https://github.com/tensorflow/tensorrt/blob/r1.14%2B/tftrt/examples/image-classification/TF-TRT-inference-from-saved-model.ipynb (a simplified sketch of what I did is shown below).
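
This is roughly the INT8 calibration flow I used (a minimal sketch based on the tf.contrib.tensorrt API; exact function names may differ between TensorFlow versions, and the calibration data loop is only indicated, not shown):

from tensorflow.contrib import tensorrt as trt

# Step 1: in INT8 mode, create_inference_graph first returns a calibration graph.
calib_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph,
    outputs=output_names,
    max_batch_size=1,
    max_workspace_size_bytes=1 * (10 ** 9),
    precision_mode='INT8',
    minimum_segment_size=7)

# Step 2: import calib_graph into a session and run it on a representative set of
# real images so the calibration statistics can be collected.
# (same import/run code as in my first post, just looped over the calibration data)

# Step 3: convert the calibrated graph into the final INT8 inference graph.
trt_graph = trt.calib_graph_to_infer_graph(calib_graph)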

I tested my own model (the backbone is MobileNet 25) with TF-TRT on the Xavier platform, and found that FP16 inference is indeed faster than FP32, but INT8 is much slower, which seems unreasonable.
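
For reference, I measured the average time roughly like this (a simplified sketch, reusing sess, input_tensor and output_tensor from the sketch in my first post; the warm-up and loop counts are arbitrary):

import time
import numpy as np

dummy_input = np.random.random_sample((1, 224, 224, 3)).astype(np.float32)  # placeholder shape

# Warm-up runs so one-time engine/GPU initialization is not counted.
for _ in range(20):
    sess.run(output_tensor, feed_dict={input_tensor: dummy_input})

# Timed runs.
num_runs = 100
start = time.time()
for _ in range(num_runs):
    sess.run(output_tensor, feed_dict={input_tensor: dummy_input})
print('average inference time: %.2f ms' % ((time.time() - start) / num_runs * 1000.0))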

Then I thought the parameters used to build the TensorRT engine might influence the inference speed, so I adjusted minimum_segment_size in trt.create_inference_graph and tested the FP32, FP16 and INT8 modes.

trt_graph = trt.create_inference_graph(
            input_graph_def=frozen_graph,
            outputs=output_names,
            max_batch_size=1,
            max_workspace_size_bytes=1 * (10 ** 9),
            precision_mode='INT8',
            minimum_segment_size=7)

Below are the average inference times. (The average inference time without TF-TRT is 8.38 ms.)

minimum_segment_size | TensorRT candidate segments | FP32 (ms) | FP16 (ms) | INT8 (ms)
10                   | 2                           | 7.73      | 7.42      | 10.47
7                    | 3                           | 7.61      | 7.28      | 10.88
3                    | 29                          | 8.06      | 7.76      | 201.82

Could you please tell me whether I am missing some crucial step for using the INT8 mode?
Looking forward to your reply!

Thanks!

Hi,

Please note that not all layers run with TensorRT in TF-TRT.
Unsupported layers fall back to the TensorFlow implementation instead.

Based on your results, the acceleration from FP32 to FP16 is quite limited, which suggests that most layers are still running with the TensorFlow implementation.

So the problem is likely that inference switches between TRT and TF frequently.
In INT8 mode, each switch requires quantization and de-quantization, which lowers the performance.
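
You can check how much of the graph actually runs inside TensorRT by counting the TRTEngineOp nodes in the converted GraphDef, for example (a quick sketch; trt_graph here is the GraphDef returned by trt.create_inference_graph):

# Count how many nodes were replaced by TensorRT engines vs. left to TensorFlow.
trt_engine_ops = [n for n in trt_graph.node if n.op == 'TRTEngineOp']
print('TRTEngineOp nodes: %d' % len(trt_engine_ops))
print('Total nodes in converted graph: %d' % len(trt_graph.node))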

Thanks.