Decreased performance from FP16 to INT8 in TF-TRT on Jetson Xavier

Jagoul · February 15, 2019, 2:07pm

Hello all,

I recently implemented the TF-TRT application provided by NVIDIA for FP16 inference on Jetson Xavier, using the following repository:

https://github.com/NVIDIA-AI-IOT/tf_trt_models

The average runtime was 22.5 ms, which is pretty good on the Xavier.

The ultimate goal is INT8 inference with TF-TRT at a reasonable speed. So I re-implemented the same repo with INT8 inference, but the average runtime became drastically slower: 2175 ms, roughly 2 seconds.

Could you please suggest a way to improve the speed? I am using ssd_mobilenetv1_coco for both implementations.

regards,
Raed

AastaLLL · February 18, 2019, 1:53am

Hi,

1) First, please remember to maximize the device performance:

sudo ./jetson_clocks.sh

2) How do you re-implement it for INT8 inference?
https://github.com/NVIDIA-AI-IOT/tf_trt_models#optimize-with-tensorrt

You can change it into INT8 by updating the configuration directly:

import tensorflow.contrib.tensorrt as trt

trt_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph,
    outputs=output_names,
    max_batch_size=1,
    max_workspace_size_bytes=1 << 25,
    precision_mode='INT8',
    minimum_segment_size=50
)

3) On Xavier, you can also try using DLA to offload work from the GPU.
However, DLA is not yet enabled in TensorFlow, so you will need to use pure TensorRT to access it.

Thanks.

Jagoul · February 18, 2019, 5:05pm

Hello AastaLLL,

For the re-implementation of TF-TRT I did the following:

1) Yes, I executed the ./jetson_clocks.sh script.
2) I modified the TRT graph configuration, setting INT8 instead of FP16, but the model became extremely slow, with a runtime of 2175 ms. My expectation is a real-time object detection application capable of processing each image within a 90 ms to 110 ms window.
3) Could you please elaborate on how to enable DLA using pure TensorRT?

Is there another trick that could decrease the average runtime?

Thanks for your guidance

AastaLLL · February 25, 2019, 3:48am

Hi,

This is not what we expected.
We will compare the performance between INT8 and FP16 for ssd_mobilenetv1_coco.

We will update you with more information later.
Thanks.

Jagoul · February 25, 2019, 1:50pm

Hello AastaLLL,

Thank you for the swift reply. I will be waiting for updates on this matter. Meanwhile, I will try testing the model with different hyperparameters and maybe play around with the architecture.

Regards,
Raed

AastaLLL · March 5, 2019, 8:39am

Hi,

Could you enable device placement logging and share the output with us?

sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

Thanks.

Jagoul · March 5, 2019, 1:27pm

Hi AastaLLL,

I have Python 3 and TensorFlow 1.12 installed on the Xavier, so my configuration is as follows:

nvidia@jetson-0423618000780:~/Projects/4.TF_TRT_models$ python3
Python 3.6.7 (default, Oct 22 2018, 11:32:17) 
[GCC 8.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
2019-03-05 08:31:31.646708: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:924] ARM64 does not support NUMA - returning NUMA node zero
2019-03-05 08:31:31.647097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: Xavier major: 7 minor: 2 memoryClockRate(GHz): 1.5
pciBusID: 0000:00:00.0
totalMemory: 15.46GiB freeMemory: 323.72MiB
2019-03-05 08:31:31.647210: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-03-05 08:31:34.132172: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-05 08:31:34.132575: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2019-03-05 08:31:34.132696: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2019-03-05 08:31:34.133256: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 146 MB memory) -> physical GPU (device: 0, name: Xavier, pci bus id: 0000:00:00.0, compute capability: 7.2)
Device mapping:
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Xavier, pci bus id: 0000:00:00.0, compute capability: 7.2
2019-03-05 08:31:34.135448: I tensorflow/core/common_runtime/direct_session.cc:307] Device mapping:
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Xavier, pci bus id: 0000:00:00.0, compute capability: 7.2

regards,
Raed

AastaLLL · March 11, 2019, 5:19am

Hi,

We are able to reproduce this issue on our side.
With Python 2.7, we got 114 ms for FP16 but 2158 ms for INT8.
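
For anyone comparing the two modes, average runtimes like these could be measured along the following lines. This is only a minimal sketch, not the script used in this thread: `average_runtime_ms`, `run_once`, `warmup` and `iterations` are hypothetical names, and `run_once` stands for whatever callable performs a single inference (for example a wrapped `sess.run` call). It uses `time.perf_counter`, so it assumes Python 3.

```python
import time


def average_runtime_ms(run_once, warmup=5, iterations=50):
    """Return the mean wall-clock time of run_once() in milliseconds.

    run_once is a hypothetical zero-argument callable that performs one
    inference. The warm-up calls are executed but not timed, so one-off
    initialization cost does not skew the average.
    """
    for _ in range(warmup):
        run_once()

    start = time.perf_counter()
    for _ in range(iterations):
        run_once()
    elapsed = time.perf_counter() - start

    return elapsed * 1000.0 / iterations
```

Excluding the first few runs is a design choice here, on the assumption that the first calls include setup work that should not count toward steady-state latency.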

We will update you once we find something.
Thanks.

Jagoul · March 11, 2019, 6:25pm

Thank you for the follow-up. I look forward to your results once you have a breakthrough.

regards,
Raed

AastaLLL · July 24, 2019, 3:11am

Hi,

We got some feedback from our internal team.

The script is using INT8 mode incorrectly.
INT8 mode in TF-TRT requires an additional calibration step. You are actually measuring the performance of the calibration graph.

The workflow for doing INT8 inference in TF-TRT for TF1.13 is as follows:

1) Create calibration graph: calib_graph = trt.create_inference_graph(frozen_graph, precision_mode='INT8', ...)
2) Create session and load calib graph
3) Run inference on small set of images using calib graph (10-500 images)
4) Convert calib graph to inference graph: trt_graph = trt.calib_graph_to_infer_graph(calib_graph)
5) Create session and load inference graph
6) Run inference
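
The first four steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a tested TF1.13 script: `build_int8_inference_graph` and `make_session_calibrator` are hypothetical helper names, the `trt` argument is expected to be the `tensorflow.contrib.tensorrt` module, and the conversion parameters are simply copied from the FP16 snippet earlier in this thread. The tensor names you pass in depend on your model.

```python
def build_int8_inference_graph(trt, frozen_graph, output_names, run_calibration):
    """Build a calibration graph, calibrate it, and convert it for inference.

    trt is assumed to be the tensorflow.contrib.tensorrt module.
    run_calibration is a callable that feeds sample images through the
    calibration graph it receives.
    """
    # Create the calibration graph. With precision_mode='INT8' this graph
    # is meant for calibration, not for measuring performance.
    calib_graph = trt.create_inference_graph(
        input_graph_def=frozen_graph,
        outputs=output_names,
        max_batch_size=1,
        max_workspace_size_bytes=1 << 25,
        precision_mode='INT8',
        minimum_segment_size=50,
    )

    # Load the calibration graph in a session and run the sample images.
    run_calibration(calib_graph)

    # Convert the calibrated graph into the actual INT8 inference graph.
    return trt.calib_graph_to_infer_graph(calib_graph)


def make_session_calibrator(images, input_tensor, output_tensors):
    """Hypothetical helper: return a callable that runs images through a graph.

    input_tensor and output_tensors are tensor names such as 'name:0';
    the right names depend on the model and are assumptions of the caller.
    """
    def run_calibration(calib_graph):
        import tensorflow as tf  # TF 1.x API, imported only when called

        with tf.Graph().as_default() as graph:
            tf.import_graph_def(calib_graph, name='')
            with tf.Session(graph=graph) as sess:
                for image in images:
                    sess.run(output_tensors, feed_dict={input_tensor: image})

    return run_calibration
```

The returned graph would then be loaded into a fresh session for the remaining two steps, and that is the graph whose runtime should be measured.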

Here are our TF-TRT object detection examples for your reference:
https://github.com/tensorflow/tensorrt/tree/master/tftrt/examples/object_detection

Thanks.

cpchiu · February 20, 2020, 2:43am

For TensorFlow 1.13, do we just need to run inference, without doing anything extra? Thanks.
It would be great to have more detailed code for running inference on the calibration graph in 1.13, as there is no material for TensorFlow 1.13 or below, only for 1.14, 1.15 and 2.0.

kjasper · October 15, 2020, 11:09pm

But DLA is not enabled in the TensorFlow yet, you will need to use pure TensorRT to access it.

Is there a timeline for NVDLA support in TF-TRT?
