Decreased performance from FP16 to INT8 in TF-TRT on Jetson Xavier

Hello all,

I recently implemented the TF-TRT application provided by NVIDIA for FP16 inference on Jetson Xavier using the following link:

The average runtime was 22.5 ms, which is pretty good on the Xavier.

The ultimate goal is to run TF-TRT INT8 inference at a reasonable speed. So I re-implemented the same repo with INT8 inference, but the average runtime became drastically slower: 2175 ms, about 2 seconds.

Could you please suggest a way to improve the speed? I am using ssd_mobilenetv1_coco for both implementations.

regards,
Raed

Hi,

1) First, please remember to maximize the device performance:

sudo ./jetson_clocks.sh

2) How did you re-implement it for INT8 inference?
https://github.com/NVIDIA-AI-IOT/tf_trt_models#optimize-with-tensorrt

You can switch to INT8 by updating the configuration directly:

import tensorflow.contrib.tensorrt as trt

trt_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph,
    outputs=output_names,
    max_batch_size=1,
    max_workspace_size_bytes=1 << 25,
    precision_mode='INT8',
    minimum_segment_size=50
)

3) On Xavier, you can also try using the DLA to offload work from the GPU.
But DLA is not enabled in TensorFlow yet, so you will need to use pure TensorRT to access it.
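
As a rough illustration of what accessing the DLA from pure TensorRT can look like, here is a minimal sketch using the TensorRT Python builder-config API. It assumes a TensorRT release that provides IBuilderConfig and a builder whose network has already been parsed elsewhere. The helper name make_dla_config is hypothetical and not part of TensorRT.

```python
def make_dla_config(builder, dla_core=0):
    """Return a TensorRT builder config that targets a DLA core.

    `builder` is assumed to be a tensorrt.Builder; `make_dla_config` is a
    hypothetical helper name used only for this sketch.
    """
    # Imported lazily so the function can be defined on machines
    # that do not have TensorRT installed.
    import tensorrt as trt

    config = builder.create_builder_config()

    # DLA runs layers in reduced precision, so enable FP16 (INT8 is the
    # other option, and it needs calibration data).
    config.set_flag(trt.BuilderFlag.FP16)

    # Place layers on the DLA by default and pick which core to use.
    config.default_device_type = trt.DeviceType.DLA
    config.DLA_core = dla_core

    # Let layers the DLA cannot run fall back to the GPU.
    config.set_flag(trt.BuilderFlag.GPU_FALLBACK)
    return config
```

The returned config would then be passed to the builder when building the engine. Whether a given network gains anything from the DLA depends on how many of its layers the DLA supports.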

Thanks.

Hello AastaLLL,

For the re-implementation of tf-trt I did the following:

  1. Yes, I executed the ./jetson_clocks.sh script.

  2. I modified the trt graph to use INT8 instead of FP16, but the model became extremely slow, with a runtime of 2175 ms. My goal is a real-time object detection application that can process each image within a window of 90 ms to 110 ms.

  3. Could you please elaborate on this point: how can I enable the DLA through the pure TensorRT framework?

Is there any other trick that could reduce the average runtime?

Thanks for your guidance

Hi,

This is not what we expected.
We will compare the performance of INT8 and FP16 for ssd_mobilenetv1_coco.

We will update you with more information later.
Thanks.

Hello AastaLLL,

Thank you for the swift reply. I will wait for updates on this matter. Meanwhile, I will try testing the model with different hyperparameters and maybe play around with the architecture.

Regards,
Raed

Hi,

Could you enable device placement logging and share the output with us?

sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

Thanks.

Hi AastaLLL,

I have Python 3 with TensorFlow 1.12 installed on the Xavier, so the output was as follows:

nvidia@jetson-0423618000780:~/Projects/4.TF_TRT_models$ python3
Python 3.6.7 (default, Oct 22 2018, 11:32:17) 
[GCC 8.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
2019-03-05 08:31:31.646708: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:924] ARM64 does not support NUMA - returning NUMA node zero
2019-03-05 08:31:31.647097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: Xavier major: 7 minor: 2 memoryClockRate(GHz): 1.5
pciBusID: 0000:00:00.0
totalMemory: 15.46GiB freeMemory: 323.72MiB
2019-03-05 08:31:31.647210: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-03-05 08:31:34.132172: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-05 08:31:34.132575: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2019-03-05 08:31:34.132696: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2019-03-05 08:31:34.133256: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 146 MB memory) -> physical GPU (device: 0, name: Xavier, pci bus id: 0000:00:00.0, compute capability: 7.2)
Device mapping:
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Xavier, pci bus id: 0000:00:00.0, compute capability: 7.2
2019-03-05 08:31:34.135448: I tensorflow/core/common_runtime/direct_session.cc:307] Device mapping:
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Xavier, pci bus id: 0000:00:00.0, compute capability: 7.2

regards,
Raed

Hi,

We are able to reproduce this issue on our side.
With Python 2.7, we got 114 ms for FP16 but 2158 ms for INT8.

We will update you once we find something.
Thanks.

Thank you for the follow-up. I look forward to your results once you have a breakthrough.

regards,
Raed

Hi,

We got some feedback from our internal team.

The script is using INT8 mode incorrectly.
INT8 mode in TF-TRT requires an additional calibration step. You are actually measuring the performance of the calibration graph.

The workflow for doing INT8 inference in TF-TRT for TF1.13 is as follows:

  • Create calibration graph: calib_graph = trt.create_inference_graph(frozen_graph, precision_mode='INT8', ...)
  • Create session and load calib graph
  • Run inference on a small set of images using calib graph (10-500 images)
  • Convert calib graph to inference graph: trt_graph = trt.calib_graph_to_infer_graph(calib_graph)
  • Create session and load inference graph
  • Run inference
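
The steps above can be sketched roughly as follows, using the tensorflow.contrib.tensorrt API from TF 1.x. This is a minimal sketch under several assumptions: the helper name build_int8_graph, the input_name argument, and the calib_images iterable are hypothetical, and the conversion parameters are simply copied from the snippet earlier in this thread.

```python
def build_int8_graph(frozen_graph, output_names, input_name, calib_images):
    """Calibrate a frozen TF 1.x graph and return the INT8 inference graph.

    `calib_images` is assumed to be an iterable of input batches
    (for example 10-500 preprocessed images, one batch at a time).
    """
    # Imported lazily: this contrib API only exists in TF 1.x builds
    # that were compiled with TensorRT support.
    import tensorflow as tf
    import tensorflow.contrib.tensorrt as trt

    # Step 1: with precision_mode='INT8' this returns a calibration graph,
    # not the final inference graph.
    calib_graph = trt.create_inference_graph(
        input_graph_def=frozen_graph,
        outputs=output_names,
        max_batch_size=1,
        max_workspace_size_bytes=1 << 25,
        precision_mode='INT8',
        minimum_segment_size=50,
    )

    # Steps 2-3: load the calibration graph and feed it a small image set.
    # These runs are slow and should not be timed as inference.
    graph = tf.Graph()
    with graph.as_default():
        tf.import_graph_def(calib_graph, name='')
    with tf.Session(graph=graph) as sess:
        feed = graph.get_tensor_by_name(input_name + ':0')
        fetches = [graph.get_tensor_by_name(n + ':0') for n in output_names]
        for batch in calib_images:
            sess.run(fetches, feed_dict={feed: batch})

    # Step 4: convert the calibrated graph into the INT8 inference graph.
    return trt.calib_graph_to_infer_graph(calib_graph)
```

The returned graph is the one to load into a fresh session for steps 5 and 6, and it is the only graph whose runtime should be compared against the FP16 numbers.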

Here are our TF-TRT object detection examples for your reference:
https://github.com/tensorflow/tensorrt/tree/master/tftrt/examples/object_detection

Thanks.

Do we just need to run inference, without any extra steps, for TensorFlow 1.13? Thanks.
It would be great to have more detailed code for running inference on the calibration graph in 1.13, since the available material covers only TensorFlow 1.14, 1.15 and 2.0, not 1.13 or below.

But DLA is not enabled in the TensorFlow yet, you will need to use pure TensorRT to access it.

Is there a timeline for NVDLA support in tftrt?