Deploy Object Detection TF-TRT INT8 with DS Triton

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU) T4
• DeepStream Version DeepStream Triton via container deepstream:5.0.1-20.09-triton
• JetPack Version (valid for Jetson only)
• TensorRT Version TensorRT 7.0
• NVIDIA GPU Driver Version (valid for GPU only) 450.51.06
• Issue Type( questions, new requirements, bugs)

I need to deploy the optimized TF-TRT INT8 model faster_rcnn_inception_v2_coco_2018_01_28 using the DeepStream-Triton container. I am using this blog as an example: https://developer.nvidia.com/blog/deploying-models-from-tensorflow-model-zoo-using-deepstream-and-triton-inference-server/, but the referenced script doesn't include an option to optimize the model as TF-TRT INT8.

What script is recommended to convert the model to TF-TRT INT8? I have used this script https://github.com/tensorflow/tensorrt/blob/master/tftrt/examples/object_detection/object_detection.py and am seeing performance degradation.

Hi @virsg,
DS-Triton doesn't support online TF-TRT INT8 builds; only FP32/FP16 are supported.
But DS-Triton can use offline prebuilt TF-TRT INT8 model files. That is, you can refer to https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html to build an INT8 SavedModel and pass that SavedModel to DS-Triton (dsnvinferserver).
Note, current DS (DS 5.x) only supports TF 1.x.
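
For reference, a rough TF 1.x sketch of such an offline INT8 build (the paths, the calibration feed, and the tensor names are placeholders based on the standard TF Object Detection API exports; see the TF-TRT user guide linked above for details):

import numpy as np
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Build a TF-TRT INT8 SavedModel offline, then point the Triton model
# repository used by DS-Triton at the saved output directory.
converter = trt.TrtGraphConverter(
    input_saved_model_dir="faster_rcnn_inception_v2/1",        # placeholder path
    precision_mode="INT8",
    max_batch_size=8,
    maximum_cached_engines=1,
    use_calibration=True)
converter.convert()

def feed_dict_fn():
    # Replace the random data with real, representative frames for calibration.
    batch = np.random.randint(0, 255, (8, 600, 1024, 3), dtype=np.uint8)
    return {"image_tensor:0": batch}

converter.calibrate(
    fetch_names=["detection_boxes:0", "detection_scores:0",
                 "detection_classes:0", "num_detections:0"],
    num_runs=10,
    feed_dict_fn=feed_dict_fn)
converter.save("faster_rcnn_inception_v2_trt_int8/1")          # placeholder path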

Hi @mchi, in fact I created the offline prebuilt TF-TRT INT8 model and passed the saved model to DS-Triton (dsnvinferserver), but I am seeing performance degradation. To build the INT8 model I used this script, https://github.com/tensorflow/tensorrt/blob/master/tftrt/examples/object_detection/object_detection.py, which implements the build step, together with the docker image nvcr.io/nvidia/tensorflow:20.02-tf2-py3 (since with TF 1.x the script threw errors).

What do you think could be the reason for the performance degradation? Building the model with TF 2.x instead of TF 1.x? See the deployment performance below, with Streams=1, BS=4, Count instance=1:
TF FP32: 21 fps
TF-TRT FP16: 55 fps
TF-TRT INT8: 34 fps

How about the perf if you just use TF-TRT to do the inference?

The script I am using, https://github.com/tensorflow/tensorrt/blob/master/tftrt/examples/object_detection/object_detection.py, shows this performance for TF-TRT INT8: images/sec: 45.

What script do you recommend to convert the object detection model to TF-TRT INT8 with NMS implementation?

The script should be fine.

I think the possible reason INT8 is slower than FP32 is that, with INT8 running on TRT and FP32 on TF, there are extra format conversions compared with running FP32 on both TRT and TF.

To dig out more clues about the perf difference, I think you could:

  1. use TensorBoard to check whether the same layers are running on TF and on TRT for INT8 and FP32; you may also find that information in the verbose build log (see the sketch after this list),
  2. use Nsight Systems to profile the inference part and find out in detail where INT8 is slower than FP32.
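
For the first point, a quick sketch (assuming TF 1.x SavedModels tagged with "serve"; the paths are placeholders) to compare how many segments actually became TRTEngineOp nodes in the INT8 build vs. the FP32 build:

import collections
import tensorflow as tf

def summarize(saved_model_dir):
    # Load the converted SavedModel and count TRTEngineOp segments vs. nodes left on TF.
    with tf.compat.v1.Session(graph=tf.Graph()) as sess:
        meta_graph = tf.compat.v1.saved_model.loader.load(sess, ["serve"], saved_model_dir)
    ops = collections.Counter(node.op for node in meta_graph.graph_def.node)
    print(saved_model_dir,
          "- TRTEngineOp segments:", ops["TRTEngineOp"],
          "- other TF nodes:", sum(ops.values()) - ops["TRTEngineOp"])

summarize("faster_rcnn_trt_fp32/1")   # placeholder paths
summarize("faster_rcnn_trt_int8/1")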

And, note

  1. As the perf data in https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#verified-models shows, TF-TRT INT8 is not always faster than FP32.
  2. Current DeepStream only supports TF 1.x.
  3. TRT supports NMS; is it possible to convert your model to ONNX and run it with TRT?

Hi @mchi, I was able to optimize the faster_rcnn_inception_v2 model to TF-TRT INT8 with NMS enabled (ops placed on the CPU) using TF 1.5.2 and the script https://github.com/tensorflow/tensorrt/tree/r1.14+/tftrt/examples/object_detection. So I got a performance improvement with NMS enabled vs. NMS disabled:
TF-TRT INT8 (NMS enabled): ~96 FPS
TF-TRT INT8 (no NMS): ~43 FPS

The model was optimized with batch_size=8, image_shape=[600, 600], and minimum_segment_size=50. For the DS-Triton deployment, max_batch_size=8.
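
For reference, a minimal Triton config.pbtxt sketch for that deployment with max_batch_size=8; the tensor names follow the standard TF Object Detection API exports, and the data types and dims are assumptions to adapt to your exported graph:

name: "faster_rcnn_inception_v2"
platform: "tensorflow_savedmodel"
max_batch_size: 8
input [
  {
    name: "image_tensor"
    data_type: TYPE_UINT8
    format: FORMAT_NHWC
    dims: [ 600, 1024, 3 ]   # assumed; match what DeepStream feeds the model
  }
]
output [
  {
    name: "detection_boxes"
    data_type: TYPE_FP32
    dims: [ 100, 4 ]
  },
  {
    name: "detection_scores"
    data_type: TYPE_FP32
    dims: [ 100 ]
  },
  {
    name: "detection_classes"
    data_type: TYPE_FP32
    dims: [ 100 ]
  },
  {
    name: "num_detections"
    data_type: TYPE_FP32
    dims: [ 1 ]
  }
]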

The issue now is that when deploying the model to DeepStream-Triton, I get the error below, Input shape axis 0 must equal 8, got shape [5,600,1024,3] (even though the model was optimized with BS=8):

I0112 01:06:22.313573 2643 model_repository_manager.cc:837] successfully loaded 'faster_rcnn_inception_v2' version 13
INFO: infer_trtis_backend.cpp:206 TrtISBackend id:1 initialized model: faster_rcnn_inception_v2
2021-01-12 01:06:36.202139: I tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:733] Building a new TensorRT engine for TRTEngineOp_0 input shapes: [[8,600,1024,3]]
2021-01-12 01:06:36.202311: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.7
2021-01-12 01:06:36.203128: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.7
2021-01-12 01:09:20.678239: W tensorflow/compiler/tf2tensorrt/utils/trt_logger.cc:37] DefaultLogger Current optimization profile is: 0. Please ensure there are no enqueued operations pending in this context prior to switching profiles
2021-01-12 01:09:20.709545: I tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:733] Building a new TensorRT engine for TRTEngineOp_1 input shapes: [[800,14,14,576]]
2021-01-12 01:10:01.273658: W tensorflow/compiler/tf2tensorrt/utils/trt_logger.cc:37] DefaultLogger Current optimization profile is: 0. Please ensure there are no enqueued operations pending in this context prior to switching profiles

Runtime commands:
        h: Print this help
        q: Quit

        p: Pause
        r: Resume


**PERF:  FPS 0 (Avg)
**PERF:  0.00 (0.00)
** INFO: <bus_callback:181>: Pipeline ready

** INFO: <bus_callback:167>: Pipeline running

ERROR: infer_trtis_server.cpp:276 TRTIS: failed to get response status, trtis_err_str:INTERNAL, err_msg:2 root error(s) found.
  (0) Invalid argument: Input shape axis 0 must equal 8, got shape [5,600,1024,3]
         [[{{node Preprocessor/unstack}}]]
  (1) Invalid argument: Input shape axis 0 must equal 8, got shape [5,600,1024,3]
         [[{{node Preprocessor/unstack}}]]
         [[ExpandDims_4/_199]]
0 successful operations.
0 derived errors ignored.
ERROR: infer_trtis_backend.cpp:532 TRTIS server failed to parse response with request-id:1 model:
0:03:46.539871495  2643 0x7f0cf80022a0 WARN           nvinferserver gstnvinferserver.cpp:519:gst_nvinfer_server_push_buffer:<primary_gie> error: inference failed with unique-id:1
ERROR from primary_gie: inference failed with unique-id:1
Debug info: gstnvinferserver.cpp(519): gst_nvinfer_server_push_buffer (): /GstPipeline:pipeline/GstBin:primary_gie_bin/GstNvInferServer:primary_gie
Quitting
ERROR: infer_trtis_server.cpp:276 TRTIS: failed to get response status, trtis_err_str:INTERNAL, err_msg:2 root error(s) found.
  (0) Invalid argument: Input shape axis 0 must equal 8, got shape [5,600,1024,3]
         [[{{node Preprocessor/unstack}}]]
  (1) Invalid argument: Input shape axis 0 must equal 8, got shape [5,600,1024,3]
         [[{{node Preprocessor/unstack}}]]
         [[ExpandDims_4/_199]]
0 successful operations.
0 derived errors ignored.
ERROR: infer_trtis_backend.cpp:532 TRTIS server failed to parse response with request-id:2 model:
ERROR from qtdemux0: Internal data stream error.
Debug info: qtdemux.c(6073): gst_qtdemux_loop (): /GstPipeline:pipeline/GstBin:multi_src_bin/GstBin:src_sub_bin0/GstURIDecodeBin:src_elem/GstDecodeBin:decodebin0/GstQTDemux:qtdemux0:
streaming stopped, reason custom-error (-112)
I0112 01:10:01.644682 2643 model_repository_manager.cc:708] unloading: faster_rcnn_inception_v2:13
I0112 01:10:01.917792 2643 model_repository_manager.cc:816] successfully unloaded 'faster_rcnn_inception_v2' version 13
I0112 01:10:01.918447 2643 server.cc:179] Waiting for in-flight inferences to complete.
I0112 01:10:01.918460 2643 server.cc:194] Timeout 30: Found 0 live models and 0 in-flight requests
App run failed

Any recommendation on how to fix the input shape issue?

Sorry for the delay!
I still don't have clear clues about this issue; I will continue to look into it.
BTW, this error can also be found reported elsewhere on the web.

Hi @mchi, it seems there is an issue with the TensorFlow Object Detection API producing incomplete input shapes when exporting the graph; reported issue at https://github.com/tensorflow/models/issues/6159.
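
For example, a small sketch to check what input shape the exported graph actually carries (frozen_inference_graph.pb is a placeholder for your own export; the same check works on the SavedModel's graph_def):

import tensorflow as tf

# Print the shape attribute of every Placeholder to see whether the batch
# dimension (and the spatial dims) were frozen during export.
graph_def = tf.compat.v1.GraphDef()
with tf.io.gfile.GFile("frozen_inference_graph.pb", "rb") as f:   # placeholder path
    graph_def.ParseFromString(f.read())

for node in graph_def.node:
    if node.op == "Placeholder":
        print(node.name, node.attr["shape"].shape)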

I need to optimize the model as INT8 with the NMS ops placed on the CPU and deploy it with DS-Triton; what do you recommend?