SSD-MobilenetV2 bad performance on XavierNX using Tensorflow + TF_TRT

I tested the Xavier NX with TensorFlow, TF-TRT, OpenCV, and SSD-MobilenetV2 pretrained on the COCO dataset, and I was quite disappointed. I only get 10 fps with the attached sample video. The GPU does not seem to be heavily loaded.

Installed TensorFlow 1.15 following "Official TensorFlow for Jetson AGX XavierNX"
Installed OpenCV with CUDA support
Installed everything else following "How to configure your NVIDIA Jetson Nano for Computer Vision and Deep Learning - PyImageSearch"
Created an optimized TensorRT graph
Attached: the scripts I used and their terminal output, the sample video, and a screenshot of the jtop info screen
detect_realtime_nano.py (7.4 KB)
jtop_Info
Output_detect_realtime_nano.txt (6.0 KB)
Output_prepare_trt_graph.txt (28.3 KB)
prepare_trt_graph.py (2.2 KB)

Here is a demo where you can see the Jetson jtop stats during inference: https://share.icloud.com/photos/0O0SXTp9PBkj8SikiV3y-nvuw

What am I doing wrong? Or can someone confirm that this is the maximum performance of the XavierNX with this framework?
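
Since the GPU does not look busy, one way to narrow this down is to time each stage of the loop separately (frame read, inference, drawing). The following is only a minimal sketch: `StageTimer` and the three placeholder functions are hypothetical stand-ins, not code from the attached detect_realtime_nano.py, and the `time.sleep` calls merely simulate work.

```python
import time
from collections import defaultdict


class StageTimer:
    """Accumulates wall-clock time per named pipeline stage (hypothetical helper)."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    def measure(self, stage, func, *args, **kwargs):
        # Time one call of func and book it under the given stage name.
        start = time.perf_counter()
        result = func(*args, **kwargs)
        self.totals[stage] += time.perf_counter() - start
        self.counts[stage] += 1
        return result

    def report(self):
        # Average time per call in milliseconds for every stage seen so far.
        return {s: 1000.0 * self.totals[s] / self.counts[s] for s in self.totals}


# Placeholder stages; in a real script these would wrap the video read,
# the TensorFlow session call, and the OpenCV drawing code.
def read_frame():
    time.sleep(0.002)
    return "frame"


def run_inference(frame):
    time.sleep(0.005)
    return ["detection"]


def draw_results(frame, detections):
    time.sleep(0.001)
    return frame


timer = StageTimer()
for _ in range(5):
    frame = timer.measure("read", read_frame)
    detections = timer.measure("inference", run_inference, frame)
    timer.measure("draw", draw_results, frame, detections)

for stage, avg_ms in timer.report().items():
    print(f"{stage}: {avg_ms:.2f} ms per frame")
```

If the inference stage dominates, the model or its conversion is the place to look; if reading or drawing dominates, the bottleneck is outside the network.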

Hi,

Based on the log below:

2021-05-19 20:11:38.116810: I tensorflow/compiler/tf2tensorrt/segment/segment.cc:486] There are 1850 ops of 29 different types in the graph that are not converted to TensorRT: Fill, Merge, Switch, Range, ConcatV2, ZerosLike, Identity, NonMaxSuppressionV3, Minimum, StridedSlice, ExpandDims, Unpack, TopKV2, Cast, Transpose, Placeholder, ResizeBilinear, Squeeze, Mul, Sub, Const, Greater, Shape, Where, Reshape, NoOp, GatherV2, AddV2, Pack, (For more information see https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#supported-ops).

Many operations fall back to the TensorFlow implementation.
The data transfer cost increases when inference frequently switches between TensorFlow and TensorRT.
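
To see at a glance which op types stayed in TensorFlow, the segmenter summary line can be parsed directly. This is a rough sketch: the function name `unconverted_ops` is made up for illustration, and the regular expression assumes the exact message format shown in the log above, which may differ in other TensorFlow versions.

```python
import re


def unconverted_ops(log_line):
    """Return (op_count, sorted op types) from a TF-TRT segmenter summary line."""
    match = re.search(
        r"There are (\d+) ops of (\d+) different types in the graph "
        r"that are not converted to TensorRT: (.*?),\s*\(For more information",
        log_line,
    )
    if match is None:
        raise ValueError("not a TF-TRT segmenter summary line")
    op_count = int(match.group(1))
    op_types = [name.strip() for name in match.group(3).split(",") if name.strip()]
    return op_count, sorted(op_types)


# The summary line from the log above, with the timestamp prefix dropped.
LOG = (
    "There are 1850 ops of 29 different types in the graph that are not "
    "converted to TensorRT: Fill, Merge, Switch, Range, ConcatV2, ZerosLike, "
    "Identity, NonMaxSuppressionV3, Minimum, StridedSlice, ExpandDims, Unpack, "
    "TopKV2, Cast, Transpose, Placeholder, ResizeBilinear, Squeeze, Mul, Sub, "
    "Const, Greater, Shape, Where, Reshape, NoOp, GatherV2, AddV2, Pack, "
    "(For more information see https://docs.nvidia.com/deeplearning/frameworks/"
    "tf-trt-user-guide/index.html#supported-ops)."
)

count, types = unconverted_ops(LOG)
print(f"{count} ops of {len(types)} types stay in TensorFlow")

# Switch and Merge are TensorFlow 1.x control-flow ops, so their presence
# points at conditional logic in the graph rather than plain convolution layers.
control_flow = [t for t in types if t in {"Switch", "Merge"}]
print("control-flow ops:", control_flow)
```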

Is the TensorFlow interface essential for you?
If not, we recommend converting the model into a TensorRT engine for optimal performance.

In our benchmark results, pure TensorRT inference for SSD Mobilenet-V1 can reach 909 fps.
So you can expect a much better result than with TF-TRT.

Thanks.

"Many operations fall back to the TensorFlow implementation."
Why is that? I’m not doing anything special, just converting the standard MobileNet model.

"Is the TensorFlow interface essential for you? If not, we recommend converting the model into a TensorRT engine for optimal performance."
I thought that is what I am doing by using TF-TRT.
I want to perform transfer learning later, using a pretrained standard model and adding additional trainable layers. As I understand it, I need a framework like TensorFlow for this. What would be the recommended way to do it?
And can you confirm that 10 fps is really the maximum performance of SSD-MobilenetV2 using TensorFlow on the Xavier NX, even after optimization?

I would be very grateful for an answer.

Hi,

Please note that TF-TRT uses the parser embedded in the TensorFlow GitHub repository,
and its support matrix is relatively limited. Please find the details below:
https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#supported-ops

We recommend separating the training and inference stages.
You can train the model with TensorFlow and still deploy it with pure TensorRT.

Since pure TensorRT reaches much better performance on SSD Mobilenet-V1,
we recommend moving to pure TensorRT for inference.

Thanks.