Best practice inference of TensorFlow object detection models on Jetson devices

Hello there,

We have a Jetson Xavier NX where we want to run Object Detection Models.

My main question is: what is the best practice for running “bigger” object detection models at 30 FPS and above?
- Specifically, with which framework is it recommended to train models for the easiest conversion into the recommended final model format used for inference?
- Which rather “big” models, e.g. ResNet50_640x640, are recommended that still yield 30 FPS or more?
- Which final model format is most recommended for inference on Jetson boards?
- Is there anything “hidden” to consider in order to use the Jetson Xavier NX to its full potential?

I figured this shouldn't be too hard, looking at the benchmarks from this link: Jetson Benchmark
But I ran into a lot of problems trying it the following way.

Since I already know TensorFlow, I trained the following three models from the TensorFlow 2 Model Zoo:

[SSD ResNet50 V1 FPN 640x640 (RetinaNet50)]
[SSD MobileNet V2 FPNLite 320x320]
[CenterNet Resnet50 V1 FPN 512x512]

For inference on Jetson devices I read that TensorRT engines would be the way to go for maximum FPS, so I tried to convert the models with TF-TRT. With ResNet50 the conversion never worked because of OOM errors, despite me increasing the swap memory to 16 GB. I tried limiting the TensorFlow memory to values between 500 MB and 4 GB, and “max_workspace_size_bytes” to values between 50 MB and 2 GB.
For MobileNetV2 the conversion didn't work properly because there wasn't enough memory for the “tactics” it wanted to build.
Following: TF-TRT Documentation

Then I tried pure TensorRT by converting the TensorFlow models from SavedModel format to ONNX via tf2onnx, since this is the more recommended way. I ran constant folding on the ONNX model using Polygraphy and then tried to convert it to TensorRT, where I ran into the following error: Unsupported ONNX data type: UINT8. I understand I would have to replace the input layer with ONNX GraphSurgeon and then try again.
Following: TensorRT Documentation

At this point I just tried to run the models in SavedModel format with TensorFlow, to see whether the models can even be loaded on the Jetson Xavier NX. Even with the smallest model, MobileNetV2, I only achieved 12 FPS, even though the Jetson Xavier NX is benchmarked at 909 FPS for MobileNetV1_300x300. That number was obviously obtained under perfect conditions with an int8-quantized TensorRT engine, but the difference in FPS is nevertheless huge.
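For reference, an FPS number like the one above can be measured with a simple timing loop along these lines (a sketch; `infer` stands in for the loaded SavedModel's serving function, which is not shown here):

```python
import time

def measure_fps(infer, batch, warmup=10, runs=100):
    """Average frames per second of `infer` over `runs` calls, after `warmup` calls."""
    for _ in range(warmup):          # let lazy initialization and caching settle
        infer(batch)
    start = time.perf_counter()
    for _ in range(runs):
        infer(batch)
    elapsed = time.perf_counter() - start
    return runs / elapsed

# e.g. fps = measure_fps(lambda x: detect_fn(x), dummy_image)
```

The warmup calls matter on Jetson: the first few inferences include graph tracing and memory allocation and would otherwise drag the average down.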

I also tried converting the models into the quantized TFLite format for faster inference, but got stuck on the way there with a segmentation fault.

Inferencing the ONNX models, I got worse FPS than with the SavedModel format in TensorFlow. So in a last effort I tried to infer multiple images at once by passing a batch of 4 images in one array, editing the ONNX model's input shape to [4,320,320,3] instead of [1,320,320,3] (and the outputs accordingly) with the onnx.tools.update_model_dims.update_inputs_outputs_dims function. Strangely, this function doesn't allow changing an already fixed value, which is the batch size of 1 in my case.

Hi,

Have you maximized the device performance first?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

We tested the SSD MobileNet-V1 300x300 model from Jetson Benchmarks.
The performance can reach ~503 fps on Xavier NX + JetPack 4.6.1 in the batch-size=1 case.

$ wget https://www.dropbox.com/s/gx5zayt76vszhpo/ssd-mobilenet-v1.zip
$ unzip ssd-mobilenet-v1.zip 
$ /usr/src/tensorrt/bin/trtexec --onnx=ssd-mobilenet-v1-bs1.onnx --best
$ /usr/src/tensorrt/bin/trtexec --onnx=ssd-mobilenet-v1-bs1.onnx --int8
...
[03/14/2022-10:45:12] [I] === Performance summary ===
[03/14/2022-10:45:12] [I] Throughput: 502.808 qps
[03/14/2022-10:45:12] [I] Latency: min = 1.92212 ms, max = 2.01172 ms, mean = 1.97724 ms, median = 1.97717 ms, percentile(99%) = 1.99341 ms
[03/14/2022-10:45:12] [I] End-to-End Host Latency: min = 1.93555 ms, max = 2.02466 ms, mean = 1.98797 ms, median = 1.98796 ms, percentile(99%) = 2.00757 ms
[03/14/2022-10:45:12] [I] Enqueue Time: min = 0.835938 ms, max = 2.04401 ms, mean = 0.927525 ms, median = 0.915039 ms, percentile(99%) = 1.08826 ms
[03/14/2022-10:45:12] [I] H2D Latency: min = 0.0507202 ms, max = 0.0574951 ms, mean = 0.0519587 ms, median = 0.0518799 ms, percentile(99%) = 0.053772 ms
[03/14/2022-10:45:12] [I] GPU Compute Time: min = 1.85437 ms, max = 1.9436 ms, mean = 1.90882 ms, median = 1.90869 ms, percentile(99%) = 1.92505 ms
[03/14/2022-10:45:12] [I] D2H Latency: min = 0.0153809 ms, max = 0.0178223 ms, mean = 0.0164697 ms, median = 0.0164795 ms, percentile(99%) = 0.0170898 ms
[03/14/2022-10:45:12] [I] Total Host Walltime: 3.00313 s
[03/14/2022-10:45:12] [I] Total GPU Compute Time: 2.88232 s
[03/14/2022-10:45:12] [I] Explanations of the performance metrics are printed in the verbose logs.
[03/14/2022-10:45:12] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8201] # /usr/src/tensorrt/bin/trtexec --onnx=ssd-mobilenet-v1-bs1.onnx --best
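If you want to compare such numbers across runs or models, the throughput line can be pulled out of the trtexec log with a few lines of Python (a small helper sketch; the log format is the one shown above):

```python
import re

def parse_throughput(log: str):
    """Extract the qps value from a trtexec '=== Performance summary ===' log, or None."""
    match = re.search(r"Throughput:\s*([\d.]+)\s*qps", log)
    return float(match.group(1)) if match else None

# parse_throughput(open("trtexec.log").read())
```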

Based on your use case, you might want to try our TAO Toolkit together with the DeepStream SDK.

TAO allows you to train/fine-tune a model with a custom dataset.
After that, you can convert the model into a TensorRT engine and deploy it with DeepStream.

Please refer to the following link for the details:
NVIDIA TAO Toolkit: TAO Toolkit | NVIDIA Developer
NVIDIA DeepStream SDK: NVIDIA DeepStream SDK | NVIDIA Developer

Thanks.

Hi AastaLLL

Thank you very much for your quick response!!! I considered your suggestions.

I had not run these commands; I had just selected the 20W 6-core mode manually because it seemed like the best-performing mode. Running these commands changed it to 15W 2-core, which changed nothing regarding FPS.

I tested my Xavier with your benchmark and also got around 500 FPS, so everything seems fine with the hardware.

I started to read into TAO and am trying to train a model with it. The documentation is very limited and scattered, though.
I used the following documentation: TAO_Launcher, DetectNet_V2 dataset_convert, TAO on AWS
Currently I am stuck trying to run the following command:

tao detectnet_v2 dataset_convert -d workspace/tao-experiments/specs -o workspace/tao-experiments/output

it returns this error:

2022-03-16 15:17:17,620 [INFO] root: Registry: [‘nvcr.io’]
2022-03-16 15:17:17,700 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.4-py3
2022-03-16 15:17:17,712 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the “user”:“UID:GID” in the
DockerOptions portion of the “/home/ubuntu/.tao_mounts.json” file. You can obtain your
users UID and GID by using the “id -u” and “id -g” commands on the
terminal.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
Traceback (most recent call last):
File “/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/dataset_convert.py”, line 130, in
File “/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/dataset_convert.py”, line 119, in
File “/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/dataset_convert.py”, line 110, in main
FileNotFoundError: [Errno 2] No such file or directory: ‘workspace/tao-experiments/specs’

From reading several posts with the same error, it seems to be a problem with my .tao_mounts.json file.
The file did not exist at first, so I added it myself:

{
    "Mounts": [
        {
            "source": "/home/ubuntu/projects/UseCase1/Badge1/DetectNetV2_ResNet34/dataset/",
            "destination": "/workspace/tao-experiments"
        },
        {
            "source": "/home/ubuntu/projects/UseCase1/Badge1/DetectNetV2_ResNet34/dataset/config/",
            "destination": "/workspace/tao-experiments/specs"
        }
    ]
}

Do you have any suggestions on fixing this error, i.e. getting the TAO launcher to work with the proper .tao_mounts.json file?
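One way to narrow such errors down is a small stdlib script that sanity-checks the mounts file: every source should exist on the host, and every destination should be an absolute container path (a hypothetical helper; the file path is the one from the warning in the log):

```python
import json
import os

def check_tao_mounts(path):
    """Return a list of problems found in a .tao_mounts.json file."""
    problems = []
    with open(path) as f:
        config = json.load(f)
    for mount in config.get("Mounts", []):
        if not os.path.isdir(mount["source"]):
            problems.append(f"source does not exist: {mount['source']}")
        if not mount["destination"].startswith("/"):
            problems.append(f"destination is not absolute: {mount['destination']}")
    return problems

# print(check_tao_mounts("/home/ubuntu/.tao_mounts.json"))
```

Note also that the path in the dataset_convert command above (`workspace/tao-experiments/specs`) is relative, while the mount destinations are absolute; judging by the FileNotFoundError, the tool may expect `/workspace/tao-experiments/specs` with a leading slash.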

With this process being not quite straightforward and the documentation hard to understand and find, it would be nice if I could just get my already trained and well-working ResNet50 model to run with good FPS on the Jetson Xavier NX. Do you have any recommendations on how to get good inference with TensorFlow models on Jetson boards?

Hi,

If your jobs mainly run on GPU, please use MAXN mode for the maximal GPU frequency.

For TAO, please note that training jobs are not supported on Jetson.
Please run the training in a desktop environment and then copy the model to Jetson for inference.

If you already have a TensorFlow model, you might try to convert it into ONNX format.
ONNX-based models are supported by TensorRT, so you can deploy them on Jetson directly.

Below is an example that should give you some idea:

Thanks.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.