Human pose detection model (MoveNet) TensorRT conversion on NVIDIA Jetson

I’m working on a project that depends on a deep-learning pose-estimation model, TensorFlow’s MoveNet.
We are working with a Jetson Xavier NX Developer Kit.

We would like to run the model using TensorRT and for this purpose we tried the following conversion steps:

tflite -> ONNX32 -> ONNX16 -> TensorRT

Conversion from TFLite to ONNX was done with the conversion script from PINTO’s model zoo, linked here. To convert the model from FP32 to FP16 I used the pip package onnxmltools.
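
The FP32 -> FP16 step was essentially along these lines (a minimal sketch; the file names are placeholders for my actual model paths):

import onnxmltools
from onnxmltools.utils.float16_converter import convert_float_to_float16

# Load the FP32 ONNX model, cast it to FP16, and save the result
model_fp32 = onnxmltools.utils.load_model("model_float32.onnx")
model_fp16 = convert_float_to_float16(model_fp32)
onnxmltools.utils.save_model(model_fp16, "model_float16.onnx")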

Subsequently I ran the trtexec command on the Jetson to convert the ONNX model to a TensorRT engine:

/usr/src/tensorrt/bin/trtexec --onnx=model_float16.onnx --saveEngine=model_fp16.trt

However, the conversion tool returns an error, and it appears to be a problem when trying to cast down from INT64 to INT32. The error message refers to a particular node, Resize__242, and I provide screenshots of the node information obtained with a model visualization tool:

Input filename:   model_float16.onnx
ONNX IR version:  0.0.6
Opset version:    11
Producer name:    tf2onnx
Producer version: 1.9.3
Domain:
Model version:    0
Doc string:
----------------------------------------------------------------
[2022-05-29 15:02:49    INFO] [MemUsageChange] Init CUDA: CPU +473, GPU +0, now: CPU 489, GPU 1196 (MiB)
[2022-05-29 15:02:50    INFO] [MemUsageSnapshot] Begin constructing builder kernel library: CPU 489 MiB, GPU 1196 MiB
[2022-05-29 15:02:50    INFO] [MemUsageSnapshot] End constructing builder kernel library: CPU 643 MiB, GPU 1238 MiB
Parsing model
[2022-05-29 15:02:50 WARNING] onnx2trt_utils.cpp:370: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[2022-05-29 15:02:50   ERROR] (Unnamed Layer* 111) [Constant]:constant weights has count 0 but 1 was expected
While parsing node number 207 [Cast -> "Resize__242_input_cast_1"]:
--- Begin node ---
input: "roi__271"
output: "Resize__242_input_cast_1"
name: "Resize__242_input_cast1"
op_type: "Cast"
attribute {
  name: "to"
  i: 1
  type: INT
}
--- End node ---
ERROR: ModelImporter.cpp:179 In function parseGraph:
[6] Invalid Node - Resize__242_input_cast1
(Unnamed Layer* 111) [Constant]:constant weights has count 0 but 1 was expected

Resize__242 node (screenshots of the node attributes from the model visualization tool are attached).

Is there a problem with my FP16 ONNX model that causes this error? Is this particular operation not supported by TensorRT? Any tips or help would be greatly appreciated!

Your topic was posted in the wrong category. I am moving this to the Jetson Xavier NX category for visibility.


Hi,

Since TensorRT has a mechanism to cast the model to FP16 or INT8, would you mind testing whether ONNX32 -> TensorRT works?
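
For example (the ONNX file name here is just a placeholder for your model):

$ /usr/src/tensorrt/bin/trtexec --onnx=model_float32.onnx --saveEngine=model_fp16.trt --fp16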

Thanks.

Hello AastaLL,

I ran trtexec and passed the ONNX32 model as input. The conversion worked on the Xavier without any errors.

Next I tried to run the model using a Python script, which I attach to this post. It loads frames from a video, pre-processes the images, and performs inference to get the body landmarks. My script is supposed to draw the landmarks on an output frame as well as show the current FPS value.

The FPS is only about 1! This is very low compared to the much faster speeds I hoped to achieve once the model is converted to TensorRT. Could something be wrong with the way I load the TRT engine and perform the inference? For reference I attach the script that I use.

demo_singlepose_trt.py (6.4 KB)

FPS.py (957 Bytes)

To run it, create a folder called utils and paste the FPS.py Python file inside it. You should then be able to run python3 demo_singlepose_trt.py. You can change the variable input_fp to point to a suitable video of your choice.
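
In case it helps the discussion, the engine-loading and inference pattern I am using looks roughly like the sketch below (simplified, with pycuda; this is not the exact attached script, and the engine path and buffer names are illustrative):

import numpy as np
import tensorrt as trt
import pycuda.autoinit  # creates a CUDA context
import pycuda.driver as cuda

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Deserialize the engine built by trtexec (path is illustrative)
with open("model_fp32.trt", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# Allocate page-locked host buffers and device buffers for every binding
inputs, outputs, bindings = [], [], []
stream = cuda.Stream()
for binding in engine:
    size = trt.volume(engine.get_binding_shape(binding))
    dtype = trt.nptype(engine.get_binding_dtype(binding))
    host_mem = cuda.pagelocked_empty(size, dtype)
    device_mem = cuda.mem_alloc(host_mem.nbytes)
    bindings.append(int(device_mem))
    (inputs if engine.binding_is_input(binding) else outputs).append((host_mem, device_mem))

def infer(frame):
    # frame: preprocessed numpy array matching the input binding's shape/dtype
    np.copyto(inputs[0][0], frame.ravel())
    cuda.memcpy_htod_async(inputs[0][1], inputs[0][0], stream)
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    for host_mem, device_mem in outputs:
        cuda.memcpy_dtoh_async(host_mem, device_mem, stream)
    stream.synchronize()
    return [host_mem for host_mem, _ in outputs]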

Conversion on x86-64
In parallel I’ve been trying to run the model on my workstation (x86-64) by converting it with the TensorRT Backend For ONNX in the TensorRT-OSS build container. After converting the ONNX32 model, I try to run the engine in another container, NGC tensorrt:22.05-py3, but I get the following error:

[05/31/2022-02:53:12] [TRT] [E] 3: Cannot find binding of given name: input
[05/31/2022-02:53:12] [TRT] [E] 3: [executionContext.cpp::setBindingDimensions::928] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::setBindingDimensions::928, condition: mEngine.bindingIndexBelongsToProfile( bindingIndex, mOptimizationProfile, "IExecutionContext::setBindingDimensions")
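
It seems the engine does not expose a binding literally named input. To check which binding names the engine actually has, I can inspect it with a small snippet like this (the engine path is a placeholder):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
with open("mv_lightning.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

# Print every binding: index, name, input/output, and shape
for i in range(engine.num_bindings):
    kind = "input" if engine.binding_is_input(i) else "output"
    print(i, engine.get_binding_name(i), kind, engine.get_binding_shape(i))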

I’ve been working with both the Jetson board and my workstation, and in this last section I discuss difficulties in converting MoveNet on a platform other than the Jetson Xavier. Do you think I should post it in another topic?

Hi,

You can use the same topic for this.

Could you profile the network with trtexec first? It should show some inference profiling results.
For example:

$ /usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/mnist/mnist.onnx
...
[06/06/2022-06:37:49] [I] === Performance summary ===
[06/06/2022-06:37:49] [I] Throughput: 21456.3 qps
[06/06/2022-06:37:49] [I] Latency: min = 0.032959 ms, max = 0.105469 ms, mean = 0.0356163 ms, median = 0.0354004 ms, percentile(99%) = 0.0411072 ms
[06/06/2022-06:37:49] [I] Enqueue Time: min = 0.0213623 ms, max = 0.0722656 ms, mean = 0.0231733 ms, median = 0.0229492 ms, percentile(99%) = 0.02771 ms
[06/06/2022-06:37:49] [I] H2D Latency: min = 0.00268555 ms, max = 0.0361328 ms, mean = 0.00406139 ms, median = 0.00415039 ms, percentile(99%) = 0.00476074 ms
[06/06/2022-06:37:49] [I] GPU Compute Time: min = 0.027832 ms, max = 0.0783691 ms, mean = 0.0296538 ms, median = 0.029541 ms, percentile(99%) = 0.0322266 ms
[06/06/2022-06:37:49] [I] D2H Latency: min = 0.0012207 ms, max = 0.0146484 ms, mean = 0.00190149 ms, median = 0.00183105 ms, percentile(99%) = 0.00439453 ms
[06/06/2022-06:37:49] [I] Total Host Walltime: 3.00005 s
[06/06/2022-06:37:49] [I] Total GPU Compute Time: 1.90882 s
[06/06/2022-06:37:49] [W] * GPU compute time is unstable, with coefficient of variance = 3.11681%.
[06/06/2022-06:37:49] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[06/06/2022-06:37:49] [I] Explanations of the performance metrics are printed in the verbose logs.
[06/06/2022-06:37:49] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8400] # /usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/mnist/mnist.onnx

It’s also recommended to test it with --fp16 or --int8:

$ /usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/mnist/mnist.onnx --fp16
$ /usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/mnist/mnist.onnx --int8

Also, please remember to maximize your device performance first.
This can be done by running the following commands:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Thanks.

Hello AastaLL,

Using trtexec and running the profiling tests, we get the following:

FP16

trtexec --loadEngine=trt_model/mv_lightning_fp16.engine

[06/06/2022-16:53:26] [I] === Performance summary ===
[06/06/2022-16:53:26] [I] Throughput: 497.66 qps
[06/06/2022-16:53:26] [I] Latency: min = 1.8024 ms, max = 15.8726 ms, mean = 1.96301 ms, median = 1.90088 ms, percentile(99%) = 2.04663 ms
[06/06/2022-16:53:26] [I] End-to-End Host Latency: min = 1.80756 ms, max = 15.8884 ms, mean = 1.97097 ms, median = 1.90775 ms, percentile(99%) = 2.17896 ms
[06/06/2022-16:53:26] [I] Enqueue Time: min = 0.938721 ms, max = 16.2527 ms, mean = 1.16295 ms, median = 1.02881 ms, percentile(99%) = 2.83414 ms
[06/06/2022-16:53:26] [I] H2D Latency: min = 0.0211182 ms, max = 0.0859375 ms, mean = 0.0240434 ms, median = 0.0229492 ms, percentile(99%) = 0.0281372 ms
[06/06/2022-16:53:26] [I] GPU Compute Time: min = 1.77563 ms, max = 15.7936 ms, mean = 1.93673 ms, median = 1.87494 ms, percentile(99%) = 2.02173 ms
[06/06/2022-16:53:26] [I] D2H Latency: min = 0.00146484 ms, max = 0.332764 ms, mean = 0.0022402 ms, median = 0.00180054 ms, percentile(99%) = 0.0022583 ms
[06/06/2022-16:53:26] [I] Total Host Walltime: 3.00406 s
[06/06/2022-16:53:26] [I] Total GPU Compute Time: 2.89541 s

INT8

trtexec --loadEngine=trt_model/mv_lightning_int8.engine

[06/06/2022-16:57:09] [I] === Performance summary ===
[06/06/2022-16:57:09] [I] Throughput: 592.577 qps
[06/06/2022-16:57:09] [I] Latency: min = 1.43872 ms, max = 30.7449 ms, mean = 1.65716 ms, median = 1.56552 ms, percentile(99%) = 2.74823 ms
[06/06/2022-16:57:09] [I] End-to-End Host Latency: min = 1.45007 ms, max = 30.7653 ms, mean = 1.66662 ms, median = 1.57373 ms, percentile(99%) = 2.83728 ms
[06/06/2022-16:57:09] [I] Enqueue Time: min = 0.93457 ms, max = 31.7563 ms, mean = 1.19099 ms, median = 1.02924 ms, percentile(99%) = 3.43677 ms
[06/06/2022-16:57:09] [I] H2D Latency: min = 0.0237732 ms, max = 1.07031 ms, mean = 0.0273041 ms, median = 0.0261841 ms, percentile(99%) = 0.0429688 ms
[06/06/2022-16:57:09] [I] GPU Compute Time: min = 1.41165 ms, max = 30.6662 ms, mean = 1.62093 ms, median = 1.53691 ms, percentile(99%) = 2.61005 ms
[06/06/2022-16:57:09] [I] D2H Latency: min = 0.00195312 ms, max = 10.9327 ms, mean = 0.00892758 ms, median = 0.00244141 ms, percentile(99%) = 0.0265503 ms
[06/06/2022-16:57:09] [I] Total Host Walltime: 3.00383 s
[06/06/2022-16:57:09] [I] Total GPU Compute Time: 2.88525 s

In both cases the mean latency is below 2 ms, which means the inference itself should be able to run at well over 100 Hz. However, when I run the models using Python and the tensorrt libraries, I get only about 11-13 FPS with either the FP16 or the INT8 model. The performance is still not great.
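
To narrow down where the time goes, I can time each stage of the per-frame loop separately, roughly like this (a sketch; preprocess, infer, and draw_landmarks are placeholders for what the script actually does):

import time
import cv2

cap = cv2.VideoCapture("input.mp4")  # placeholder video path
while True:
    t0 = time.perf_counter()
    ok, frame = cap.read()
    if not ok:
        break
    t1 = time.perf_counter()
    inp = preprocess(frame)                 # resize/normalize (placeholder)
    t2 = time.perf_counter()
    keypoints = infer(inp)                  # TensorRT inference (placeholder)
    t3 = time.perf_counter()
    out = draw_landmarks(frame, keypoints)  # draw results (placeholder)
    cv2.imshow("pose", out)
    cv2.waitKey(1)
    t4 = time.perf_counter()
    print(f"read {1e3*(t1-t0):.1f} ms | pre {1e3*(t2-t1):.1f} ms | "
          f"infer {1e3*(t3-t2):.1f} ms | draw/show {1e3*(t4-t3):.1f} ms")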

Python Script

movenet_singlepose_trt.py is the main Python file; it loads the engine and the input video from which the frames are taken, and runs the inference.

  • In line 18, you should be able to set the path to the TRT model.
  • The function getPoseFromVideo(video_path) is used to load a video using OpenCV and run the inference on the video frames.
  • Place the file FPS.py inside utils/

movenet_singlepose_trt.py (7.0 KB)
FPS.py (957 Bytes)

Models:
model_float32.onnx (9.0 MB)
mv_lightning_fp16.engine (7.7 MB)
mv_lightning_int8.engine (5.1 MB)

Is there anything that I’m not doing correctly in this script?

Thanks a lot!

Hi,

The latency is usually caused by slow pre-processing or post-processing.
Do you read the camera or show the output with OpenCV?

If yes, would you mind switching to our DeepStream library? It uses hardware for pre-processing and post-processing.

Thanks.