Human pose detection model (MoveNet) TensorRT conversion on NVIDIA Jetson

I’m working on a project that depends on a deep-learning pose-estimation model, TensorFlow’s MoveNet.
We are working with a Jetson Xavier NX Developer Kit.

We would like to run the model using TensorRT and for this purpose we tried the following conversion steps:

tflite -> ONNX32 -> ONNX16 -> TensorRT

Conversion from TFLite to ONNX was done with the conversion script from PINTO’s model zoo, linked here. To convert the model from FP32 to FP16 I used the pip package onnxmltools.
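
The FP32 -> FP16 step was essentially along these lines (a minimal sketch; the file names are placeholders for my actual model paths):

import onnxmltools
from onnxmltools.utils.float16_converter import convert_float_to_float16

# Load the FP32 ONNX model, cast it to FP16, and save the result
model_fp32 = onnxmltools.utils.load_model("model_float32.onnx")
model_fp16 = convert_float_to_float16(model_fp32)
onnxmltools.utils.save_model(model_fp16, "model_float16.onnx")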

Subsequently I ran the trtexec command on the Jetson to convert the ONNX model to a TensorRT engine:

/usr/src/tensorrt/bin/trtexec --onnx=model_float16.onnx --saveEngine=model_fp16.trt

However, the conversion tool returns an error, and it appears to be a problem when trying to cast down from INT64 to INT32. The error message refers to a particular node, Resize__242, and I provide screenshots of the node information obtained with a model visualization tool:

Input filename:   model_float16.onnx
ONNX IR version:  0.0.6
Opset version:    11
Producer name:    tf2onnx
Producer version: 1.9.3
Domain:
Model version:    0
Doc string:
----------------------------------------------------------------
[2022-05-29 15:02:49    INFO] [MemUsageChange] Init CUDA: CPU +473, GPU +0, now: CPU 489, GPU 1196 (MiB)
[2022-05-29 15:02:50    INFO] [MemUsageSnapshot] Begin constructing builder kernel library: CPU 489 MiB, GPU 1196 MiB
[2022-05-29 15:02:50    INFO] [MemUsageSnapshot] End constructing builder kernel library: CPU 643 MiB, GPU 1238 MiB
Parsing model
[2022-05-29 15:02:50 WARNING] onnx2trt_utils.cpp:370: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[2022-05-29 15:02:50   ERROR] (Unnamed Layer* 111) [Constant]:constant weights has count 0 but 1 was expected
While parsing node number 207 [Cast -> "Resize__242_input_cast_1"]:
--- Begin node ---
input: "roi__271"
output: "Resize__242_input_cast_1"
name: "Resize__242_input_cast1"
op_type: "Cast"
attribute {
  name: "to"
  i: 1
  type: INT
}
--- End node ---
ERROR: ModelImporter.cpp:179 In function parseGraph:
[6] Invalid Node - Resize__242_input_cast1
(Unnamed Layer* 111) [Constant]:constant weights has count 0 but 1 was expected

Resize__242 node (screenshots of the node attributes from the model visualization tool are attached).

Is there a problem with my FP16 ONNX model that causes this error? Is this particular operation not supported by TensorRT? Any tips or help would be greatly appreciated!

Your topic was posted in the wrong category. I am moving this to the Jetson Xavier NX category for visibility.


Hi,

Since TensorRT has a mechanism to cast the model to FP16 or INT8, would you mind testing whether ONNX32 -> TensorRT works?
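
For example (the ONNX file name here is just a placeholder for your model):

$ /usr/src/tensorrt/bin/trtexec --onnx=model_float32.onnx --saveEngine=model_fp16.trt --fp16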

Thanks.

Hello AastaLL,

I ran trtexec and passed the ONNX32 model as input. The conversion worked on the Xavier without any errors.

Next I tried to run the model using a Python script, which I attach to this post. It loads frames from a video, pre-processes the images, and performs inference to get the body landmarks. My script is supposed to draw the landmarks on an output frame as well as show the current FPS value.

The FPS is only about 1! This is very low compared to the much faster speeds I hoped to achieve once the model is converted to TensorRT. Could something be wrong with the way I load the TRT engine and perform the inference? For reference I attach the script that I use.

demo_singlepose_trt.py (6.4 KB)

FPS.py (957 Bytes)

To run it, create a folder called utils and paste the FPS.py Python file inside it. You should then be able to run python3 demo_singlepose_trt.py. You can change the variable input_fp to point to a suitable video of your choice.
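
In case it helps the discussion, the engine-loading and inference pattern I am using looks roughly like the sketch below (simplified, with pycuda; this is not the exact attached script, and the engine path and buffer names are illustrative):

import numpy as np
import tensorrt as trt
import pycuda.autoinit  # creates a CUDA context
import pycuda.driver as cuda

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Deserialize the engine built by trtexec (path is illustrative)
with open("model_fp32.trt", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# Allocate page-locked host buffers and device buffers for every binding
inputs, outputs, bindings = [], [], []
stream = cuda.Stream()
for binding in engine:
    size = trt.volume(engine.get_binding_shape(binding))
    dtype = trt.nptype(engine.get_binding_dtype(binding))
    host_mem = cuda.pagelocked_empty(size, dtype)
    device_mem = cuda.mem_alloc(host_mem.nbytes)
    bindings.append(int(device_mem))
    (inputs if engine.binding_is_input(binding) else outputs).append((host_mem, device_mem))

def infer(frame):
    # frame: preprocessed numpy array matching the input binding's shape/dtype
    np.copyto(inputs[0][0], frame.ravel())
    cuda.memcpy_htod_async(inputs[0][1], inputs[0][0], stream)
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    for host_mem, device_mem in outputs:
        cuda.memcpy_dtoh_async(host_mem, device_mem, stream)
    stream.synchronize()
    return [host_mem for host_mem, _ in outputs]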

Conversion on x86-64
In parallel I’ve been trying to run the model on my workstation (x86-64) by converting it with the TensorRT Backend For ONNX in the TensorRT-OSS build container. After converting the ONNX32 model, I try to run the engine in another container, NGC tensorrt:22.05-py3, but I get the following error:

[05/31/2022-02:53:12] [TRT] [E] 3: Cannot find binding of given name: input
[05/31/2022-02:53:12] [TRT] [E] 3: [executionContext.cpp::setBindingDimensions::928] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::setBindingDimensions::928, condition: mEngine.bindingIndexBelongsToProfile( bindingIndex, mOptimizationProfile, "IExecutionContext::setBindingDimensions")
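
It seems the engine does not expose a binding literally named input. To check which binding names the engine actually has, I can inspect it with a small snippet like this (the engine path is a placeholder):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
with open("mv_lightning.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

# Print every binding: index, name, input/output, and shape
for i in range(engine.num_bindings):
    kind = "input" if engine.binding_is_input(i) else "output"
    print(i, engine.get_binding_name(i), kind, engine.get_binding_shape(i))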

I’ve been working with both the Jetson board and my workstation, and in this last section I discuss difficulties in converting MoveNet on a platform other than the Jetson Xavier. Do you think I should post it in another topic?

Hi,

You can use the same topic for this.

Could you profile the network with trtexec first? It should show some inference profiling results.
For example:

$ /usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/mnist/mnist.onnx
...
[06/06/2022-06:37:49] [I] === Performance summary ===
[06/06/2022-06:37:49] [I] Throughput: 21456.3 qps
[06/06/2022-06:37:49] [I] Latency: min = 0.032959 ms, max = 0.105469 ms, mean = 0.0356163 ms, median = 0.0354004 ms, percentile(99%) = 0.0411072 ms
[06/06/2022-06:37:49] [I] Enqueue Time: min = 0.0213623 ms, max = 0.0722656 ms, mean = 0.0231733 ms, median = 0.0229492 ms, percentile(99%) = 0.02771 ms
[06/06/2022-06:37:49] [I] H2D Latency: min = 0.00268555 ms, max = 0.0361328 ms, mean = 0.00406139 ms, median = 0.00415039 ms, percentile(99%) = 0.00476074 ms
[06/06/2022-06:37:49] [I] GPU Compute Time: min = 0.027832 ms, max = 0.0783691 ms, mean = 0.0296538 ms, median = 0.029541 ms, percentile(99%) = 0.0322266 ms
[06/06/2022-06:37:49] [I] D2H Latency: min = 0.0012207 ms, max = 0.0146484 ms, mean = 0.00190149 ms, median = 0.00183105 ms, percentile(99%) = 0.00439453 ms
[06/06/2022-06:37:49] [I] Total Host Walltime: 3.00005 s
[06/06/2022-06:37:49] [I] Total GPU Compute Time: 1.90882 s
[06/06/2022-06:37:49] [W] * GPU compute time is unstable, with coefficient of variance = 3.11681%.
[06/06/2022-06:37:49] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[06/06/2022-06:37:49] [I] Explanations of the performance metrics are printed in the verbose logs.
[06/06/2022-06:37:49] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8400] # /usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/mnist/mnist.onnx

It’s also recommended to test it with --fp16 or --int8:

$ /usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/mnist/mnist.onnx --fp16
$ /usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/mnist/mnist.onnx --int8

Also, please remember to maximize your device performance first.
This can be done by running the following commands:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Thanks.

Hello AastaLL,

Using trtexec and running the profiling tests, we get the following:

FP16

trtexec --loadEngine=trt_model/mv_lightning_fp16.engine

[06/06/2022-16:53:26] [I] === Performance summary ===
[06/06/2022-16:53:26] [I] Throughput: 497.66 qps
[06/06/2022-16:53:26] [I] Latency: min = 1.8024 ms, max = 15.8726 ms, mean = 1.96301 ms, median = 1.90088 ms, percentile(99%) = 2.04663 ms
[06/06/2022-16:53:26] [I] End-to-End Host Latency: min = 1.80756 ms, max = 15.8884 ms, mean = 1.97097 ms, median = 1.90775 ms, percentile(99%) = 2.17896 ms
[06/06/2022-16:53:26] [I] Enqueue Time: min = 0.938721 ms, max = 16.2527 ms, mean = 1.16295 ms, median = 1.02881 ms, percentile(99%) = 2.83414 ms
[06/06/2022-16:53:26] [I] H2D Latency: min = 0.0211182 ms, max = 0.0859375 ms, mean = 0.0240434 ms, median = 0.0229492 ms, percentile(99%) = 0.0281372 ms
[06/06/2022-16:53:26] [I] GPU Compute Time: min = 1.77563 ms, max = 15.7936 ms, mean = 1.93673 ms, median = 1.87494 ms, percentile(99%) = 2.02173 ms
[06/06/2022-16:53:26] [I] D2H Latency: min = 0.00146484 ms, max = 0.332764 ms, mean = 0.0022402 ms, median = 0.00180054 ms, percentile(99%) = 0.0022583 ms
[06/06/2022-16:53:26] [I] Total Host Walltime: 3.00406 s
[06/06/2022-16:53:26] [I] Total GPU Compute Time: 2.89541 s

INT8

trtexec --loadEngine=trt_model/mv_lightning_int8.engine

[06/06/2022-16:57:09] [I] === Performance summary ===
[06/06/2022-16:57:09] [I] Throughput: 592.577 qps
[06/06/2022-16:57:09] [I] Latency: min = 1.43872 ms, max = 30.7449 ms, mean = 1.65716 ms, median = 1.56552 ms, percentile(99%) = 2.74823 ms
[06/06/2022-16:57:09] [I] End-to-End Host Latency: min = 1.45007 ms, max = 30.7653 ms, mean = 1.66662 ms, median = 1.57373 ms, percentile(99%) = 2.83728 ms
[06/06/2022-16:57:09] [I] Enqueue Time: min = 0.93457 ms, max = 31.7563 ms, mean = 1.19099 ms, median = 1.02924 ms, percentile(99%) = 3.43677 ms
[06/06/2022-16:57:09] [I] H2D Latency: min = 0.0237732 ms, max = 1.07031 ms, mean = 0.0273041 ms, median = 0.0261841 ms, percentile(99%) = 0.0429688 ms
[06/06/2022-16:57:09] [I] GPU Compute Time: min = 1.41165 ms, max = 30.6662 ms, mean = 1.62093 ms, median = 1.53691 ms, percentile(99%) = 2.61005 ms
[06/06/2022-16:57:09] [I] D2H Latency: min = 0.00195312 ms, max = 10.9327 ms, mean = 0.00892758 ms, median = 0.00244141 ms, percentile(99%) = 0.0265503 ms
[06/06/2022-16:57:09] [I] Total Host Walltime: 3.00383 s
[06/06/2022-16:57:09] [I] Total GPU Compute Time: 2.88525 s

In both cases the mean latency is below 2 ms, which means the inference itself should be able to run at well over 100 Hz. However, when I run the models using Python and the tensorrt libraries, I get only about 11-13 FPS with either the FP16 or the INT8 model. The performance is still not great.
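
To narrow down where the time goes, I can time each stage of the per-frame loop separately, roughly like this (a sketch; preprocess, infer, and draw_landmarks are placeholders for what the script actually does):

import time
import cv2

cap = cv2.VideoCapture("input.mp4")  # placeholder video path
while True:
    t0 = time.perf_counter()
    ok, frame = cap.read()
    if not ok:
        break
    t1 = time.perf_counter()
    inp = preprocess(frame)                 # resize/normalize (placeholder)
    t2 = time.perf_counter()
    keypoints = infer(inp)                  # TensorRT inference (placeholder)
    t3 = time.perf_counter()
    out = draw_landmarks(frame, keypoints)  # draw results (placeholder)
    cv2.imshow("pose", out)
    cv2.waitKey(1)
    t4 = time.perf_counter()
    print(f"read {1e3*(t1-t0):.1f} ms | pre {1e3*(t2-t1):.1f} ms | "
          f"infer {1e3*(t3-t2):.1f} ms | draw/show {1e3*(t4-t3):.1f} ms")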

Python Script

movenet_singlepose_trt.py is the main Python file; it loads the engine and the input video from which the frames are taken, and runs the inference.

  • In line 18, you should be able to set the path to the TRT model.
  • The function getPoseFromVideo(video_path) is used to load a video using OpenCV and run the inference on the video frames.
  • Place the file FPS.py inside utils/

movenet_singlepose_trt.py (7.0 KB)
FPS.py (957 Bytes)

Models:
model_float32.onnx (9.0 MB)
mv_lightning_fp16.engine (7.7 MB)
mv_lightning_int8.engine (5.1 MB)

Is there anything that I’m not doing correctly in this script?

Thanks a lot!

Hi,

The latency is usually caused by slow pre-processing or post-processing.
Do you read the camera or show the output with OpenCV?

If yes, would you mind switching to our DeepStream library? It uses hardware for pre-processing and post-processing.

Thanks.