Human pose detection model - Issues with converted model output in Deepstream

Platform: Jetson Xavier NX - JetPack 4.6

I’m developing an application that depends on human body pose estimation models. I’m looking for an accurate, lightweight model that I can deploy on an edge computing device such as the Jetson Xavier.

After trying to run the model in Python/TensorRT, I was advised to deploy it using Deepstream to improve performance. For more information please refer to: Post.

Model

The model in question is MoveNet, which can be downloaded from TensorFlow Hub.
To build the TensorRT engine required by the NvInfer plugin in Deepstream, I converted the TF model to TensorRT in the following steps:

1. tf → onnx (x86)

Using the onnx/tensorflow-onnx conversion tool:

python3 -m tf2onnx.convert --saved-model ./movenet_singlepose_lightning_4/ --output mnli_nchw.onnx --inputs-as-nchw input

The layer order must be NCHW for Deepstream, so I set the --inputs-as-nchw argument for the input of the network.
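To verify the conversion before building the engine, a quick check like the following can be run on the x86 machine (a sketch, assuming the file name from the step above; the 192x192 input size applies to the Lightning variant):

```python
# Minimal sketch: confirm the converted graph really has an NCHW input.
import onnx

model = onnx.load("mnli_nchw.onnx")
onnx.checker.check_model(model)

inp = model.graph.input[0]
dims = [d.dim_value for d in inp.type.tensor_type.shape.dim]
print(inp.name, dims)  # expect NCHW, e.g. [1, 3, 192, 192] for MoveNet Lightning
```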

2. onnx → TensorRT (Xavier)

/usr/src/tensorrt/bin/trtexec --onnx=mnli_nchw.onnx --saveEngine=mnli_nchw.engine --verbose
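Before wiring the engine into Deepstream, it can be deserialized on the Xavier to confirm that the binding shapes survived the conversion (a sketch using the TensorRT 8.x Python API shipped with JetPack 4.6):

```python
# Sketch: load the serialized engine and list its input/output bindings.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("mnli_nchw.engine", "rb") as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

for i in range(engine.num_bindings):
    kind = "input" if engine.binding_is_input(i) else "output"
    print(kind, engine.get_binding_name(i), engine.get_binding_shape(i))
```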

Deepstream - Python Bindings

I used apps/deepstream-test1 from the deepstream_python_apps repo as a reference. In the modified script, the input is an H.264 video and inference is performed on each frame. A sink pad probe is used to access the metadata and tensor output produced by Gst-nvinfer, and the results are drawn on screen with the Gst-nvdsosd plugin. Please refer to the file:
ds_movenet_pipeline.py (10.9 KB)
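For context, the tensor-extraction part of such a probe typically looks like the sketch below. This is a hypothetical reconstruction rather than the attached script, and it assumes output-tensor-meta=1 is set in the nvinfer config:

```python
import ctypes

import gi
import numpy as np
import pyds

gi.require_version("Gst", "1.0")
from gi.repository import Gst


def sink_pad_buffer_probe(pad, info, u_data):
    batch_meta = pyds.gst_buffer_get_nvds_batch_meta(hash(info.get_buffer()))
    l_frame = batch_meta.frame_meta_list
    while l_frame is not None:
        frame_meta = pyds.NvDsFrameMeta.cast(l_frame.data)
        l_user = frame_meta.frame_user_meta_list
        while l_user is not None:
            user_meta = pyds.NvDsUserMeta.cast(l_user.data)
            if user_meta.base_meta.meta_type == pyds.NvDsMetaType.NVDSINFER_TENSOR_OUTPUT_META:
                tensor_meta = pyds.NvDsInferTensorMeta.cast(user_meta.user_meta_data)
                layer = pyds.get_nvds_LayerInfo(tensor_meta, 0)  # MoveNet's single output
                ptr = ctypes.cast(pyds.get_ptr(layer.buffer),
                                  ctypes.POINTER(ctypes.c_float))
                # Copy the [1, 17, 3] output out of the DeepStream-owned buffer:
                # rows = landmarks, columns = (y, x, score).
                keypoints = np.ctypeslib.as_array(ptr, shape=(17, 3)).copy()
                # ...drawing via nvdsosd display meta would go here...
            try:
                l_user = l_user.next
            except StopIteration:
                break
        try:
            l_frame = l_frame.next
        except StopIteration:
            break
    return Gst.PadProbeReturn.OK
```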

The configuration for the inference engine is given by:
ds_pgie_config.txt (2.6 KB)

The output tensor has dimension [1x17x3], with the first 2 channels of the last dimension being the (y, x) coordinates of the body landmarks (nose, left eye, right eye, …) and the third channel representing the prediction confidence scores. Ref

Note that in the script I split the tensor output into 2 arrays: a [17, 2]-shaped array with the (y, x) coordinates and a [17, 1] array with the score information, roughly as sketched below.
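In NumPy terms the split looks roughly like this (a sketch; `keypoints` stands in for the (17, 3) array read out of the tensor meta):

```python
import numpy as np

# Placeholder: in the pipeline this comes from the tensor-meta probe above.
keypoints = np.zeros((17, 3), dtype=np.float32)

coords = keypoints[:, :2]   # (17, 2) normalized (y, x) per landmark
scores = keypoints[:, 2:]   # (17, 1) confidence per landmark

# MoveNet coordinates are normalized to [0, 1]; scale to the frame size
# before drawing, e.g. for the 640x480 stream used here:
x_px = coords[:, 1] * 640
y_px = coords[:, 0] * 480
```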

However, the output of the network, as shown below, is wrong.

Screenshot 2022-07-22 at 11.51.15 AM

I was wondering whether someone could help me debug the application so that I can run this model on the Jetson Xavier.

Thanks a lot!

Hi @joaquinsd10 , could you confirm that the model can correctly output the keypoint information? Is the video resolution you set on the streammux (640x480) correct?
Also, you may refer to the link below. We have a pose estimation example written in C++.
https://github.com/NVIDIA-AI-IOT/deepstream_pose_estimation
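To double-check the streammux point above, the muxer properties in the Python bindings are set like this (a sketch following deepstream-test1; 640x480 are the values mentioned in this thread):

```python
import gi

gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)

# The muxer's output resolution should match what the rest of the pipeline
# (and the coordinate scaling) assumes.
streammux = Gst.ElementFactory.make("nvstreammux", "stream-muxer")
streammux.set_property("width", 640)
streammux.set_property("height", 480)
streammux.set_property("batch-size", 1)
streammux.set_property("batched-push-timeout", 4000000)
```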

The following is the output of the TensorFlow model when I run inference in a Jupyter Notebook on my desktop.

running_inf

I set the input resolution to 640x480 and scale the output coordinates accordingly.

I’d like to keep using the MoveNet model, as I’ve gotten good results with it for my application. I also plan to run the model on multiple platforms in the future, so I’d like to keep the models the same if possible.

Hi @joaquinsd10 , I attached our pose estimation demo link not to suggest changing your model, but so you can compare its config file with yours, e.g. network-mode (0: FP32, 1: INT8, 2: FP16) and network-type (0: Detector, 1: Classifier, 2: Segmentation). You should set the right parameters.
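For illustration, a config along these lines (a minimal sketch, not the attached ds_pgie_config.txt; the network-type=100 / output-tensor-meta=1 pairing follows the deepstream-ssd-parser sample, which also post-processes in Python) tells nvinfer to skip its built-in parsing and attach the raw tensor instead:

```
[property]
gpu-id=0
model-engine-file=mnli_nchw.engine
batch-size=1
# 0=FP32, 1=INT8, 2=FP16
network-mode=0
# 100 = "other": skip nvinfer's built-in detector/classifier parsing
network-type=100
# attach the raw output tensor so a pad probe can post-process it
output-tensor-meta=1
gie-unique-id=1
```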
Also, what’s your Deepstream version? We suggest you test with version 6.1.0.
You can refer to the link below for how to set up the config file and how to convert the coordinates.
https://github.com/NVIDIA-AI-IOT/deepstream_tao_apps/tree/master/apps/tao_others/deepstream-bodypose2d-app