Human pose detection model - Issues with converted model output in Deepstream

Platform: Jetson Xavier NX - JetPack 4.6

I’m developing an application that depends on human body pose estimation models. I’m looking for an accurate, lightweight model that I can deploy on an edge computing device such as the Jetson Xavier.

After trying to run the model in Python/TensorRT, I was advised to deploy it using Deepstream to improve performance. For more information please refer to: Post.

Model

The model in question is MoveNet, which can be downloaded from TensorFlow Hub.
To build the TensorRT engine required by the NvInfer plugin in Deepstream, I converted the TF model to TensorRT in the following steps:

1. tf → onnx (x86)

Using the onnx/tensorflow-onnx conversion tool:

python3 -m tf2onnx.convert --saved-model ./movenet_singlepose_lightning_4/ --output mnli_nchw.onnx --inputs-as-nchw input

The layer order must be NCHW for Deepstream, so I set the --inputs-as-nchw argument for the input of the network.
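To verify the conversion before building the engine, a quick check like the following can be run on the x86 machine (a sketch, assuming the file name from the step above; the 192x192 input size applies to the Lightning variant):

```python
# Minimal sketch: confirm the converted graph really has an NCHW input.
import onnx

model = onnx.load("mnli_nchw.onnx")
onnx.checker.check_model(model)

inp = model.graph.input[0]
dims = [d.dim_value for d in inp.type.tensor_type.shape.dim]
print(inp.name, dims)  # expect NCHW, e.g. [1, 3, 192, 192] for MoveNet Lightning
```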

2. onnx → TensorRT (Xavier)

/usr/src/tensorrt/bin/trtexec --onnx=mnli_nchw.onnx --saveEngine=mnli_nchw.engine --verbose
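Before wiring the engine into Deepstream, it can be deserialized on the Xavier to confirm that the binding shapes survived the conversion (a sketch using the TensorRT 8.x Python API shipped with JetPack 4.6):

```python
# Sketch: load the serialized engine and list its input/output bindings.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("mnli_nchw.engine", "rb") as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

for i in range(engine.num_bindings):
    kind = "input" if engine.binding_is_input(i) else "output"
    print(kind, engine.get_binding_name(i), engine.get_binding_shape(i))
```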

Deepstream - Python Bindings

I used apps/deepstream-test1 from the deepstream_python_apps repo as a reference. In the modified script, the input is an H.264 video and inference is performed on each frame. A sink pad probe is used to access the metadata and tensor output produced by Gst-nvinfer, and the results are drawn on screen with the Gst-nvdsosd plugin. Please refer to the file:
ds_movenet_pipeline.py (10.9 KB)
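For context, the tensor-extraction part of such a probe typically looks like the sketch below. This is a hypothetical reconstruction rather than the attached script, and it assumes output-tensor-meta=1 is set in the nvinfer config:

```python
import ctypes

import gi
import numpy as np
import pyds

gi.require_version("Gst", "1.0")
from gi.repository import Gst


def sink_pad_buffer_probe(pad, info, u_data):
    batch_meta = pyds.gst_buffer_get_nvds_batch_meta(hash(info.get_buffer()))
    l_frame = batch_meta.frame_meta_list
    while l_frame is not None:
        frame_meta = pyds.NvDsFrameMeta.cast(l_frame.data)
        l_user = frame_meta.frame_user_meta_list
        while l_user is not None:
            user_meta = pyds.NvDsUserMeta.cast(l_user.data)
            if user_meta.base_meta.meta_type == pyds.NvDsMetaType.NVDSINFER_TENSOR_OUTPUT_META:
                tensor_meta = pyds.NvDsInferTensorMeta.cast(user_meta.user_meta_data)
                layer = pyds.get_nvds_LayerInfo(tensor_meta, 0)  # MoveNet's single output
                ptr = ctypes.cast(pyds.get_ptr(layer.buffer),
                                  ctypes.POINTER(ctypes.c_float))
                # Copy the [1, 17, 3] output out of the DeepStream-owned buffer:
                # rows = landmarks, columns = (y, x, score).
                keypoints = np.ctypeslib.as_array(ptr, shape=(17, 3)).copy()
                # ...drawing via nvdsosd display meta would go here...
            try:
                l_user = l_user.next
            except StopIteration:
                break
        try:
            l_frame = l_frame.next
        except StopIteration:
            break
    return Gst.PadProbeReturn.OK
```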

The configuration for the inference engine is given by:
ds_pgie_config.txt (2.6 KB)

The output tensor has dimension [1x17x3], with the first 2 channels of the last dimension being the (y, x) coordinates of the body landmarks (nose, left eye, right eye, …) and the third channel representing the prediction confidence scores. Ref

Note that in the script I split the tensor output into 2 arrays: a [17, 2]-shaped array with the (y, x) coordinates and a [17, 1] array with the score information, roughly as sketched below.
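In NumPy terms the split looks roughly like this (a sketch; `keypoints` stands in for the (17, 3) array read out of the tensor meta):

```python
import numpy as np

# Placeholder: in the pipeline this comes from the tensor-meta probe above.
keypoints = np.zeros((17, 3), dtype=np.float32)

coords = keypoints[:, :2]   # (17, 2) normalized (y, x) per landmark
scores = keypoints[:, 2:]   # (17, 1) confidence per landmark

# MoveNet coordinates are normalized to [0, 1]; scale to the frame size
# before drawing, e.g. for the 640x480 stream used here:
x_px = coords[:, 1] * 640
y_px = coords[:, 0] * 480
```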

However, the output of the network, as shown below, is wrong.

Screenshot 2022-07-22 at 11.51.15 AM

I was wondering whether someone could help me debug the application so that I can run this model on the Jetson Xavier.

Thanks a lot!

Hi @joaquinsd10 , could you confirm that the model can correctly output the keypoint information? Is the video resolution you set on the streammux (640x480) correct?
Also, you may refer to the link below. We have a pose estimation example written in C++.
https://github.com/NVIDIA-AI-IOT/deepstream_pose_estimation
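To double-check the streammux point above, the muxer properties in the Python bindings are set like this (a sketch following deepstream-test1; 640x480 are the values mentioned in this thread):

```python
import gi

gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)

# The muxer's output resolution should match what the rest of the pipeline
# (and the coordinate scaling) assumes.
streammux = Gst.ElementFactory.make("nvstreammux", "stream-muxer")
streammux.set_property("width", 640)
streammux.set_property("height", 480)
streammux.set_property("batch-size", 1)
streammux.set_property("batched-push-timeout", 4000000)
```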

The following is the output of the TensorFlow model when I run inference in a Jupyter Notebook on my desktop.

running_inf

I set the input resolution to 640x480 and scale the output coordinates accordingly.

I’d like to keep using the MoveNet model, as I’ve gotten good results with it for my application. I also plan to run the model on multiple platforms in the future, so I’d like to keep the models the same if possible.

Hi @joaquinsd10 , I attached our pose estimation demo link not to suggest changing your model, but so you can compare its config file with yours, e.g. network-mode (0: FP32, 1: INT8, 2: FP16) and network-type (0: Detector, 1: Classifier, 2: Segmentation). You should set the right parameters.
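For illustration, a config along these lines (a minimal sketch, not the attached ds_pgie_config.txt; the network-type=100 / output-tensor-meta=1 pairing follows the deepstream-ssd-parser sample, which also post-processes in Python) tells nvinfer to skip its built-in parsing and attach the raw tensor instead:

```
[property]
gpu-id=0
model-engine-file=mnli_nchw.engine
batch-size=1
# 0=FP32, 1=INT8, 2=FP16
network-mode=0
# 100 = "other": skip nvinfer's built-in detector/classifier parsing
network-type=100
# attach the raw output tensor so a pad probe can post-process it
output-tensor-meta=1
gie-unique-id=1
```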
Also, what’s your Deepstream version? We suggest you test with version 6.1.0.
You can refer to the link below for how to set up the config file and how to convert the coordinates.
https://github.com/NVIDIA-AI-IOT/deepstream_tao_apps/tree/master/apps/tao_others/deepstream-bodypose2d-app