Converting human pose estimation model weights to ONNX results in nonsensical pose estimation

Dear NVIDIA Developers,

I’m having issues converting the pose estimation model weights to the ONNX format. I’m referring to Step 2 of the blog post that explains how to create a human pose estimation application with DeepStream. If I use the existing pose_estimation.onnx model from the DeepStream Human Pose Estimation GitHub repo, the output of the pose estimation application is reasonably good. However, when I convert the model weights to ONNX myself and edit the configuration file (deepstream_pose_estimation_config.txt) to use those weights, the output becomes so much worse that it is unusable. Below I share every step I take to convert the model weights to ONNX so that you can reproduce the error.

Hardware information:

Hardware Platform (Jetson / GPU): Tesla K80
DeepStream Version: None needed to reproduce this bug
TensorRT Version: None needed to reproduce this bug
NVIDIA GPU Driver Version (valid for GPU only): 455.32.00
Issue Type (questions, new requirements, bugs): Bugs

How to reproduce the issue?

Here are the detailed steps to reproduce this bug:

  1. I first download the Docker image which I will use to convert the model weights to ONNX. I download the 19.12 version of the PyTorch Docker container from NVIDIA NGC with the command: sudo docker pull nvcr.io/nvidia/pytorch:19.12-py3. I noticed other versions of this container give me the error Cuda error: no kernel image is available for execution on the device.
  2. I run the Docker container with the following command: sudo docker run --gpus all -it -v /home:/home nvcr.io/nvidia/pytorch:19.12-py3. Now I’m inside the Docker container.
  3. Now I uninstall and reinstall some packages, because otherwise I get errors from gcc when trying to build torch2trt. The commands I run are below:
    a. pip uninstall torch
    b. pip install torch
    c. pip uninstall torchvision
    d. pip install torchvision
    e. pip uninstall tqdm
    f. pip install tqdm
    g. pip uninstall cython
    h. pip install cython
    i. pip uninstall pycocotools
    j. pip install pycocotools
  4. I execute the command: git clone https://github.com/NVIDIA-AI-IOT/torch2trt
  5. I execute the command: cd torch2trt
  6. I execute the command: python3 setup.py install --plugins
  7. I exit the torch2trt repository by executing the following command: cd ..
  8. I execute the command: git clone https://github.com/NVIDIA-AI-IOT/trt_pose.git
  9. I execute the command: cd trt_pose
  10. I execute the command: python setup.py install
  11. I execute the command: cd ..
  12. I download the model weights from https://drive.google.com/file/d/1XYDdCUdiF2xxx4rznmLb62SdOUZuoNbd/view. I do this by executing the command: wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1XYDdCUdiF2xxx4rznmLb62SdOUZuoNbd' -O resnet18_baseline_att_224x224_A_epoch_249.pth
  13. I copy the downloaded model weights to trt_pose/tasks/human_pose with the command cp resnet18_baseline_att_224x224_A_epoch_249.pth ./trt_pose/tasks/human_pose
  14. I go to trt_pose/tasks/human_pose and copy the model weights and the human_pose.json file to the trt_pose/trt_pose/utils directory with the commands below. If I don’t do this and try to run python export_for_isaac.py --input_checkpoint ../../tasks/human_pose/resnet18_baseline_att_224x224_A_epoch_249.pth, I get the following error message: Input model is not specified and can not be inferenced from the name of the checkpoint ../../tasks/human_pose/resnet18_baseline_att_224x224_A_epoch_249.pth. Please specify the model name (trt_pose.models function name). If I copy just the model weights and not human_pose.json, I get the error message: Input topology human_pose.json is not a valid (.json) file.
    a. cd trt_pose/tasks/human_pose/
    b. cp resnet18_baseline_att_224x224_A_epoch_249.pth human_pose.json ../../trt_pose/utils/
  15. I go to the directory trt_pose/trt_pose/utils with the command cd ../../trt_pose/utils/
  16. I execute the command: chmod +x export_for_isaac.py
  17. I execute the command: python export_for_isaac.py --input_checkpoint resnet18_baseline_att_224x224_A_epoch_249.pth. This is successful. I get the message: Successfully completed convertion of resnet18_baseline_att_224x224_A_epoch_249.pth to resnet18_baseline_att_224x224_A_epoch_249.onnx.
  18. I copy the resulting file, resnet18_baseline_att_224x224_A_epoch_249.onnx, to my virtual machine via scp.
  19. I exit the Docker container by executing the command exit.

When I execute all of the steps above, I get the weights in the ONNX format. But as I noted in the introduction, even though the conversion is successful, the output of the pose estimation application is much worse when I use the resnet18_baseline_att_224x224_A_epoch_249.onnx file converted with the steps above than when I use the pose_estimation.onnx file from the DeepStream Human Pose Estimation GitHub repo.

What is going on here? What am I doing wrong in the model weights conversion process? How do I fix it?

Best regards

Hi,

The steps you shared look correct to us.
A possible issue is a pre-generated engine file, which prevents DeepStream from recompiling the model from the ONNX file.

Would you mind checking whether there is any file named *.engine?
If yes, please delete the file and rerun the pipeline.

Thanks.

Hello @AastaLLL,

I tried your suggestion when running the pose estimation app, but it doesn’t fix the problem. The .engine file is not there before I run the app for the first time, and even if I delete it after running the app, the nonsensical output persists. To remind you, the problem is somewhere in the weight conversion process from .pth to .onnx, because if I use the default pose_estimation.onnx file from the GitHub repository of the pose estimation app, I get good results.

I also tried converting the model weights from .pth to .onnx while keeping all the files in the trt_pose/tasks/human_pose directory and running the script with the following arguments:

export_for_isaac.py --input_checkpoint ../../tasks/human_pose/resnet18_baseline_att_224x224_A_epoch_249.pth --input_model resnet18_baseline --input_width 224 --input_height 224 --input_topology ../../tasks/human_pose/human_pose.json

I then hit some errors while loading the state_dict, but I worked around them by changing line 119 of the export_for_isaac.py script from:

model.load_state_dict(torch.load(args.input_checkpoint))

to

model.load_state_dict(torch.load(args.input_checkpoint), strict=False)
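
One thing worth noting about strict=False: it only silences the mismatch. Any checkpoint keys that fail to match the model are simply skipped, and those layers keep their random initialization, which by itself could produce nonsensical output. A quick way to see what actually failed to load (a sketch, assuming trt_pose is importable and human_pose.json sits in the working directory):

```python
import json

import torch
import trt_pose.models

# Rebuild the resnet18_baseline_att model the checkpoint is supposed to match
# (channel counts come from the topology file, as in the trt_pose examples).
with open("human_pose.json") as f:
    topology = json.load(f)
num_parts = len(topology["keypoints"])
num_links = len(topology["skeleton"])
model = trt_pose.models.resnet18_baseline_att(num_parts, 2 * num_links)

# load_state_dict with strict=False returns the keys that did NOT match;
# anything under missing_keys stays randomly initialized in the model.
result = model.load_state_dict(
    torch.load("resnet18_baseline_att_224x224_A_epoch_249.pth", map_location="cpu"),
    strict=False,
)
print("Missing keys:", result.missing_keys)
print("Unexpected keys:", result.unexpected_keys)
```

If either list is non-empty, the exported ONNX can’t behave like the trained network, no matter how the export itself goes.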

This method also produces nonsensical (unusable) results.

So, to recap: I tried both the conversion method from my first post in this thread and the alternative method I just described. Alongside them, I tried deleting the .engine file when it existed, and I also tried converting both the resnet and the densenet model weights. None of it worked.

I also have the same problem, and I have no idea how to solve it. Have you solved this problem?

Not as of now. I’ve been busy with other stuff. Tagging @AastaLLL to see if he can chip in on the solution.

Hi both,

We guess this issue occurs due to a difference in model architecture.
Different pose estimation models may assign different semantic meanings to their outputs,
and you will need to update the parser accordingly.

For example, you can find the default parser and underlying function below:

The default model architecture is defined in this script:
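
One way to check for such a mismatch is to compare the graph outputs of the reference model against the converted one, for example with the onnx Python package (a sketch; the file names are placeholders for your local paths):

```python
import onnx

# Print output names and shapes of the reference model and the self-converted
# one; the DeepStream parser assumes the layout of the reference file.
for path in ["pose_estimation.onnx", "resnet18_baseline_att_224x224_A_epoch_249.onnx"]:
    m = onnx.load(path)
    print(path)
    for out in m.graph.output:
        dims = [d.dim_value for d in out.type.tensor_type.shape.dim]
        print("  output:", out.name, "shape:", dims)
```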

Thanks.

Hey @AastaLLL,

I am using the default resnet18_baseline_att model architecture, and I downloaded the weights from the trt_pose GitHub repository. So I don’t know why I’m getting this behavior, since my model architecture is the default one.

Hi,

Thanks for your patience.
We are still reproducing this issue internally. Will share more information with you later.

Hi,
I converted the Torch model to ONNX myself using torch.onnx.export and then generated the engine via trtexec. But I find that the result is worse than when I run the model with Torch. How can I increase the accuracy?
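
For reference, my export is roughly along these lines (a sketch; the trt_pose model name, output names, and opset version are my choices, not necessarily what the official script uses):

```python
import json

import torch
import trt_pose.models

# Build the model and load the trained checkpoint (paths are illustrative).
with open("human_pose.json") as f:
    topology = json.load(f)
model = trt_pose.models.resnet18_baseline_att(
    len(topology["keypoints"]), 2 * len(topology["skeleton"])
).eval()
model.load_state_dict(
    torch.load("resnet18_baseline_att_224x224_A_epoch_249.pth", map_location="cpu")
)

# Export with an input matching the 224x224 training resolution. The model
# returns (cmap, paf), i.e. keypoint heatmaps and part affinity fields.
dummy = torch.zeros(1, 3, 224, 224)
torch.onnx.export(
    model,
    dummy,
    "resnet18_baseline_att_224x224_A_epoch_249.onnx",
    input_names=["input"],
    output_names=["heatmap", "part_affinity_fields"],
    opset_version=11,
)
```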

Hi all,

We can reproduce this issue in our environment
and are now discussing the root cause internally.

Will update more information later.
Thanks.

Hi,

Thanks for your patience.
Please give this script a try: export_for_isaac.py (8.4 KB)

$ python3 export_for_isaac.py --input_checkpoint resnet18_baseline_att_224x224_A_epoch_249.pth --input_topology ../../tasks/human_pose/human_pose.json

We can get the correct result with the ONNX file generated by the above script.
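
As a quick sanity check before dropping the file into the DeepStream config, something like the following should pass cleanly (a sketch using the onnx Python package; the file name matches the command above):

```python
import onnx

# Structural validation of the exported model; raises if the graph is malformed.
m = onnx.load("resnet18_baseline_att_224x224_A_epoch_249.onnx")
onnx.checker.check_model(m)
print([out.name for out in m.graph.output])
```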

Thanks.

Hi,

Thanks for replying. The error was on my side, in my test. I think I have to retrain the model for better accuracy.

Hi,

Good to know you found the cause.
After you retrain a new model, you can follow the script shared on Feb 9 for the conversion.

Thanks.

This fixed the problem for me too. Will this version be included in the next Jetson Nano image or in the TensorRT Git repo?

Hi,
When using the uploaded export_for_isaac.py for densenet, I get RuntimeError: Invalid name: '262'. How do I modify the script so it works for densenet?
Also, how do I modify the deepstream_pose_estimation_config.txt file to run DeepStream pose estimation for densenet?

```
Traceback (most recent call last):
  File "export_for_isaac.py", line 182, in <module>
    main(args)
  File "export_for_isaac.py", line 131, in main
    input_names=input_names, output_names=output_names)
  File "/usr/local/lib/python3.6/dist-packages/torch/onnx/__init__.py", line 230, in export
    custom_opsets, enable_onnx_checker, use_external_data_format)
  File "/usr/local/lib/python3.6/dist-packages/torch/onnx/utils.py", line 91, in export
    use_external_data_format=use_external_data_format)
  File "/usr/local/lib/python3.6/dist-packages/torch/onnx/utils.py", line 639, in _export
    dynamic_axes=dynamic_axes)
  File "/usr/local/lib/python3.6/dist-packages/torch/onnx/utils.py", line 435, in _model_to_graph
    _set_input_and_output_names(graph, input_names, output_names)
  File "/usr/local/lib/python3.6/dist-packages/torch/onnx/utils.py", line 712, in _set_input_and_output_names
    set_names(list(graph.outputs()), output_names, 'output')
  File "/usr/local/lib/python3.6/dist-packages/torch/onnx/utils.py", line 710, in set_names
    node.setDebugName(name)
RuntimeError: Invalid name: '262'
```

I arbitrarily changed output_names = ["262", "264"] to output_names = ["part_affinity_fields", "heatmap"] in the script, and the engine was created.
I also changed the onnx-file and model-engine-file entries of the config file, and the app ran normally.
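
For anyone hitting the same error: an alternative to editing the export script would be renaming the output tensors of the already-exported file in place (a sketch with the onnx Python package; the densenet file name is hypothetical, and the numeric names are the ones from the error above):

```python
import onnx

# Map the auto-generated numeric tensor names to the names the config expects.
rename = {"262": "part_affinity_fields", "264": "heatmap"}

m = onnx.load("densenet_pose.onnx")  # hypothetical file name
for node in m.graph.node:
    node.input[:] = [rename.get(name, name) for name in node.input]
    node.output[:] = [rename.get(name, name) for name in node.output]
for out in m.graph.output:
    out.name = rename.get(out.name, out.name)
onnx.save(m, "densenet_pose_renamed.onnx")
```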

Please let me know if there’s anything I should change to increase the performance.

Hi jhkim2,

Please open a new topic if this is still an issue. Thanks.