Real time human pose estimation on Jetson Nano (22FPS)

Hi All,

I’m happy to share this human pose detection project we’ve been working on. One of the models runs at 22FPS on Jetson Nano.

We think the performance is sufficient for many cool Jetson Nano applications that we hope you will build.

You can get started immediately by following the Jupyter Notebook live demo (see the README).

Look forward to seeing what you come up with :)

Best,
John


I’m impressed…that includes the dancing!

I don’t have the right camera for it, but I’d like to create something similar to capture a human hand on a mouse and joystick. The goal would be to create an icon display of mouse/joystick movements for inclusion on tutorial style videos or as an overlay to video capture, but without the main operating system having to take part. Basically an independent camera/Xavier edge computing/streaming icons and/or events of slight hand motions.

The main problem is that the camera probably needs to be somewhat specialized for close-in, macro-style movements with fine details. For this particular demo, what is the smallest movement your capture can identify? What happens if you do a close-up of tiny mouse movements? Imagine someone gaming and making fast, light twitches with mouse and joystick…what kind of camera would you need in place of your existing one to see those fine touches?

Btw, if you’ve ever seen the comedy movie “Galaxy Quest”, where the aliens create a ship with controls based on the movements they saw on a fictional show, then that is the basic idea: the ability to capture rapid, tiny movements by computer. Unfortunately I don’t think a regular stereo camera can capture tiny rapid movements as small as 0.1 mm (or less).

Thank you! The dance moves were the real challenge here :)

And cool! Sounds like an interesting project.

This uses a monocular camera, so the absolute precision depends largely on the image resolution and the distance to the object.

It’s hard to say exactly what camera is necessary; I imagine it would require some experimentation. In addition to the image resolution, you can also experiment with different neural network architectures to trade off accuracy against speed.
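
As a rough sketch of how resolution and distance interact, assuming an ideal pinhole camera with no lens distortion: the scene width covered by one pixel grows with distance and shrinks with horizontal resolution. The function name and the example numbers (60 degree horizontal field of view, 1280 pixels wide, 300 mm away) are illustrative assumptions, not measurements from this project.

```python
import math


def mm_per_pixel(distance_mm, hfov_deg, width_px):
    """Approximate scene width (in mm) covered by one pixel at a given distance.

    Assumes an ideal pinhole camera looking straight at a flat scene.
    """
    # Total scene width visible at this distance for the given field of view.
    scene_width_mm = 2.0 * distance_mm * math.tan(math.radians(hfov_deg) / 2.0)
    return scene_width_mm / width_px


# Hypothetical camera: 60 degree horizontal FOV, 1280 px wide, 300 mm from the hand.
print(round(mm_per_pixel(300, 60, 1280), 3))
```

With these assumed numbers one pixel covers about 0.27 mm. If the frame is resized to a 224-pixel-wide network input, the same scene works out to roughly 1.5 mm per pixel, so movements much smaller than that are unlikely to be resolved.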

Let me know if you have any questions!

Best,
John

Would it be correct to say that if one were pointing a monocular camera straight down on a hand with a mouse, that minor movement of a finger down to press a button would probably not show up? My thought is that stereo is needed, and perhaps close up lenses or wider than normal ocular separation would be required. From what I can tell, something like the Zed stereo camera does not have that close up high res ability…it’s designed for greater distances…probably in need of a wider separation of cameras.

Note: This would require low-latency detection of when the hand pushes a button, and also a very precise estimate of lateral movement. It seems the monocular version will have no knowledge of depth.

Hi John,
When I run your live_demo.ipynb in Jupyter Notebook, it shows the error message "No module of trt_pose.coco" along with other errors. Could you please send me a complete, working set of files for the trt_pose project so I can learn from your wonderful project?
Many thanks,
Francis

Hi,
Can anyone tell me how to troubleshoot the following errors in Jupyter Notebook when I try to run the trt_pose project?

  1. Error: No "trt_pose.coco"
  2. AttributeError Traceback (most recent call last)
    in ()
    3 MODEL_WEIGHTS = 'resnet18_baseline_att_224x224_A_epoch_249.pth'
    4
    ----> 5 model.load_state_dict(torch.load(MODEL_WEIGHTS))

AttributeError: 'module' object has no attribute 'load_state_dict'

Thanks,
Francis

Hi @jaybdub, this project looks amazing. It’s the best pose estimation model on the Jetson series thus far!

I would like to train on my own data. Your training script, train.py, needs a config.json file to perform training. Could you provide the config file that you used to train your model? Thank you!

Hi jaybdub,

I’d like to try this on our own device. The aim of our project is human pose estimation, just like yours, but we also want to create a simulation that replicates the motions of the person in front of the camera. Which programs do we have to install on the Jetson Nano?

I’m trying to run this on my GeForce 1060 laptop, using the PyTorch 20.06 NGC Container, but I’m receiving this error message in the first cell of live_demo.ipynb.

import json
import trt_pose.coco

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-1-9d995d854a02> in <module>
      1 import json
----> 2 import trt_pose.coco

/opt/conda/lib/python3.6/site-packages/trt_pose-0.0.1-py3.6-linux-x86_64.egg/trt_pose/coco.py in <module>
      7 import tqdm
      8 import trt_pose
----> 9 import trt_pose.plugins
     10 import glob
     11 import torchvision.transforms.functional as FT

ImportError: /opt/conda/lib/python3.6/site-packages/trt_pose-0.0.1-py3.6-linux-x86_64.egg/trt_pose/plugins.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN6caffe28TypeMeta21_typeMetaDataInstanceIN3c108BFloat16EEEPKNS_6detail12TypeMetaDataEv

Thank you, I was looking for this kind of example to get started. I would like to get the keypoints and coordinates of the body joints. Is that possible? The idea is to build a system that raises an alert when someone falls or behaves unnaturally. If you have any input, please share it.

Thanks.

@pmario.silva Hmm. This looks like perhaps trt_pose was built against a version of PyTorch different from what you’re currently using. I would try uninstalling trt_pose, and re-install from scratch.
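
One quick way to confirm whether the rebuilt extension loads is to try the imports outside the notebook. This is only a minimal diagnostic sketch: `check_import` is a hypothetical helper written for this post, not part of trt_pose, and it only reports whether each import succeeds and which version string the module exposes.

```python
import importlib


def check_import(module_name):
    """Try to import a module and return (ok, detail).

    detail is the module's __version__ if it has one, or the error text
    when the import fails (for example an undefined-symbol ImportError).
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        return False, str(exc)
    return True, getattr(module, "__version__", "no __version__ attribute")


for name in ("torch", "trt_pose.plugins"):
    ok, detail = check_import(name)
    print(f"{name}: {'OK' if ok else 'FAILED'} ({detail})")
```

If `torch` imports but `trt_pose.plugins` still fails with an undefined symbol, that would be consistent with the extension having been compiled against a different PyTorch build.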

@salmanfaris Hi salmanfaris, you can get the 2D body keypoints from the output of the model. Please let me know if you have questions on how to do this. While pose is particularly useful because it offers a nice programmatic interface (point locations) and also abstracts away the visual variation between different people, sometimes an end-to-end approach, like training a classification model, may be more robust and easier to improve continually. This is particularly true if the problem is visually simple. If you want to learn how to train your own model, I’d check out the JetBot project. I encourage you to explore which approach fits your application best.

Please let me know if this helps or you have any questions.

Best,
John


Were you able to get this working?

Hi, what is the best way to get the 2D body keypoints from the output of the model? Thank you.

Hi jgoldman,

Thanks for reaching out!

This notebook demonstrates how to run the model and draw keypoints.

The 2D keypoints are parsed from the neural network output using the “ParseObjects” function. This returns

  • object_counts: The number of people per image (a tensor of size (Num Images)).
  • objects: A (Num Images)x(Number of People)x(Number of Body Part Types) matrix. The values in this matrix are keypoint indices into the next tensor, and are -1 if the keypoint doesn’t exist for that person.
  • normalized_peaks: A (Num Images)x(Number of Body Part Types)x(Maximum Num Possible Keypoints)x(2) tensor containing the keypoint locations in normalized image coordinates [0,1].

For example, “left_eye” is the keypoint with type index=1. (See this for details https://github.com/NVIDIA-AI-IOT/trt_pose/blob/master/tasks/human_pose/human_pose.json).

To get the left eye for the person with index=0, we would do:

image_idx = 0
# height and width are the image size in pixels (defined elsewhere)

if object_counts[image_idx] > 0:
    # at least one person was detected in the first image
    person_idx = 0
    left_eye_type_idx = 1
    left_eye_idx = int(objects[image_idx, person_idx, left_eye_type_idx])
    if left_eye_idx >= 0:  # -1 means the keypoint is missing; 0 is a valid index
        # the person has a left eye
        left_eye_location = normalized_peaks[image_idx, left_eye_type_idx, left_eye_idx, :]  # (row, col)
        y, x = left_eye_location[0], left_eye_location[1]
        # scale normalized [0, 1] coordinates to pixel coordinates
        y_pixels, x_pixels = y * height, x * width

You may find these helpful. Apologies that there isn’t currently a helper function to do this parsing into a more intuitive format.
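
In the meantime, a helper along these lines could do the parsing. This is only a sketch, not part of trt_pose: `extract_keypoints` is a hypothetical name, `names` is assumed to be the keypoint name list from human_pose.json, and the three inputs are assumed to have the shapes described above. It uses plain indexing so it can be tried with nested lists as well as tensors.

```python
def extract_keypoints(object_counts, objects, normalized_peaks, names,
                      width, height, image_idx=0):
    """Return one {keypoint_name: (x_pixels, y_pixels)} dict per detected person."""
    people = []
    for person_idx in range(int(object_counts[image_idx])):
        keypoints = {}
        for type_idx, name in enumerate(names):
            peak_idx = int(objects[image_idx][person_idx][type_idx])
            if peak_idx < 0:
                continue  # -1 means this keypoint was not found for this person
            # Peaks are stored as (row, col) in normalized [0, 1] coordinates.
            y, x = normalized_peaks[image_idx][type_idx][peak_idx]
            keypoints[name] = (float(x) * width, float(y) * height)
        people.append(keypoints)
    return people


# Tiny made-up example: one image, one person, two keypoint types.
names = ["nose", "left_eye"]
object_counts = [1]
objects = [[[0, -1]]]                               # nose is peak 0, left_eye missing
normalized_peaks = [[[[0.5, 0.25]], [[0.0, 0.0]]]]  # image x type x peak x (row, col)

print(extract_keypoints(object_counts, objects, normalized_peaks, names, 640, 480))
```

Each person comes back as a dictionary keyed by keypoint name, with missing keypoints simply left out, which may be easier to work with for something like fall detection.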


Please let me know if this helps or you have any questions.

Best,
John

Hi,

Is it possible to integrate trt_pose into DeepStream, overlay the pose on the video stream, and output the result? And then use the keypoints to identify different scenarios?

Thanks


Fantastic… can I use this with DeepStream?

This was extremely helpful, thank you so much :)
Could you offer guidance on how to use a video file as input instead of a live camera feed?

Thanks for the code, @jaybdub. It helped me get this output: https://youtu.be/B2kySE1xxnE
