Hi Nvidia team,
I'm working with the trt_pose git repository (https://github.com/NVIDIA-AI-IOT/trt_pose), which is amazing, on a Jetson Xavier with ROS, and I would like to ask you some questions about it:
- How did you get 251 FPS? The maximum I can get is 30 FPS using jetson_clocks.
- How can I use bigger models like Resnet50? Are you planning to provide them?
- Will the performance of the Resnet18 model decrease if I increase the resolution to 640x480?
- How can I improve the detection of human keypoints during movement or at long distances?
Thanks for reaching out!
Question 1 - Framerate
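This was measured a while ago using the following code:
import time

# model_trt and input are assumed to be defined already
t0 = time.time()
for i in range(50):
    output = model_trt(input)
t1 = time.time()
print(50.0 / (t1 - t0))  # average frames per second over 50 runs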
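If it helps when comparing numbers, here is a slightly more defensive version of the timing loop above. This is only a sketch, not the script used for the published figure: `measure_fps`, `run_once`, `n_warmup` and `sync` are names made up for illustration. It adds a few warm-up runs and an optional synchronization hook, on the assumption that GPU work is launched asynchronously and the clock should only be read once that work has finished.

```python
import time


def measure_fps(run_once, n_iters=50, n_warmup=10, sync=None):
    """Return the average number of calls per second for run_once().

    run_once : zero-argument callable that performs one inference
    sync     : optional zero-argument callable that blocks until pending
               GPU work is done (hypothetical hook, e.g. a CUDA synchronize)
    """
    # Warm-up runs so one-time setup costs are not counted.
    for _ in range(n_warmup):
        run_once()
    if sync is not None:
        sync()

    t0 = time.perf_counter()
    for _ in range(n_iters):
        run_once()
    if sync is not None:
        sync()  # wait for queued work before reading the clock
    t1 = time.perf_counter()
    return n_iters / (t1 - t0)


# Stand-in workload so the sketch runs anywhere; replace with the real model call.
fps = measure_fps(lambda: sum(range(1000)))
print(f"{fps:.1f} calls per second")
```

With PyTorch available, the call would look something like `measure_fps(lambda: model_trt(input), sync=torch.cuda.synchronize)`. Note that this measures the model alone; camera capture, preprocessing, and ROS message handling are not included, so an end-to-end pipeline will report a lower number.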
30 FPS sounds low. Do you mind sharing your system configuration?
- What power mode is the Jetson in (you can find this with
- What version of JetPack are you using?
Question 2 - Bigger models
Currently, there are no plans to provide bigger models. The goal so far has been usable accuracy within several meters at high framerates. However, I’m curious what issue you’re running into with the existing models. This would help me better understand use cases we may not be addressing.
Question 3 - Larger resolutions
Yes, you can expect the framerate to drop as you increase the resolution. Depending on your application, this may be acceptable. For example, if you’re monitoring a larger group of individuals farther away and extremely high framerates are not necessary, you may benefit from increasing the resolution.
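For a very rough feel for the size of the drop, you can compare pixel counts. The sketch below assumes the 251 FPS figure corresponds to a 224x224 input and that inference cost grows linearly with the number of pixels. Both are assumptions made for illustration, not measurements, and `pixel_ratio` is a made-up helper name.

```python
def pixel_ratio(new_wh, base_wh):
    """Ratio of pixel counts between two (width, height) resolutions."""
    return (new_wh[0] * new_wh[1]) / (base_wh[0] * base_wh[1])


base_resolution = (224, 224)  # assumed baseline input size
new_resolution = (640, 480)   # resolution asked about in the question
base_fps = 251.0              # figure quoted in the question

ratio = pixel_ratio(new_resolution, base_resolution)
estimated_fps = base_fps / ratio  # only valid under the linear-scaling assumption

print(f"{ratio:.1f}x more pixels, roughly {estimated_fps:.0f} FPS if cost scaled linearly")
```

640x480 has about 6.1 times as many pixels as 224x224, so under these assumptions the model alone would land somewhere around 41 FPS. Treat this as an order-of-magnitude guess; the real number depends on the hardware and is best measured directly.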
Question 4 - Accuracy during movement
One way to improve the accuracy over time may be to use a Kalman filter for each keypoint:
- Perform the Kalman prediction step for each keypoint. This uses a motion model for the keypoint to estimate its position in the new frame.
- Detect keypoints for all individuals in the current frame.
- Match the new detections to the detections from the previous frame.
- Perform the Kalman update step for each keypoint that has a match in the new frame. This refines the estimate of the keypoint in the current frame. Keypoints without a match rely on the prediction from the previous frame.
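To make the predict and update steps above concrete, here is a minimal sketch for a single keypoint. It assumes a constant-velocity motion model with x and y filtered independently, and the noise variances are placeholder values that would need tuning. `Kalman1D` and `KeypointTrack` are names made up for this example, and the matching step is left out: the sketch assumes each detection has already been assigned to its track.

```python
class Kalman1D:
    """Constant-velocity Kalman filter for one coordinate of one keypoint."""

    def __init__(self, pos, pos_var=25.0, vel_var=100.0,
                 process_var=1.0, meas_var=4.0):
        self.p = float(pos)  # position estimate (pixels)
        self.v = 0.0         # velocity estimate (pixels per frame)
        self.P = [[pos_var, 0.0], [0.0, vel_var]]  # state covariance
        self.q = process_var  # process noise variance (placeholder value)
        self.r = meas_var     # measurement noise variance (placeholder value)

    def predict(self, dt=1.0):
        """Move the state forward one frame: P = F P F^T + Q, F = [[1, dt], [0, 1]]."""
        self.p += self.v * dt
        (a, b), (c, d) = self.P
        self.P = [[a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
                  [c + dt * d, d + self.q]]
        return self.p

    def update(self, z):
        """Correct the state with a measured position z (H = [1, 0])."""
        (a, b), (c, d) = self.P
        s = a + self.r          # innovation variance
        k0, k1 = a / s, c / s   # Kalman gain for position and velocity
        y = z - self.p          # innovation (measurement minus prediction)
        self.p += k0 * y
        self.v += k1 * y
        # P = (I - K H) P
        self.P = [[(1 - k0) * a, (1 - k0) * b],
                  [c - k1 * a, d - k1 * b]]
        return self.p


class KeypointTrack:
    """One keypoint tracked with independent filters for x and y."""

    def __init__(self, x, y):
        self.fx, self.fy = Kalman1D(x), Kalman1D(y)

    def predict(self):
        return self.fx.predict(), self.fy.predict()

    def update(self, x, y):
        return self.fx.update(x), self.fy.update(y)


# Synthetic example: a keypoint moving 2 px per frame in x, with one missed detection.
track = KeypointTrack(10.0, 50.0)
detections = [(10.0 + 2.0 * t, 50.0) for t in range(1, 21)]
detections[9] = None  # pretend the detector missed this frame

for det in detections:
    estimate = track.predict()          # prediction step for the new frame
    if det is not None:
        estimate = track.update(*det)   # update step when a match exists
    # otherwise the predicted position is used as the estimate

print(estimate)
```

In a real pipeline you would keep one `KeypointTrack` per keypoint per person, and the match between new detections and existing tracks (for example by nearest predicted position) decides which track each detection updates.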
This is one way to incorporate data from previous frames. I’m really not sure how well this would work in this context, though, since I haven’t tested it.
Please let me know if this helps or you have any other questions.