Pose estimation using TRT (trt_pose) - slightly lower framerates than stated in inference


I recently stumbled upon NVIDIA’s repo implementing accelerated pose estimation using TensorRT (GitHub - NVIDIA-AI-IOT/trt_pose: Real-time pose estimation accelerated with NVIDIA TensorRT). I made a stripped-down C++ version of this implementation by extracting and serializing the TensorRT engine from the torch2trt output and running inference on it directly from C++.

The inference works well, but I notice that the inference FPS is slightly lower than stated in the repo. I am running the inference on a Jetson Nano using the resnet18_baseline_att_224x224 model, which should, according to the repo, run at 22 FPS (excluding pre- and post-processing, I assume). When I measure the time it takes to copy the input to the device buffers, run inference, and copy the output back to the host buffers, I get about 18 FPS instead of 22. Below is the code that I timed:

// input should be formatted as CHW and RGB while Mat objects are formatted as HWC and BGR.
	// therefore copy to buffer one channel after another in RGB order
	NV_CUDA_CHECK(cudaMemcpyAsync((float*)device_buffers[0], output_channels[2].data,
									config.input_size.area() * sizeof(float), cudaMemcpyHostToDevice,
									cuda_stream));
	NV_CUDA_CHECK(cudaMemcpyAsync((float*)device_buffers[0] + config.input_size.area(), output_channels[1].data,
									config.input_size.area() * sizeof(float), cudaMemcpyHostToDevice,
									cuda_stream));
	NV_CUDA_CHECK(cudaMemcpyAsync((float*)device_buffers[0] + 2 * config.input_size.area(), output_channels[0].data,
									config.input_size.area() * sizeof(float), cudaMemcpyHostToDevice,
									cuda_stream));

	// do the inference
	execution_context->enqueue(1, device_buffers, cuda_stream, nullptr);

	// copy output from device buffer to host buffer
	NV_CUDA_CHECK(cudaMemcpyAsync(output0_host_buffer, device_buffers[1],
								  config.num_part_types * config.output_map_size.area() * sizeof(float),
								  cudaMemcpyDeviceToHost, cuda_stream));
	NV_CUDA_CHECK(cudaMemcpyAsync(output1_host_buffer, device_buffers[2],
								  2 * config.num_link_types * config.output_map_size.area() * sizeof(float),
								  cudaMemcpyDeviceToHost, cuda_stream));

	// block until all GPU-related operations have ended for this inference
	NV_CUDA_CHECK(cudaStreamSynchronize(cuda_stream));
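
For reference, the channel-by-channel copy above amounts to a layout conversion from HWC/BGR to CHW/RGB. The following is a minimal NumPy sketch of the same reordering on the host side. It is not part of the C++ code, and the function name `hwc_bgr_to_chw_rgb` is hypothetical, chosen only for illustration.

```python
import numpy as np

def hwc_bgr_to_chw_rgb(image):
    """Convert an HxWx3 BGR image into a contiguous 3xHxW RGB float32 array."""
    rgb = image[:, :, ::-1]             # reverse the channel axis: BGR -> RGB
    chw = np.transpose(rgb, (2, 0, 1))  # move channels first: HWC -> CHW
    return np.ascontiguousarray(chw, dtype=np.float32)

# Tiny 2x2 "image" whose B, G and R planes hold 0, 1 and 2 respectively.
bgr = np.zeros((2, 2, 3), dtype=np.float32)
bgr[:, :, 1] = 1.0
bgr[:, :, 2] = 2.0

chw = hwc_bgr_to_chw_rgb(bgr)
print(chw.shape)     # channels-first shape
print(chw[0, 0, 0])  # first plane now holds the former R channel
```

With a contiguous CHW buffer like this, the three host-to-device copies could in principle be issued as a single copy, although whether that changes the measured time is something I have not verified.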

Is there a way to squeeze those missing 4 FPS out of the network? I am not a high-performance guy so I may be blind to some inefficiencies in my code.
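
To put the gap in concrete terms, the two frame rates can be converted into time per frame. This is a small sketch of that arithmetic only; the helper name `frame_time_ms` is made up for illustration.

```python
def frame_time_ms(fps):
    """Return the time spent per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps

stated_ms = frame_time_ms(22.0)    # figure quoted in the repo
measured_ms = frame_time_ms(18.0)  # figure I measured
gap_ms = measured_ms - stated_ms

print(f"stated: {stated_ms:.1f} ms, measured: {measured_ms:.1f} ms, gap: {gap_ms:.1f} ms")
# prints "stated: 45.5 ms, measured: 55.6 ms, gap: 10.1 ms"
```

So the missing 4 FPS corresponds to roughly 10 ms of extra time per frame.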




Sorry for the late reply.
Let me check this and get back to you soon.



The GitHub repo already contains a demo script. Would you mind trying it first?

For best performance, you will need to run the following commands to maximize the device clocks.

sudo nvpmodel -m 0
sudo jetson_clocks


Tried it now, same performance.

This was the first thing I tried. I ran it in a Jupyter Notebook; it started slowly and then became so slow that it froze my computer (Jetson Nano). I therefore moved the code into a plain Python file. The benchmark code did give me 22 FPS. However, when I used the CSICamera module for my CSI camera, it failed to open with the following message:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/jetcam-0.0.0-py3.6.egg/jetcam/csi_camera.py", line 24, in __init__
RuntimeError: Could not read image from camera.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "live_demo.py", line 61, in <module>
    camera = CSICamera(width=WIDTH, height=HEIGHT, capture_width=1280, capture_height=720, capture_fps=60)
  File "/usr/local/lib/python3.6/dist-packages/jetcam-0.0.0-py3.6.egg/jetcam/csi_camera.py", line 27, in __init__
RuntimeError: Could not initialize camera.  Please see error trace.

Looking at the CSICamera class, I see it is based on opening a GStreamer pipeline through OpenCV. It should be noted that in my C++ version I open a GStreamer pipeline through OpenCV and it works fine. Here’s the Python script that I ran, based on the IPython notebook (inference only, with the optimized torch model already created):

import json
import trt_pose.coco
import trt_pose.models
import torch
import torch2trt
import time
import cv2
import torchvision.transforms as transforms
import PIL.Image
import ipywidgets

from torch2trt import TRTModule
from trt_pose.draw_objects import DrawObjects
from trt_pose.parse_objects import ParseObjects
from jetcam.csi_camera import CSICamera
from jetcam.utils import bgr8_to_jpeg
from IPython.display import display

WIDTH = 224
HEIGHT = 224

OPTIMIZED_MODEL = 'resnet18_baseline_att_224x224_A_epoch_249_trt.pth'

model_trt = TRTModule()
# load the already optimized TensorRT weights
model_trt.load_state_dict(torch.load(OPTIMIZED_MODEL))

data = torch.zeros((1, 3, HEIGHT, WIDTH)).cuda()

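# synchronize so that the timer covers all queued GPU work
torch.cuda.current_stream().synchronize()
t0 = time.time()
for i in range(50):
    y = model_trt(data)
torch.cuda.current_stream().synchronize()
t1 = time.time()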

print(50.0 / (t1 - t0))

mean = torch.Tensor([0.485, 0.456, 0.406]).cuda()
std = torch.Tensor([0.229, 0.224, 0.225]).cuda()
device = torch.device('cuda')

with open('human_pose.json', 'r') as f:
    human_pose = json.load(f)

topology = trt_pose.coco.coco_category_to_topology(human_pose)

def preprocess(image):
    global device
    device = torch.device('cuda')
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    image = PIL.Image.fromarray(image)
    image = transforms.functional.to_tensor(image).to(device)
    image.sub_(mean[:, None, None]).div_(std[:, None, None])
    return image[None, ...]

parse_objects = ParseObjects(topology)
draw_objects = DrawObjects(topology)

camera = CSICamera(width=WIDTH, height=HEIGHT, capture_fps=30)

camera.running = True

image_w = ipywidgets.Image(format='jpeg')


def execute(change):
    image = change['new']
    data = preprocess(image)
    cmap, paf = model_trt(data)
    cmap, paf = cmap.detach().cpu(), paf.detach().cpu()
    counts, objects, peaks = parse_objects(cmap, paf)#, cmap_threshold=0.15, link_threshold=0.15)
    draw_objects(image, counts, objects, peaks)
    image_w.value = bgr8_to_jpeg(image[:, ::-1, :])

execute({'new': camera.value})

camera.observe(execute, names='value')
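
To isolate the camera failure from the rest of the script, the GStreamer pipeline can be opened directly through OpenCV. The sketch below builds a pipeline string of the kind commonly used for CSI cameras on Jetson. The element names (`nvarguscamerasrc`, `nvvidconv`) and caps are my assumption of a typical setup and should be compared against the string in jetcam's csi_camera.py; the helper names `build_csi_pipeline` and `try_open` are hypothetical.

```python
def build_csi_pipeline(width=224, height=224, capture_width=1280,
                       capture_height=720, capture_fps=60, sensor_id=0):
    """Build a GStreamer pipeline string for a Jetson CSI camera (assumed layout)."""
    return (
        f"nvarguscamerasrc sensor-id={sensor_id} ! "
        f"video/x-raw(memory:NVMM), width={capture_width}, height={capture_height}, "
        f"format=(string)NV12, framerate=(fraction){capture_fps}/1 ! "
        f"nvvidconv ! video/x-raw, width=(int){width}, height=(int){height}, "
        f"format=(string)BGRx ! videoconvert ! "
        f"video/x-raw, format=(string)BGR ! appsink"
    )

def try_open(pipeline):
    """Return True if OpenCV can open the pipeline and read one frame."""
    import cv2  # imported lazily so building the string does not need OpenCV
    cap = cv2.VideoCapture(pipeline, cv2.CAP_GSTREAMER)
    ok = False
    if cap.isOpened():
        ok, _frame = cap.read()
    cap.release()
    return bool(ok)

pipeline = build_csi_pipeline()
print(pipeline)
# On the Jetson itself: print(try_open(pipeline))
```

If `try_open` returns False in Python while the same pipeline opens from C++, that would point at the Python OpenCV build rather than at the camera.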


Let me check this internally and get back to you soon.



Sorry for the late reply.
It looks like the issue has now turned into how to open your camera from OpenCV.

Is your camera a CSI-based camera?
If yes, could you first check whether it supports a 224x224 resolution?

v4l2-ctl -l


Yes, I have a CSI camera.

v4l2-ctl -l gives the following output:

Camera Controls

                     group_hold 0x009a2003 (bool)   : default=0 value=0 flags=execute-on-write
                    sensor_mode 0x009a2008 (int64)  : min=0 max=5 step=1 default=0 value=4 flags=slider
                           gain 0x009a2009 (int64)  : min=16 max=170 step=1 default=16 value=101 flags=slider
                       exposure 0x009a200a (int64)  : min=13 max=683709 step=1 default=2495 value=8333 flags=slider
                     frame_rate 0x009a200b (int64)  : min=2000000 max=120000000 step=1 default=120000000 value=60000003 flags=slider
                    bypass_mode 0x009a2064 (intmenu): min=0 max=1 default=0 value=1
                override_enable 0x009a2065 (intmenu): min=0 max=1 default=0 value=1
                   height_align 0x009a2066 (int)    : min=1 max=16 step=1 default=1 value=1
                     size_align 0x009a2067 (intmenu): min=0 max=2 default=0 value=0
               write_isp_format 0x009a2068 (bool)   : default=0 value=0
       sensor_signal_properties 0x009a2069 (u32)    : min=0 max=4294967295 step=1 default=0 [30][18] flags=read-only, has-payload
        sensor_image_properties 0x009a206a (u32)    : min=0 max=4294967295 step=1 default=0 [30][16] flags=read-only, has-payload
      sensor_control_properties 0x009a206b (u32)    : min=0 max=4294967295 step=1 default=0 [30][34] flags=read-only, has-payload
              sensor_dv_timings 0x009a206c (u32)    : min=0 max=4294967295 step=1 default=0 [30][16] flags=read-only, has-payload
               low_latency_mode 0x009a206d (bool)   : default=0 value=0
                   sensor_modes 0x009a2082 (int)    : min=0 max=30 step=1 default=30 value=5 flags=read-only

As I stated before, the framerate I measured was of the inference only, not including grabbing the frame or any pre-processing or post-processing.


Sorry for the late reply.

May I know how you installed your OpenCV Python package?
Do you use our default version, or did you rebuild it from source?

Also, could you print out the build information (support matrix) of OpenCV first?

>>> import cv2
>>> cv2.__version__
>>> print(cv2.getBuildInformation())
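
The build information is long, so it can help to pull out only the lines that matter here. The following is a rough sketch that extracts the GStreamer and CUDA entries from the text returned by `cv2.getBuildInformation()`. The helper name `summarize_build_info` is hypothetical, and the sample text is a shortened, made-up excerpt used only so the snippet runs without OpenCV installed.

```python
import re

def summarize_build_info(info):
    """Extract the GStreamer and CUDA lines from OpenCV build information text."""
    summary = {}
    for key in ("GStreamer", "NVIDIA CUDA"):
        pattern = r"^[ \t]*" + re.escape(key) + r":[ \t]*(.*?)[ \t]*$"
        match = re.search(pattern, info, re.MULTILINE)
        summary[key] = match.group(1) if match else "not listed"
    return summary

# Made-up excerpt for illustration. On the device, use instead:
#   import cv2
#   info = cv2.getBuildInformation()
sample = """
  Video I/O:
    GStreamer:                   YES (1.14.5)
  NVIDIA CUDA:                   YES (ver 10.2, CUFFT CUBLAS FAST_MATH)
"""

print(summarize_build_info(sample))
```

A build without CUDA typically has no "NVIDIA CUDA" line at all, which this sketch reports as "not listed".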



I installed OpenCV 4 with CUDA according to your script in https://github.com/AastaNV/JEP/blob/master/script/install_opencv4.1.1_Jetson.sh
When looking at /usr/local/lib and /usr/local/include/opencv4 I see the libraries and includes of OpenCV 4.

However, after running your Python commands, it shows that the OpenCV version is 3.2 and that it does not use CUDA. It is possible that my C++ implementation referred to OpenCV 4 and your python script referred to OpenCV 3.2 (and that I have two OpenCV installations. Yikes!). This can also explain why it was extremely slow in Python but reasonable in C++. I will write here the output of the Python commands later when I have access to my computer.

I tried to search for the OpenCV 3.2 installation but haven’t found it yet. How can I find the sources and includes?
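
One way to see where each Python copy lives is to scan the import path for anything named cv2, since Python loads the first match in `sys.path` order (and `cv2.__file__` shows which one was actually loaded). Below is a minimal sketch of that scan; the helper name `find_module_candidates` is made up for illustration, and it only finds the Python bindings, not the C++ headers or sources.

```python
import os
import sys

def find_module_candidates(name, search_paths):
    """List files or folders on the given paths that could be imported as `name`."""
    found = []
    for path in search_paths:
        if not os.path.isdir(path):
            continue  # skip empty entries, zip files and missing folders
        for entry in sorted(os.listdir(path)):
            # matches a package folder "cv2" or a module file such as "cv2.<tag>.so"
            if entry == name or entry.startswith(name + "."):
                found.append(os.path.join(path, entry))
    return found

# Candidates are printed in import-priority order.
for candidate in find_module_candidates("cv2", sys.path):
    print(candidate)
```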


It looks like there are several different OpenCV packages in your environment.
Please make sure you have added the compiled Python package to PYTHONPATH first.

echo 'export PYTHONPATH=$PYTHONPATH:[<Install Folder>]/opencv-4.1.1/release/python_loader/' >> ~/.bashrc
source ~/.bashrc


Would you mind sharing your full code for the C++ equivalent of trt_pose?

I’m trying to achieve the same thing (after being disappointed with the OpenCV performance)…

@yinondou could you please share your test code? I rewrote the demo notebook as a Python file, but the process gets killed when I save the model. I am testing on a Jetson Nano with PyTorch 1.6.0. Thanks.