Pose estimation using TRT (trt_pose) - slightly lower framerates than stated in inference

yinondou · December 16, 2019, 10:18pm

Hi,

I recently stumbled upon NVIDIA’s repo implementing accelerated pose estimation using TensorRT (GitHub - NVIDIA-AI-IOT/trt_pose: Real-time pose estimation accelerated with NVIDIA TensorRT). I made a stripped down C++ version of this implementation by extracting and serializing the TensorRT engine from the torch2trt output and running inference on it directly from C++.

The inference works well, but I notice that the inference FPS is slightly lower than stated in the repo. I am running the inference on a Jetson Nano using the resnet18_baseline_att_224x224 model which should, according to the repo, run at 22 FPS (excluding pre and post-processing I assume). When I measure the the time it takes to copy the input to the device buffers, run inference and copy the output back to the host buffers I get about 18 FPS instead of 22. Below is the code that I timed:

// input should be formatted as CHW and RGB while Mat objects are formatted as HWC and BGR.
	// therefore copy to buffer one channel after another in RGB order
	NV_CUDA_CHECK(cudaMemcpyAsync((float*)device_buffers[0], output_channels[2].data,
									config.input_size.area() * sizeof(float), cudaMemcpyHostToDevice,
									cuda_stream));
	NV_CUDA_CHECK(cudaMemcpyAsync((float*)device_buffers[0] + config.input_size.area(), output_channels[1].data,
									config.input_size.area() * sizeof(float), cudaMemcpyHostToDevice,
									cuda_stream));
	NV_CUDA_CHECK(cudaMemcpyAsync((float*)device_buffers[0] + 2 * config.input_size.area(), output_channels[0].data,
									config.input_size.area() * sizeof(float), cudaMemcpyHostToDevice,
									cuda_stream));

	// do the inference
	execution_context->enqueue(1, device_buffers, cuda_stream, nullptr);

	// copy output from device buffer to host buffer
	NV_CUDA_CHECK(cudaMemcpyAsync(output0_host_buffer, device_buffers[1],
								  config.num_part_types * config.output_map_size.area() * sizeof(float),
								  cudaMemcpyDeviceToHost, cuda_stream));
	NV_CUDA_CHECK(cudaMemcpyAsync(output1_host_buffer, device_buffers[2],
								  2 * config.num_link_types * config.output_map_size.area() * sizeof(float),
								  cudaMemcpyDeviceToHost, cuda_stream));

	// block until all GPU-related operations have ended for this inference
	cudaStreamSynchronize(cuda_stream);

Is there a way to squeeze those missing 4 FPS out of the network? I am not a high-performance guy so I may be blind to some inefficiencies in my code.

Thanks,

Yinon

AastaLLL · December 18, 2019, 1:57am

Hi,

Sorry for the late reply.
Let me check this and get back to you soon.

Thanks.

AastaLLL · December 18, 2019, 2:51am

Hi,

The GitHub contains a demo script already. Would you mind to try it first?
https://github.com/NVIDIA-AI-IOT/trt_pose/blob/master/tasks/human_pose/live_demo.ipynb

For best performance, you will need to execute the following commands in order to maximize the device.

sudo nvpmodel -m 0
sudo jetson_clocks

Thanks.

yinondou · December 18, 2019, 8:24am

Tried it now, same performace.

This was the first thing I tried. I tried it through Jupyter Notebook and it started slowly and then it became so slow that it froze my computer (Jetson Nano). I therefore moved the code into a simple python file. The benchmark code did give me 22 FPS. However, when I used the CSICamera module for my CSI camera it failed to open with the following message:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/jetcam-0.0.0-py3.6.egg/jetcam/csi_camera.py", line 24, in __init__
RuntimeError: Could not read image from camera.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "live_demo.py", line 61, in <module>
    camera = CSICamera(width=WIDTH, height=HEIGHT, capture_width=1280, capture_height=720, capture_fps=60)
  File "/usr/local/lib/python3.6/dist-packages/jetcam-0.0.0-py3.6.egg/jetcam/csi_camera.py", line 27, in __init__
RuntimeError: Could not initialize camera.  Please see error trace.

Looking at the CSICamera class I see it is based on opening a GStreamer pipeline through OpenCV. It should be noted that in my C++ version I open a GStreamer pipeline through OpenCV and it works fine. Here’s the python script that I ran, based on the ipython notebook (inference only, already created optimized torch model):

import json
import trt_pose.coco
import trt_pose.models
import torch
import torch2trt
import time
import cv2
import torchvision.transforms as transforms
import PIL.Image
import ipywidgets

from torch2trt import TRTModule
from trt_pose.draw_objects import DrawObjects
from trt_pose.parse_objects import ParseObjects
from jetcam.csi_camera import CSICamera
from jetcam.utils import bgr8_to_jpeg
from IPython.display import display

WIDTH = 224
HEIGHT = 224

OPTIMIZED_MODEL = 'resnet18_baseline_att_224x224_A_epoch_249_trt.pth'

model_trt = TRTModule()
model_trt.load_state_dict(torch.load(OPTIMIZED_MODEL))

import time

data = torch.zeros((1, 3, HEIGHT, WIDTH)).cuda()

t0 = time.time()
torch.cuda.current_stream().synchronize()
for i in range(50):
    y = model_trt(data)
torch.cuda.current_stream().synchronize()
t1 = time.time()

print(50.0 / (t1 - t0))

mean = torch.Tensor([0.485, 0.456, 0.406]).cuda()
std = torch.Tensor([0.229, 0.224, 0.225]).cuda()
device = torch.device('cuda')

with open('human_pose.json', 'r') as f:
    human_pose = json.load(f)

topology = trt_pose.coco.coco_category_to_topology(human_pose)

def preprocess(image):
    global device
    device = torch.device('cuda')
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    image = PIL.Image.fromarray(image)
    image = transforms.functional.to_tensor(image).to(device)
    image.sub_(mean[:, None, None]).div_(std[:, None, None])
    return image[None, ...]

parse_objects = ParseObjects(topology)
draw_objects = DrawObjects(topology)

camera = CSICamera(width=WIDTH, height=HEIGHT, capture_fps=30)

camera.running = True

image_w = ipywidgets.Image(format='jpeg')

display(image_w)

def execute(change):
    image = change['new']
    data = preprocess(image)
    cmap, paf = model_trt(data)
    cmap, paf = cmap.detach().cpu(), paf.detach().cpu()
    counts, objects, peaks = parse_objects(cmap, paf)#, cmap_threshold=0.15, link_threshold=0.15)
    draw_objects(image, counts, objects, peaks)
    image_w.value = bgr8_to_jpeg(image[:, ::-1, :])

execute({'new': camera.value})

camera.observe(execute, names='value')

AastaLLL · December 19, 2019, 7:20am

Hi,

Let me check this internally and get back to you soon.

Thanks.

AastaLLL · December 31, 2019, 6:07am

Hi,

Sorry for the late.
It looks like this issue turn to how to enable your camera from OpenCV now.

Is your camera a CSI based camera?
If yes, could you help to check if it support 224x224 resolution first?

v4l2-ctl -l

Thanks.

yinondou · December 31, 2019, 8:46am

Yes, I have a CSI camera.

v4l2-ctl -l gives the following output:

Camera Controls

                     group_hold 0x009a2003 (bool)   : default=0 value=0 flags=execute-on-write
                    sensor_mode 0x009a2008 (int64)  : min=0 max=5 step=1 default=0 value=4 flags=slider
                           gain 0x009a2009 (int64)  : min=16 max=170 step=1 default=16 value=101 flags=slider
                       exposure 0x009a200a (int64)  : min=13 max=683709 step=1 default=2495 value=8333 flags=slider
                     frame_rate 0x009a200b (int64)  : min=2000000 max=120000000 step=1 default=120000000 value=60000003 flags=slider
                    bypass_mode 0x009a2064 (intmenu): min=0 max=1 default=0 value=1
                override_enable 0x009a2065 (intmenu): min=0 max=1 default=0 value=1
                   height_align 0x009a2066 (int)    : min=1 max=16 step=1 default=1 value=1
                     size_align 0x009a2067 (intmenu): min=0 max=2 default=0 value=0
               write_isp_format 0x009a2068 (bool)   : default=0 value=0
       sensor_signal_properties 0x009a2069 (u32)    : min=0 max=4294967295 step=1 default=0 [30][18] flags=read-only, has-payload
        sensor_image_properties 0x009a206a (u32)    : min=0 max=4294967295 step=1 default=0 [30][16] flags=read-only, has-payload
      sensor_control_properties 0x009a206b (u32)    : min=0 max=4294967295 step=1 default=0 [30][34] flags=read-only, has-payload
              sensor_dv_timings 0x009a206c (u32)    : min=0 max=4294967295 step=1 default=0 [30][16] flags=read-only, has-payload
               low_latency_mode 0x009a206d (bool)   : default=0 value=0
                   sensor_modes 0x009a2082 (int)    : min=0 max=30 step=1 default=30 value=5 flags=read-only

As I stated before, the framerate I measured was of the inference only, not including grabbing the frame or any pre-processing or post-processing.

AastaLLL · January 13, 2020, 6:49am

Hi,

Sorry for the late reply.

May I know how do you install your OpenCV python package?
Do you use our default version or rebuilt it from source?

More, could you help to print out the support matrix of openCV first?

>>> import cv2
>>> cv2.__version__
>>> print cv2.getBuildInformation()

Thanks.

yinondou · January 13, 2020, 11:31am

Hi,

I installed OpenCV 4 with CUDA according to your script in https://github.com/AastaNV/JEP/blob/master/script/install_opencv4.1.1_Jetson.sh
When looking at /usr/local/lib and /usr/local/include/opencv4 I see the libraries and includes of OpenCV 4.

However, after running your Python commands, it shows that the OpenCV version is 3.2 and that it does not use CUDA. It is possible that my C++ implementation referred to OpenCV 4 and your python script referred to OpenCV 3.2 (and that I have two OpenCV installations. Yikes!). This can also explain why it was extremely slow in Python but reasonable in C++. I will write here the output of the Python commands later when I have access to my computer.

I tried to search for the OpenCV 3.2 installation but didn’t find it yet. How can I find the sources and includes?

AastaLLL · January 14, 2020, 7:31am

Hi,

It looks like there are some different openCV package on your environment.
Please make sure you have added the compiled python package into the PYTHONPATH first.

echo 'export PYTHONPATH=$PYTHONPATH:[<Install Folder>]/opencv-4.1.1/release/python_loader/' >> ~/.bashrc
source ~/.bashrc

Thanks.

nvidia6s51v · February 13, 2020, 10:47am

Would you mind sharing your full code for the C++ equivalent of trt_pose ?

I’m trying to achieve the same thing (after being disappointed with the OpenCV performance)…

tucachmo · March 3, 2021, 8:10am

yinondou:

This was the first thing I tried. I tried it through Jupyter Notebook and it started slowly and then it became so slow that it froze my computer (Jetson Nano). I therefore moved the code into a simple python file. The benchmark code did give me 22 FPS. However, when I used the CSICamera module for my CSI camera it failed to open with the following message:

@yinondou could you please give me your test code? I did rewrite the demo notebook into python file but when it is killed when I save model. I test in Jetson nano, pytorch 1.6.0. Tks

Topic		Replies	Views
Human pose detection model (MoveNet) TensorRT conversion on NVIDIA Jetson Jetson Xavier NX tensorrt , tensorflow , jetson-inference	7	2518	June 16, 2022
X264 and TensorRT sudden reboot (MJPG encoder not affected, but not fast enough) on Jetson Orin Nano Jetson Orin Nano tensorrt , jetson-inference , gstreamer , jetson	52	797	June 18, 2024
Low FPS on Jetson Nano using TensorRT Jetson Nano tensorrt , tensorflow	7	1195	August 27, 2020
Tensorrt Inference in Real time Jetson Nano tensorrt , jetson-inference , gstreamer , python	8	1689	April 12, 2023
Fps is not increasing while doing the inference for segmentation with tensorrt am getting only 1 frame per 2 seconds i need 2 fps Data Science Workbench jetson-inference	3	470	March 13, 2023
[UPDATE 2/25/2021 to include hand pose] Real time human pose estimation on Jetson Nano (22FPS) Jetson Projects	25	20430	October 3, 2024
TensorRT ERROR: pointWiseV2Helpers.h::launchPwgenKernel::532 Cuda Driver (invalid resource handle) Jetson Xavier NX tensorrt , cuda , jetson-inference	3	2043	March 24, 2022
Extremely slow inference in TensorRT for live semantic segmentation model Jetson AGX Xavier tensorrt , tensorflow , jetson-inference	11	4329	April 12, 2022
Pose Estimation Runs Extremely Slow on Nano DeepStream SDK jetson-inference	6	694	October 12, 2021
Performance analysis on Jetson Orin Nano 8GB Jetson Nano cudnn	2	283	June 4, 2024

Pose estimation using TRT (trt_pose) - slightly lower framerates than stated in inference

Related topics