Pose Estimation Runs Extremely Slow on Nano

The .mp4 output file created by the app skips a lot of frames, resulting in a video that looks like a slideshow. I followed the instructions in this blog. The trt_pose repo says the Nano should be able to run the model at 12-22 FPS.

Am I missing some step that's important for speeding up the inference?

I’ve been using the onnx file provided in the repo (pose_estimation.onnx), because using my own converted weights causes the app to get stuck while generating the engine file:

Info from NvDsInferContextImpl::buildModel() <nvdsinfer_context_impl.cpp:1715> [UID = 1]: Trying to create engine from model files
Input filename:   /opt/nvidia/deepstream/deepstream-5.0/sources/apps/sample_apps/deepstream_pose_estimation/models/resnet18_baseline_att_224x224_A_epoch_249.onnx
ONNX IR version:  0.0.6
Opset version:    9
Producer name:    pytorch
Producer version: 1.7
Model version:    0
Doc string:       

An error I got that may be related to this is:

Setting up nvidia-l4t-bootloader (32.4.4-20201027211359) ...
Starting bootloader post-install procedure.
ERROR. Procedure for bootloader update FAILED.
Cannot install package. Exiting...
dpkg: error processing package nvidia-l4t-bootloader (--configure):
 installed nvidia-l4t-bootloader package post-installation script subprocess returned error exit status 1

This happens every time I try to install something via apt.

For reference, I can run deepstream-app samples just fine.

• Hardware Platform (Jetson / GPU)
Jetson Nano Developer Kit 4GB
• DeepStream Version
• JetPack Version (valid for Jetson only)
4.4.1, L4T 32.4.4
• TensorRT Version
• Issue Type (questions, new requirements, bugs)
• How to reproduce the issue ? (This is for bugs. Including which sample app is using, the configuration files content, the command line used and other details for reproducing)
Sample app: DeepStream Human Pose Estimation


Command line used:
sudo ./deepstream-pose-estimation-app /opt/nvidia/deepstream/deepstream-5.0/samples/streams/sample_720p.h264 .


Info from NvDsInferContextImpl::buildModel() <nvdsinfer_context_impl.cpp:1715> [UID = 1]: Trying to create engine from model files

This function takes time (on the order of minutes), since it converts the ONNX model into a TensorRT engine.
Do you get the generated engine file in the end?
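Once the engine has been generated, you can point gst-nvinfer at it so the app reuses the cached engine instead of rebuilding from ONNX on every run. A sketch of the relevant [property] keys (the file paths here are examples; adjust them to your setup):

```
[property]
onnx-file=pose_estimation.onnx
# Reuse a previously built engine instead of rebuilding from ONNX each run
model-engine-file=pose_estimation.onnx_b1_gpu0_fp16.engine
# 0=FP32, 1=INT8, 2=FP16 -- FP16 is what the Nano FPS figures assume
network-mode=2
```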

Also, please remember to maximize the Nano's performance with the following commands first:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks
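A quick way to confirm the commands took effect (a sketch; guarded so it does nothing on machines without nvpmodel, i.e. non-Jetson hosts):

```shell
# Check whether this machine has nvpmodel at all (i.e. is a Jetson)
HAS_NVPMODEL=$(command -v nvpmodel >/dev/null 2>&1 && echo yes || echo no)
if [ "$HAS_NVPMODEL" = yes ]; then
    sudo nvpmodel -q           # prints the active power mode (mode 0 = MAXN)
    sudo jetson_clocks --show  # prints current vs. max clock frequencies
fi
```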


Thanks for the suggestions. The engine file got generated after a few minutes when I use the pose_estimation.onnx weights from the repo.

I also enabled 10W mode and jetson_clocks, and verified them using jtop. I don’t see any difference in inference performance with or without enabling jetson_clocks.


Could you try the ONNX model with trtexec and share the performance with us?

/usr/src/tensorrt/bin/trtexec --onnx=[model]
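Note that trtexec benchmarks in FP32 by default, while the 12-22 FPS figures quoted for trt_pose on Nano assume FP16. A sketch of an FP16 run (file names are examples; --saveEngine caches the built engine so the minutes-long build happens only once; guarded so the snippet is a no-op on machines without trtexec):

```shell
TRTEXEC=/usr/src/tensorrt/bin/trtexec
MODEL=pose_estimation.onnx            # example path; adjust to your setup
ENGINE=pose_estimation_fp16.engine
if [ -x "$TRTEXEC" ]; then
    # Build in FP16 and save the engine for reuse on later runs
    "$TRTEXEC" --onnx="$MODEL" --fp16 --saveEngine="$ENGINE"
fi
```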


Below is the output of trtexec:

/usr/src/tensorrt/bin/trtexec --onnx=pose_estimation.onnx
&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=pose_estimation.onnx
[03/12/2021-17:03:47] [I] === Model Options ===
[03/12/2021-17:03:47] [I] Format: ONNX
[03/12/2021-17:03:47] [I] Model: pose_estimation.onnx
[03/12/2021-17:03:47] [I] Output:
[03/12/2021-17:03:47] [I] === Build Options ===
[03/12/2021-17:03:47] [I] Max batch: 1
[03/12/2021-17:03:47] [I] Workspace: 16 MB
[03/12/2021-17:03:47] [I] minTiming: 1
[03/12/2021-17:03:47] [I] avgTiming: 8
[03/12/2021-17:03:47] [I] Precision: FP32
[03/12/2021-17:03:47] [I] Calibration: 
[03/12/2021-17:03:47] [I] Safe mode: Disabled
[03/12/2021-17:03:47] [I] Save engine: 
[03/12/2021-17:03:47] [I] Load engine: 
[03/12/2021-17:03:47] [I] Builder Cache: Enabled
[03/12/2021-17:03:47] [I] NVTX verbosity: 0
[03/12/2021-17:03:47] [I] Inputs format: fp32:CHW
[03/12/2021-17:03:47] [I] Outputs format: fp32:CHW
[03/12/2021-17:03:47] [I] Input build shapes: model
[03/12/2021-17:03:47] [I] Input calibration shapes: model
[03/12/2021-17:03:47] [I] === System Options ===
[03/12/2021-17:03:47] [I] Device: 0
[03/12/2021-17:03:47] [I] DLACore: 
[03/12/2021-17:03:47] [I] Plugins:
[03/12/2021-17:03:47] [I] === Inference Options ===
[03/12/2021-17:03:47] [I] Batch: 1
[03/12/2021-17:03:47] [I] Input inference shapes: model
[03/12/2021-17:03:47] [I] Iterations: 10
[03/12/2021-17:03:47] [I] Duration: 3s (+ 200ms warm up)
[03/12/2021-17:03:47] [I] Sleep time: 0ms
[03/12/2021-17:03:47] [I] Streams: 1
[03/12/2021-17:03:47] [I] ExposeDMA: Disabled
[03/12/2021-17:03:47] [I] Spin-wait: Disabled
[03/12/2021-17:03:47] [I] Multithreading: Disabled
[03/12/2021-17:03:47] [I] CUDA Graph: Disabled
[03/12/2021-17:03:47] [I] Skip inference: Disabled
[03/12/2021-17:03:47] [I] Inputs:
[03/12/2021-17:03:47] [I] === Reporting Options ===
[03/12/2021-17:03:47] [I] Verbose: Disabled
[03/12/2021-17:03:47] [I] Averages: 10 inferences
[03/12/2021-17:03:47] [I] Percentile: 99
[03/12/2021-17:03:47] [I] Dump output: Disabled
[03/12/2021-17:03:47] [I] Profile: Disabled
[03/12/2021-17:03:47] [I] Export timing to JSON file: 
[03/12/2021-17:03:47] [I] Export output to JSON file: 
[03/12/2021-17:03:47] [I] Export profile to JSON file: 
[03/12/2021-17:03:47] [I] 
Input filename:   pose_estimation.onnx
ONNX IR version:  0.0.4
Opset version:    7
Producer name:    pytorch
Producer version: 1.3
Model version:    0
Doc string:       
[03/12/2021-17:03:52] [W] [TRT] onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[03/12/2021-17:04:08] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[03/12/2021-17:05:36] [I] [TRT] Detected 1 inputs and 2 output network tensors.
[03/12/2021-17:05:37] [I] Starting inference threads
[03/12/2021-17:05:40] [I] Warmup completed 3 queries over 200 ms
[03/12/2021-17:05:40] [I] Timing trace has 41 queries over 3.13997 s
[03/12/2021-17:05:40] [I] Trace averages of 10 runs:
[03/12/2021-17:05:40] [I] Average on 10 runs - GPU latency: 76.4785 ms - Host latency: 76.6179 ms (end to end 76.6277 ms, enqueue 4.49985 ms)
[03/12/2021-17:05:40] [I] Average on 10 runs - GPU latency: 76.3778 ms - Host latency: 76.5175 ms (end to end 76.527 ms, enqueue 4.71587 ms)
[03/12/2021-17:05:40] [I] Average on 10 runs - GPU latency: 76.4961 ms - Host latency: 76.6364 ms (end to end 76.6459 ms, enqueue 4.70577 ms)
[03/12/2021-17:05:40] [I] Average on 10 runs - GPU latency: 76.4039 ms - Host latency: 76.5431 ms (end to end 76.5528 ms, enqueue 5.51592 ms)
[03/12/2021-17:05:40] [I] Host Latency
[03/12/2021-17:05:40] [I] min: 76.1094 ms (end to end 76.1189 ms)
[03/12/2021-17:05:40] [I] max: 77.2442 ms (end to end 77.254 ms)
[03/12/2021-17:05:40] [I] mean: 76.5745 ms (end to end 76.584 ms)
[03/12/2021-17:05:40] [I] median: 76.5608 ms (end to end 76.5701 ms)
[03/12/2021-17:05:40] [I] percentile: 77.2442 ms at 99% (end to end 77.254 ms at 99%)
[03/12/2021-17:05:40] [I] throughput: 13.0575 qps
[03/12/2021-17:05:40] [I] walltime: 3.13997 s
[03/12/2021-17:05:40] [I] Enqueue Time
[03/12/2021-17:05:40] [I] min: 1.70435 ms
[03/12/2021-17:05:40] [I] max: 9.69971 ms
[03/12/2021-17:05:40] [I] median: 4.56909 ms
[03/12/2021-17:05:40] [I] GPU Compute
[03/12/2021-17:05:40] [I] min: 75.9717 ms
[03/12/2021-17:05:40] [I] max: 77.1038 ms
[03/12/2021-17:05:40] [I] mean: 76.4349 ms
[03/12/2021-17:05:40] [I] median: 76.4221 ms
[03/12/2021-17:05:40] [I] percentile: 77.1038 ms at 99%
[03/12/2021-17:05:40] [I] total compute time: 3.13383 s
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=pose_estimation.onnx

There is no update from you for a period, assuming this is not an issue any more.
Hence we are closing this topic. If need further support, please open a new one.


Based on the TensorRT profiling data (~77 ms per inference, ~13 qps), the model should be able to reach 12 fps.
May I know the exact performance you get with DeepStream?
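For reference, the conversion from the measured latency above to throughput is just the reciprocal (variable names here are mine, not from trtexec):

```shell
# Mean host latency reported by trtexec above, in milliseconds
LATENCY_MS=76.57
# Effective throughput = 1000 / latency (awk handles the floating point)
FPS=$(awk -v l="$LATENCY_MS" 'BEGIN { printf "%.1f", 1000 / l }')
echo "$FPS"   # prints 13.1, matching the ~13 qps trtexec reports
```

So DeepStream's per-frame inference budget at this latency is roughly 13 fps before any pipeline overhead.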