peopleNet in deepstream-l4t:6.0-samples runs slow on Jetson NX

boazkonig · January 10, 2022, 10:23pm

• Hardware Platform (Jetson / GPU)

Jetson NX on Floyd FLD-BB01 carrier board.
deviceQuery gives the following output:

./deviceQuery Starting…
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "Xavier"
CUDA Driver Version / Runtime Version 10.2 / 10.2
CUDA Capability Major/Minor version number: 7.2
<…some more stuff, then at the end…>
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.2, CUDA Runtime Version = 10.2, NumDevs = 1
Result = PASS

• DeepStream Version

Followed Jetson Setup on this page: Quickstart Guide — DeepStream 6.3 Release documentation
…and used DeepStream SDK according to Method 4 with container: nvcr.io/nvidia/deepstream-l4t:6.0-samples

• JetPack Version (valid for Jetson only)

JetPack 4.6 installed from instructions on this page: How to Install JetPack :: NVIDIA JetPack Documentation
Due to limited (16GB) disk space, the following method from that page was used:

If disk space is limited (for example, when using a 16GB microSD card with a Jetson Nano or Jetson Xavier NX developer kit), use these commands:

sudo apt update
apt depends nvidia-jetpack | awk '{print $2}' | xargs -I {} sudo apt install -y {}

• TensorRT Version

The command:
dpkg -l | grep TensorRT

gives:
ii graphsurgeon-tf 8.0.1-1+cuda10.2 arm64 GraphSurgeon for TensorRT package
ii libnvinfer-bin 8.0.1-1+cuda10.2 arm64 TensorRT binaries
ii libnvinfer-dev 8.0.1-1+cuda10.2 arm64 TensorRT development libraries and headers
ii libnvinfer-doc 8.0.1-1+cuda10.2 all TensorRT documentation
ii libnvinfer-plugin-dev 8.0.1-1+cuda10.2 arm64 TensorRT plugin libraries
ii libnvinfer-plugin8 8.0.1-1+cuda10.2 arm64 TensorRT plugin libraries
ii libnvinfer-samples 8.0.1-1+cuda10.2 all TensorRT samples
ii libnvinfer8 8.0.1-1+cuda10.2 arm64 TensorRT runtime libraries
ii libnvonnxparsers-dev 8.0.1-1+cuda10.2 arm64 TensorRT ONNX libraries
ii libnvonnxparsers8 8.0.1-1+cuda10.2 arm64 TensorRT ONNX libraries
ii libnvparsers-dev 8.0.1-1+cuda10.2 arm64 TensorRT parsers libraries
ii libnvparsers8 8.0.1-1+cuda10.2 arm64 TensorRT parsers libraries
ii nvidia-container-csv-tensorrt 8.0.1.6-1+cuda10.2 arm64 Jetpack TensorRT CSV file
ii nvidia-tensorrt 4.6-b199 arm64 NVIDIA TensorRT Meta Package
ii python3-libnvinfer 8.0.1-1+cuda10.2 arm64 Python 3 bindings for TensorRT
ii python3-libnvinfer-dev 8.0.1-1+cuda10.2 arm64 Python 3 development package for TensorRT
ii tensorrt 8.0.1.6-1+cuda10.2 arm64 Meta package of TensorRT
ii uff-converter-tf 8.0.1-1+cuda10.2 arm64 UFF converter for TensorRT package

• NVIDIA GPU Driver Version (valid for GPU only)

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.2, CUDA Runtime Version = 10.2, NumDevs = 1

• Issue Type( questions, new requirements, bugs)

When running the peopleNet sample, the speed is not even close to the 157 fps that Nvidia claims.
Only about 30 fps is achieved.

Also, the ‘out-of-the-box’ sample setup seems to have an issue, as a HUGE number of warnings are produced.
See the attached file, which contains all the run-time output.

• How to reproduce the issue ? (This is for bugs. Including which sample app is using, the configuration files content, the command line used and other details for reproducing)

Add the peopleNet samples according to the instructions in the README.md file inside the directory: /opt/nvidia/deepstream/deepstream-6.0/samples/configs/tao_pretrained_models

All config files are used as-is without any modifications. Then run the demo with:
deepstream-app -c deepstream_app_source1_peoplenet.txt

Thank you for your support.
peoplenet_run_output.txt (130.3 KB)

AastaLLL · January 11, 2022, 4:09am

Hi,

It is limited to 30 fps due to display rendering.
You can turn off sync and add more input sources to get the maximal fps.

Ex.

diff --git a/deepstream_app_source1_peoplenet.txt b/deepstream_app_source1_peoplenet.txt
index eec215f..0723089 100644
--- a/deepstream_app_source1_peoplenet.txt
+++ b/deepstream_app_source1_peoplenet.txt
@@ -26,8 +26,8 @@ perf-measurement-interval-sec=1

 [tiled-display]
 enable=1
-rows=1
-columns=1
+rows=2
+columns=2
 width=1280
 height=720
 gpu-id=0
@@ -36,13 +36,13 @@ gpu-id=0
 enable=1
 #Type - 1=CameraV4L2 2=URI 3=MultiURI
 type=3
-num-sources=1
+num-sources=4
 uri=file://../../streams/sample_1080p_h265.mp4
 gpu-id=0

 [streammux]
 gpu-id=0
-batch-size=1
+batch-size=4
 batched-push-timeout=40000
 ## Set muxer output width and height
 width=1920
@@ -52,7 +52,7 @@ height=1080
 enable=1
 #Type - 1=FakeSink 2=EglSink 3=File
 type=2
-sync=1
+sync=0
 source-id=0
 gpu-id=0

@@ -69,8 +69,8 @@ font=Arial
 enable=1
 gpu-id=0
 # Modify as necessary
-model-engine-file=../../models/tao_pretrained_models/peopleNet/V2.1/resnet34_peoplenet_pruned.etlt_b1_gpu0_fp16.engine
-batch-size=1
+model-engine-file=../../models/tao_pretrained_models/peopleNet/V2.1/resnet34_peoplenet_pruned_int8.etlt_b4_gpu0_int8.engine
+batch-size=4
 #Required by the app for OSD, not a plugin property
 bbox-border-color0=1;0;0;1
 bbox-border-color1=0;1;1;1

Thanks.

boazkonig · January 12, 2022, 9:07pm

Thank you AastaLLL

I have made sure device performance is optimized with:
$ sudo nvpmodel -m 0
$ sudo jetson_clocks

I have also updated the config files as per your suggestion.
Note the following:
With enable=1, sync=0 for [sink0]: fps is ±60
With enable=0 for [sink0]: fps is ±91
With enable=0 for [tracker]: fps is ±136
(The files are attached again)
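
For reference, the measurements above correspond to settings along these lines in deepstream_app_source1_peoplenet.txt. This is only a sketch using the keys visible in the diff earlier in the thread, not a copy of the attached file; the fps figures in the comments are the ones reported above.

```
[sink0]
# enable=1 with sync=0: renders without clock sync (about 60 fps reported)
# enable=0: no sink at all (about 91 fps reported)
enable=0
#Type - 1=FakeSink 2=EglSink 3=File
type=2
sync=0

[tracker]
# enable=0: tracking skipped (about 136 fps reported)
enable=0
```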

Questions:

1. The maximum of 136 fps with the tracker disabled is still below the 157 fps which Nvidia says can be achieved. What more can I do to get to 157 fps?

2. The warnings (over 700 of them) shown in the previously attached file “peoplenet_run_output.txt” still appear the first time the app runs. These warnings all have the form:
“WARNING: [TRT]: Missing scale and zero-point for tensor , expect fall back to non-int8 implementation for any layer consuming or producing given tensor”
This warning mentions a fallback to a non-INT8 implementation. What can be done to fix this?

3. I have not used them, but for information: can the pruned peoplenet models from tlt be used with the deepstream6.0 sample apps? (e.g. https://api.ngc.nvidia.com/v2/models/nvidia/tlt_peoplenet/versions/pruned_quantized_v2.1/files/resnet34_peoplenet_pruned_int8.etlt )

Thank you for your support.
config_infer_primary_peoplenet.txt (2.4 KB)
deepstream_app_source1_peoplenet.txt (3.5 KB)

AastaLLL · January 17, 2022, 7:35am

Hi,

1. Have you added the multiple sources as mentioned above?
This will run multiple inputs concurrently.

2. You can simply ignore the warning.
For more information, please check the below topic:

3. Deepstream uses TensorRT as the inference engine.
It first converts the .etlt file into a TensorRT engine and then deploys it.
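
In practice this means the nvinfer config points at the .etlt file, and the engine is typically built on the first run and reused afterwards. A minimal sketch of the relevant [property] keys follows, assuming the pruned INT8 PeopleNet model linked above. The tlt-model-key value and the calibration file name are assumptions and should be checked against the NGC model card and the README; the engine path is the one from the diff earlier in the thread.

```
[property]
gpu-id=0
# Encoded TLT/TAO model and its key (key value is an assumption)
tlt-model-key=tlt_encode
tlt-encoded-model=../../models/tao_pretrained_models/peopleNet/V2.1/resnet34_peoplenet_pruned_int8.etlt
# INT8 calibration cache (file name is an assumption)
int8-calib-file=../../models/tao_pretrained_models/peopleNet/V2.1/resnet34_peoplenet_pruned_int8.txt
# Engine file that is loaded if present, otherwise generated from the .etlt
model-engine-file=../../models/tao_pretrained_models/peopleNet/V2.1/resnet34_peoplenet_pruned_int8.etlt_b4_gpu0_int8.engine
# 0=FP32, 1=INT8, 2=FP16
network-mode=1
batch-size=4
```

The batch-size here should match the [streammux] batch-size in the app config, as in the diff above.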