Low FPS after successfully deploying TLT to DeepStream

Hello, I have successfully deployed a TLT model to DeepStream, but when I run it I get very low FPS, around 4 - 5 fps. I used ResNet18 as the pretrained model, and the resolution of my dataset is about 1280 x 720. Do you have any solution? Is it because my dataset's resolution is too large?

Hi m.billson16,
Which platform did you run on, Nano or Xavier? Which detection network did you train: detectnet_v2, SSD, or Faster-RCNN?
Also, could you paste the command line you ran along with the logs?
Thanks a lot.

I run my program on a Jetson Nano using detectnet_v2 (FP16 mode). I'm streaming from a USB camera to detect the objects in my dataset.

Here is my running command:

deepstream-app -c /home/deepstream/Desktop/TA/source1_usb_dec_infer_resnet_fp16.txt

My config_infer_primary.txt

[property]
gpu-id=0
# preprocessing parameters.
net-scale-factor=0.0039215697906911373
model-color-format=0

# model paths.
labelfile-path=/home/deepstream/Desktop/TA/labels.txt
tlt-encoded-model=/home/deepstream/Desktop/TA/pruned/resnet18_detector_fp16.etlt
tlt-model-key=dWhrajZsbWtobW8wZ2UycmhnaDdqZmw3cGg6MWNhZGU2NTYtNjA5Yy00ZWQ0LTgxZTktYzE4ZmZkOWI4NWI1
input-dims=3;720;1280;0 # where c = number of channels, h = height of the model input, w = width of model input, 0: implies CHW format.
uff-input-blob-name=input_1
batch-size=4 
## 0=FP32, 1=INT8, 2=FP16 mode
network-mode=2
num-detected-classes=3
interval=0
gie-unique-id=1
is-classifier=0
output-blob-names=output_cov/Sigmoid;output_bbox/BiasAdd
#enable_dbscan=0

[class-attrs-all]
threshold=0.2
group-threshold=1
## Set eps=0.7 and minBoxes for enable-dbscan=1
eps=0.2
#minBoxes=3
roi-top-offset=0
roi-bottom-offset=0
detected-min-w=0
detected-min-h=0
detected-max-w=0
detected-max-h=0

My deepstream-app config (source1_usb_dec_infer_resnet_fp16.txt):

# Copyright (c) 2018 NVIDIA Corporation.  All rights reserved.
#
# NVIDIA Corporation and its licensors retain all intellectual property
# and proprietary rights in and to this software, related documentation
# and any modifications thereto.  Any use, reproduction, disclosure or
# distribution of this software and related documentation without an express
# license agreement from NVIDIA Corporation is strictly prohibited.

[application]
enable-perf-measurement=1
perf-measurement-interval-sec=5
#gie-kitti-output-dir=streamscl

[tiled-display]
enable=1
rows=1
columns=1
width=1280
height=720

[source0]
enable=1
#Type - 1=CameraV4L2 2=URI 3=MultiURI
type=1
camera-width=1280
camera-height=720
camera-fps-n=30
camera-fps-d=1
camera-v4l2-dev-node=0

[sink0]
enable=1
#Type - 1=FakeSink 2=EglSink 3=File 4=RTSPStreaming 5=Overlay
type=5
sync=0
display-id=0
offset-x=0
offset-y=0
width=0
height=0
overlay-id=1
source-id=0

[sink1]
enable=0
type=3
#1=mp4 2=mkv
container=1
#1=h264 2=h265 3=mpeg4
codec=1
sync=0
bitrate=2000000
output-file=out.mp4
source-id=0

[sink2]
enable=0
#Type - 1=FakeSink 2=EglSink 3=File 4=RTSPStreaming 5=Overlay
type=4
#1=h264 2=h265
codec=1
sync=0
bitrate=4000000
# set below properties in case of RTSPStreaming
rtsp-port=8554
udp-port=5400


[osd]
enable=1
border-width=2
text-size=15
text-color=1;1;1;1;
text-bg-color=0.3;0.3;0.3;1
font=Serif
show-clock=0
clock-x-offset=800
clock-y-offset=820
clock-text-size=12
clock-color=1;0;0;0

[streammux]
##Boolean property to inform muxer that sources are live
live-source=1
batch-size=1
##time out in usec, to wait after the first buffer is available
##to push the batch even if the complete batch is not formed
batched-push-timeout=40000
## Set muxer output width and height
width=480
height=272

# config-file property is mandatory for any gie section.
# Other properties are optional and if set will override the properties set in
# the infer config file.
[primary-gie]
enable=1
model-engine-file=/home/deepstream/Desktop/TA/pruned/resnet18_detector_fp16.engine
#Required to display the PGIE labels, should be added even when using config-file
#property
batch-size=1
#Required by the app for OSD, not a plugin property
bbox-border-color0=1;0;0;1
bbox-border-color1=0;1;1;1
bbox-border-color2=0;0;1;1
bbox-border-color3=0;1;0;1
interval=0
#Required by the app for SGIE, when used along with config-file property
gie-unique-id=1
config-file=config_infer_primary.txt

[tests]
file-loop=0

Hi m.billson16,
Would you please check or try the items below? Thanks.
1) Could you please run the following and paste the result?

$ /usr/src/tensorrt/bin/trtexec --loadEngine=/home/deepstream/Desktop/TA/pruned/resnet18_detector_fp16.engine --fp16 --batch=1 --iterations=20 --output=output_cov/Sigmoid,output_bbox/BiasAdd --useSpinWait

2) Did you ever run “tlt-prune” and then retrain to get a pruned tlt model? What is the prune ratio? You can find it in the “tlt-prune” log.
3) If yes, what are the sizes of the pruned tlt model, your etlt model, and your resnet18_detector_fp16.engine?
4) What was the “-b” setting when you ran “tlt-converter”? Can you paste the command line you used?
5) If you have already generated a trt engine, please consider replacing

tlt-encoded-model=/home/deepstream/Desktop/TA/pruned/resnet18_detector_fp16.etlt
tlt-model-key=dWhrajZsbWtobW8wZ2UycmhnaDdqZmw3cGg6MWNhZGU2NTYtNjA5Yy00ZWQ0LTgxZTktYzE4ZmZkOWI4NWI1

with

model-engine-file=/home/deepstream/Desktop/TA/pruned/resnet18_detector_fp16.engine

6) What FPS do you get when you run deepstream-app against a local file instead of the stream from the USB camera?

Hello Morganh, thank you for your help.

For point number 1, here is the result:

[I] loadEngine: /home/deepstream/Desktop/TA/pruned/resnet18_detector_fp16.engine
[I] fp16
[I] batch: 1
[I] iterations: 20
[I] output: output_cov/Sigmoid,output_bbox/BiasAdd
[I] useSpinWait
[E] [TRT] The engine plan file is not compatible with this version of TensorRT, expecting library version 5.1.6 got 5.1.5, please rebuild.
[I] /home/deepstream/Desktop/TA/pruned/resnet18_detector_fp16.engine has been successfully loaded.
[E] Engine could not be created
&&&& FAILED TensorRT.trtexec # ./trtexec --loadEngine=/home/deepstream/Desktop/TA/pruned/resnet18_detector_fp16.engine --fp16 --batch=1 --iterations=20 --output=output_cov/Sigmoid,output_bbox/BiasAdd --useSpinWait
2) I ran tlt-prune to get the pruned model. The prune ratio is 1, because the threshold is about 5.2e-6.

3) My model size is about 43 MB.

4) My -b is 10.
Command line:

tlt-converter $USER_EXPERIMENT_DIR/experiment_dir_final/resnet18_detector_fp16.etlt \
   -k $KEY \
   -o output_cov/Sigmoid,outtput_bbox/BiasAdd \
   -d 3,720,1280 \
   -m 16 \
   -t fp16\
   -e $USER_EXPERIMENT_DIR/experiment_dir_final/resnet18_detector_fp16.engine \
   -w 50000000 \
   -b 10
5) I tried, but I ended up with an error.

6) Around 30 fps.

Do you have any idea?

Hi m.billson16,
Where did you generate resnet18_detector_fp16.engine? On the Nano?
If not, please download the Jetson platform version of tlt-converter (https://developer.nvidia.com/tlt-converter) and run it on the Nano to generate the trt engine.

Then check whether items 1 and 5 are unblocked.

Hello Morganh, I tried to use tlt-converter on the Jetson platform, but my Jetson Nano's screen froze for a long time, and I think the conversion from etlt to engine has still not finished. Do you have any solution?

Hello Morganh, I solved this problem.

When I ran

/usr/src/tensorrt/bin/trtexec --loadEngine=/home/deepstream/Desktop/TA/pruned/resnet18_detector_fp16.engine --fp16 --batch=1 --iterations=20 --output=output_cov/Sigmoid,output_bbox/BiasAdd --useSpinWait

the result is like this:

&&&& RUNNING TensorRT.trtexec # ./trtexec --loadEngine=/home/deepstream/Desktop/TA/resnet18_detector_fp16.engine --fp16 --batch=1 --iterations=20 --output=output_cov/Sigmoid,output_bbox/BiasAdd --useSpinWait
[I] loadEngine: /home/deepstream/Desktop/TA/resnet18_detector_fp16.engine
[I] fp16
[I] batch: 1
[I] iterations: 20
[I] output: output_cov/Sigmoid,output_bbox/BiasAdd
[I] useSpinWait
[I] /home/deepstream/Desktop/TA/resnet18_detector_fp16.engine has been successfully loaded.
[I] Average over 10 runs is 230.336 ms (host walltime is 230.424 ms, 99% percentile time is 327.723).
[I] Average over 10 runs is 219.831 ms (host walltime is 219.874 ms, 99% percentile time is 221.005).
[I] Average over 10 runs is 219.561 ms (host walltime is 219.603 ms, 99% percentile time is 220.26).
[I] Average over 10 runs is 220.107 ms (host walltime is 220.15 ms, 99% percentile time is 224.193).
[I] Average over 10 runs is 219.986 ms (host walltime is 220.029 ms, 99% percentile time is 223.398).
[I] Average over 10 runs is 220.043 ms (host walltime is 220.084 ms, 99% percentile time is 223.812).
[I] Average over 10 runs is 220.165 ms (host walltime is 220.206 ms, 99% percentile time is 223.842).
[I] Average over 10 runs is 220.15 ms (host walltime is 220.196 ms, 99% percentile time is 223.773).
[I] Average over 10 runs is 219.909 ms (host walltime is 219.951 ms, 99% percentile time is 222.098).
[I] Average over 10 runs is 224.225 ms (host walltime is 224.27 ms, 99% percentile time is 229.449).
[I] Average over 10 runs is 225.472 ms (host walltime is 225.532 ms, 99% percentile time is 239.276).
[I] Average over 10 runs is 221.678 ms (host walltime is 221.736 ms, 99% percentile time is 230.706).
[I] Average over 10 runs is 223.566 ms (host walltime is 223.61 ms, 99% percentile time is 233.467).
[I] Average over 10 runs is 223.958 ms (host walltime is 224.009 ms, 99% percentile time is 231.804).
[I] Average over 10 runs is 221.063 ms (host walltime is 221.103 ms, 99% percentile time is 224.068).
[I] Average over 10 runs is 224.265 ms (host walltime is 224.306 ms, 99% percentile time is 230.746).
[I] Average over 10 runs is 220.096 ms (host walltime is 220.136 ms, 99% percentile time is 223.987).
[I] Average over 10 runs is 225.106 ms (host walltime is 225.163 ms, 99% percentile time is 232.841).
[I] Average over 10 runs is 222.018 ms (host walltime is 222.079 ms, 99% percentile time is 227.116).
[I] Average over 10 runs is 220.158 ms (host walltime is 220.199 ms, 99% percentile time is 224.125).
&&&& PASSED TensorRT.trtexec # ./trtexec --loadEngine=/home/deepstream/Desktop/TA/resnet18_detector_fp16.engine --fp16 --batch=1 --iterations=20 --output=output_cov/Sigmoid,output_bbox/BiasAdd --useSpinWait

Do you have any idea?
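
For reference, the trtexec averages above can be turned into a rough upper bound on throughput. This is only a sketch: it assumes each reported average is the latency of one batch (batch=1 in this run) and ignores every other stage of the pipeline. The helper name latency_ms_to_fps is hypothetical.

```python
def latency_ms_to_fps(latency_ms, batch_size=1):
    """Convert a per-batch inference latency in milliseconds to frames per second."""
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return batch_size * 1000.0 / latency_ms


# A few of the averages reported by trtexec above (milliseconds, batch=1).
averages_ms = [219.561, 220.107, 224.225, 230.336]

for ms in averages_ms:
    print(f"{ms:.3f} ms -> {latency_ms_to_fps(ms):.2f} fps")
```

At roughly 220 ms per frame this works out to about 4.5 fps, which is in line with the 4 - 5 fps seen with the USB camera, so the engine's inference time alone would be enough to explain that frame rate.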

Hi m.billson16 ,
First, could you please run "sudo nvpmodel -q 0" and "sudo jetson_clocks" on your Nano?
Second, can you paste the command line you ran?
Last, please press Ctrl+C and run again. It should not freeze for a long time.

Hello Morganh, thanks for the help. Fortunately, I solved the freeze problem.

For sudo nvpmodel -q 0, I got this result

NVPM WARN: fan mode is not set!
NV Power Mode: MAXN
0

But I got no output when I ran sudo jetson_clocks.

Also, which command line should I paste here?

Hi m.billson16,
Running “sudo jetson_clocks” will set the CPU/EMC/GPU clocks to max.
Could you please paste the command lines you used for “tlt-export” (on x86_64) and “tlt-converter” (on the Nano)?

Hello Morganh, thanks for the help.

Here is my tlt-export command line

!tlt-export $USER_EXPERIMENT_DIR/experiment_dir_retrain/weights/resnet18_detector_pruned.tlt \
            -o $USER_EXPERIMENT_DIR/experiment_dir_final/resnet18_detector_fp16.etlt \
            --outputs output_cov/Sigmoid,output_bbox/BiasAdd \
            -k $KEY \
            --input_dims 3,720,1280 \
            --max_workspace_size 1073741824 \
            --export_module detectnet_v2 \
            --data_type fp16 \
            --batches 10 \
            --cal_batch_size 4 \
            --verbose

and here is the tlt-converter command line in nano

./tlt-converter $USER_EXPERIMENT_DIR/experiment_dir_final/resnet18_detector_fp16.etlt -k dWhrajZsbWtobW8wZ2UycmhnaDdqZmw3cGg6MWNhZGU2NTYtNjA5Yy00ZWQ0LTgxZTktYzE4ZmZkOWI4NWI1 -o output_cov/Sigmoid,output_bbox/BiasAdd -d 3,720,1280 -t fp16 -e /home/deepstream/Desktop/TA/pruned/resnet18_detector_fp16.engine -w 50000000 -b 10

Do you have any idea?

Also, please consider some further experiments:

  1. Try running “tlt-prune” to prune your trained tlt model (your prune ratio of 1 means the tlt model is not pruned), then retrain to get a pruned tlt model. Then export and generate the trt engine again to test.
  2. Try resizing your 1280x720 dataset offline to a smaller size, then train --> prune --> retrain --> export --> generate the trt engine again to test.

BTW, regarding my previous item (6), you mentioned that you get 30 fps when you run deepstream-app against a local file. Can you confirm that you ran this with the generated trt engine? If yes, that means the fps drops from 30 fps to 4~5 fps with the USB camera. So is there an issue or bottleneck with the USB camera, or something else?

Hello Morganh, thanks for the idea. I actually ran another experiment: I changed the prune threshold to 0.5, but the pruning ratio is still 1.0.

!tlt-prune -pm $USER_EXPERIMENT_DIR/experiment_dir_unpruned/weights/resnet18_detector.tlt \
           -o $USER_EXPERIMENT_DIR/experiment_dir_pruned/ \
           -eq union \
           -pth 0.5 \
           -k $KEY
Using TensorFlow backend.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2019-12-03 18:35:25,732 [WARNING] tensorflow: From /usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2019-12-03 18:35:27.083059: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-12-03 18:35:27.132341: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-03 18:35:27.132846: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x7ae6c00 executing computations on platform CUDA. Devices:
2019-12-03 18:35:27.132890: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): GeForce GTX 950M, Compute Capability 5.0
2019-12-03 18:35:27.154163: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2593765000 Hz
2019-12-03 18:35:27.154801: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x7bfed50 executing computations on platform Host. Devices:
2019-12-03 18:35:27.154838: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2019-12-03 18:35:27.155037: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: GeForce GTX 950M major: 5 minor: 0 memoryClockRate(GHz): 1.124
pciBusID: 0000:0a:00.0
totalMemory: 3.95GiB freeMemory: 3.75GiB
2019-12-03 18:35:27.155076: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-12-03 18:35:27.155805: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-12-03 18:35:27.155831: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-12-03 18:35:27.155845: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-12-03 18:35:27.155941: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3535 MB memory) -> physical GPU (device: 0, name: GeForce GTX 950M, pci bus id: 0000:0a:00.0, compute capability: 5.0)
2019-12-03 18:35:28,668 [INFO] modulus.pruning.pruning: Exploring graph for retainable indices
2019-12-03 18:35:29,197 [INFO] modulus.pruning.pruning: Pruning model and appending pruned nodes to new graph
2019-12-03 18:35:49,277 [INFO] iva.common.magnet_prune: Pruning ratio (pruned model / original model): 1.0

Do you have any idea?

Hello Morganh, I resized my dataset from 1280 x 720 to 640 x 480, but I face the same problem: low FPS (around 11-12 fps). Sometimes the objects in my dataset aren't detected, while other items that were not part of my dataset are detected. Do you have any idea?
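
As a rough sanity check on these numbers, the sketch below compares the pixel counts and aspect ratios of the two resolutions. It assumes inference cost scales roughly with the number of input pixels, which is only an approximation; the helper names are hypothetical.

```python
def pixel_ratio(old_size, new_size):
    """Return how many times more pixels old_size has than new_size (sizes are (width, height))."""
    old_w, old_h = old_size
    new_w, new_h = new_size
    return (old_w * old_h) / (new_w * new_h)


def aspect_ratio(size):
    """Return width divided by height for a (width, height) pair."""
    width, height = size
    return width / height


old, new = (1280, 720), (640, 480)
print(f"pixel ratio: {pixel_ratio(old, new):.2f}")
print(f"aspect ratios: {aspect_ratio(old):.3f} -> {aspect_ratio(new):.3f}")
```

The input has 3 times fewer pixels at 640 x 480, so a jump from 4 - 5 fps to 11-12 fps is about what this estimate suggests. The aspect ratio also changes from 16:9 to 4:3, so a plain resize stretches the images, and the labels have to be scaled by different factors in x and y.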

Did you write a script to resize the corresponding bboxes (xmin, ymin, xmax, ymax) in all the label text files?
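
A minimal sketch of such a script is shown below. It assumes the labels are KITTI-format text files, where fields 5 to 8 of each line hold xmin, ymin, xmax, ymax; the function names and any directory paths passed in are hypothetical and should be adapted to the actual dataset layout.

```python
from pathlib import Path


def scale_kitti_line(line, sx, sy):
    """Scale the bbox fields (xmin, ymin, xmax, ymax) of one KITTI label line."""
    fields = line.split()
    if len(fields) < 8:
        raise ValueError(f"not a KITTI label line: {line!r}")
    xmin, ymin, xmax, ymax = (float(v) for v in fields[4:8])
    fields[4:8] = [
        f"{xmin * sx:.2f}",
        f"{ymin * sy:.2f}",
        f"{xmax * sx:.2f}",
        f"{ymax * sy:.2f}",
    ]
    return " ".join(fields)


def scale_label_dir(src_dir, dst_dir, old_size, new_size):
    """Write scaled copies of every *.txt label file in src_dir into dst_dir."""
    sx = new_size[0] / old_size[0]  # horizontal scale factor
    sy = new_size[1] / old_size[1]  # vertical scale factor
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for src_file in sorted(Path(src_dir).glob("*.txt")):
        lines = src_file.read_text().splitlines()
        scaled = [scale_kitti_line(l, sx, sy) for l in lines if l.strip()]
        (dst / src_file.name).write_text("\n".join(scaled) + ("\n" if scaled else ""))


# Example: one made-up label line, rescaled from 1280x720 to 640x480.
example = "car 0.00 0 0.00 100.00 200.00 300.00 400.00 0 0 0 0 0 0 0"
print(scale_kitti_line(example, 640 / 1280, 480 / 720))
```

Because 1280 x 720 to 640 x 480 changes the aspect ratio, the x and y coordinates are scaled by different factors (0.5 and about 0.667). Labels left at their original coordinates would no longer match the resized images, which could explain missed or wrong detections.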

The pruning ratio still being 1.0 does not make sense. Could you please try more pth values? Thanks.