Problems with ONNX model zoo -> trtexec -> DeepStream 6.0 pipeline

Description

ONNX models from the model zoo produce poor results in DeepStream (low fps, stuttering output; the actual annotations are good)

Hi, we’re looking to run YOLOv4 object detection models in DeepStream. Unfortunately it’s not working at the minute. Our current process is:

  • Download a YOLOv4 model from the ONNX model zoo (GitHub - onnx/models: A collection of pre-trained, state-of-the-art models in the ONNX format)
  • Convert it with trtexec on the target device (Jetson NX running JP4.6, DS6.0):
    /usr/src/tensorrt/bin/trtexec --onnx=/data/models/yolov4_onnx.onnx --saveEngine=/data/models/yolov4_coco_dynamic_kxm.engine --explicitBatch --minShapes=input:1x3x416x416 --optShapes=input:4x3x416x416 --maxShapes=input:16x3x416x416
    This works, and I can test it with:
    /usr/src/tensorrt/bin/trtexec --loadEngine=/data/models/yolov4_coco_kxm.engine --batch=4 --iterations=100 --avgRuns=10 --dumpProfile --dumpOutput --useCudaGraph
    All okay.
    However, when I come to run it in DeepStream with mp4 inputs, the output stutters (it runs for maybe 0.5-1 seconds and then pauses), and the fps is very low (10-15 fps when the input videos are 30 fps)
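As a sanity check on the standalone engine, throughput can be estimated from the mean latency trtexec reports with --dumpProfile. The numbers below are placeholders (90 ms for batch 4 is made up); substitute the measured values:

```shell
# estimate engine throughput from trtexec's reported mean latency
# (batch=4 and mean_ms=90 are placeholder values; use your own figures)
batch=4
mean_ms=90
awk -v b="$batch" -v l="$mean_ms" 'BEGIN { printf "%.1f fps\n", b * 1000 / l }'
# -> 44.4 fps
```

If this figure is well above the pipeline fps seen in DeepStream, the bottleneck is likely elsewhere in the pipeline rather than in the engine itself.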

This is my config:
[application]
enable-perf-measurement=1
perf-measurement-interval-sec=5
#gie-kitti-output-dir=streamscl

# output display details

[tiled-display]
enable=1
rows=2
columns=2
width=1920
height=1080
gpu-id=0
#(0): nvbuf-mem-default - Default memory allocated, specific to particular platform
#(1): nvbuf-mem-cuda-pinned - Allocate Pinned/Host cuda memory, applicable for Tesla
#(2): nvbuf-mem-cuda-device - Allocate Device cuda memory, applicable for Tesla
#(3): nvbuf-mem-cuda-unified - Allocate Unified cuda memory, applicable for Tesla
#(4): nvbuf-mem-surface-array - Allocate Surface Array memory, applicable for Jetson
nvbuf-memory-type=0

# mp4 video source

[source0]
enable=1
#Type - 1=CameraV4L2 2=URI 3=MultiURI 4=RTSP
type=3
uri=file:///data/videos-test/RowdenCarpark2.mp4
num-sources=2
gpu-id=0
cudadec-memtype=0
source-id=0
camera-width=1280
camera-height=720

[source1]
enable=1
#Type - 1=CameraV4L2 2=URI 3=MultiURI
type=3
uri=file:///opt/nvidia/deepstream/deepstream-6.0/samples/streams/sample_1080p_h264.mp4
#uri=file:///home/tushar/sample_0_720p.mp4
num-sources=2
gpu-id=0
nvbuf-memory-type=0

# rtsp video source

[source2]
enable=0
type=4
#latency=30000
#drop-on-latency=false
#drop-frame-interval=3
buffer-size=5000000
uri=
cudadec-memtype=0
source-id=0

# rtsp video out

[sink0]
enable=1
#Type - 1=FakeSink 2=EglSink 3=File 4=RTSPStreaming
type=4
#1=h264 2=h265
codec=1
#encoder type 0=Hardware 1=Software
enc-type=0
sync=0
bitrate=10000000
#bitrate=2700000
#H264 Profile - 0=Baseline 2=Main 4=High
#H265 Profile - 0=Main 1=Main10
profile=0

# set below properties in case of RTSPStreaming

rtsp-port=8556
udp-port=5400
#source-id=0

# mp4 out

[sink1]
enable=1
type=3
#1=mp4 2=mkv
container=1
enc-type=0
#1=h264 2=h265 3=mpeg4

# only SW mpeg4 is supported right now

codec=1
sync=1
bitrate=4000000
profile=0
output-file=/data/videos-out/21112023_093556_RowdenCarpark2.mp4
source-id=0

[sink2]
enable=0
#Type - 1=FakeSink 2=EglSink 3=File 4=UDPSink 5=nvoverlaysink 6=MsgConvBroker
type=6
msg-conv-config=redis_msg_config.txt
#(0): PAYLOAD_DEEPSTREAM - Deepstream schema payload
#(1): PAYLOAD_DEEPSTREAM_MINIMAL - Deepstream schema payload minimal
#(256): PAYLOAD_RESERVED - Reserved type
#(257): PAYLOAD_CUSTOM - Custom schema payload
msg-conv-payload-type=0
msg-conv-msg2p-new-api=1
msg-conv-frame-interval=100
#msg-broker-proto-lib=/opt/nvidia/deepstream/deepstream-6.0/lib/libnvds_kafka_proto.so
msg-broker-proto-lib=/opt/nvidia/deepstream/deepstream-6.0/lib/libnvds_redis_proto.so
#Provide your msg-broker-conn-str here
msg-broker-conn-str=localhost;6379
#topic=deepstream_detection_messages
topic=metadata
#Optional:
msg-broker-config=/opt/nvidia/deepstream/deepstream/sources/libs/redis_protocol_adaptor/cfg_redis.txt

# on screen display

[osd]
enable=1
gpu-id=0
border-width=1
text-size=15
text-color=1;1;1;1;
text-bg-color=0.3;0.3;0.3;1
font=Arial
show-clock=0
clock-x-offset=800
clock-y-offset=820
clock-text-size=12
clock-color=1;0;0;0
nvbuf-memory-type=0

# stream mux - forms batches of frames from multiple input sources

[streammux]
gpu-id=0
##Boolean property to inform muxer that sources are live
live-source=1
batch-size=4
##time out in usec, to wait after the first buffer is available
##to push the batch even if the complete batch is not formed
batched-push-timeout=33333

# Set muxer output width and height

width=1280
height=720
#enable to maintain aspect ratio wrt source, and allow black borders, works
##along with width, height properties
enable-padding=0
nvbuf-memory-type=0

# If set to TRUE, system timestamp will be attached as ntp timestamp
# If set to FALSE, ntp timestamp from rtspsrc, if available, will be attached

attach-sys-ts-as-ntp=1

# primary gpu inference engine (model)

[primary-gie]
enable=1
bbox-border-color0=1;0;0;1
bbox-border-color1=0;1;1;1
bbox-border-color2=0;1;1;1
bbox-border-color3=0;1;0;1
nvbuf-memory-type=0
config-file=detector_config.txt

[tracker]
enable=1

# For the NvDCF and DeepSORT trackers, tracker-width and tracker-height must each be a multiple of 32

tracker-width=320
tracker-height=256
ll-lib-file=/opt/nvidia/deepstream/deepstream-6.0/lib/libnvds_nvmultiobjecttracker.so

# ll-config-file required to set different tracker types
# ll-config-file=/opt/nvidia/deepstream/deepstream-DEEPSTREAM_VER/samples/configs/deepstream-app/config_tracker_IOU.yml
ll-config-file=/opt/nvidia/deepstream/deepstream-6.0/samples/configs/deepstream-app/config_tracker_NvDCF_perf.yml
# ll-config-file=/opt/nvidia/deepstream/deepstream-DEEPSTREAM_VER/samples/configs/deepstream-app/config_tracker_NvDCF_accuracy.yml
# ll-config-file=/opt/nvidia/deepstream/deepstream-DEEPSTREAM_VER/samples/configs/deepstream-app/config_tracker_DeepSORT.yml

gpu-id=0
enable-batch-process=1
enable-past-frame=1
display-tracking-id=1

# secondary gpu inference engine (model)

[secondary-gie]
enable=0
gpu-id=0
batch-size=1

# 0=FP32, 1=INT8, 2=FP16 mode

nvbuf-memory-type=0
config-file=classifier_config.txt
gie-unique-id=2
operate-on-gie-id=1

[tests]
file-loop=0

And this is my detector_config.txt:
[property]
gpu-id=0
model-engine-file=/data/models/yolov4_coco_kxm.engine
batch-size=4
gie-unique-id=1
maintain-aspect-ratio=1
symmetric-padding=0
network-mode=0
process-mode=1
network-type=0
interval=4
engine-create-func-name=NvDsInferYoloCudaEngineGet
force-implicit-batch-dim=1

# from models.json

net-scale-factor=0.003921569790691137
labelfile-path=/data/labels/coco.txt
num-detected-classes=80
cluster-mode=3
#parse-bbox-func-name=NvDsInferParseCustomYoloV3
#custom-lib-path=/opt/nvidia/deepstream/deepstream-6.0/sources/objectDetector_Yolo/nvdsinfer_custom_impl_Yolo/libnvdsinfer_custom_impl_Yolo.so
#infer-dims=3;544;960
#output-blob-names=BatchedNMS
#parse-bbox-func-name=NvDsInferParseCustomBatchedNMSTLT
#custom-lib-path=/opt/nvidia/deepstream/deepstream-6.0/sources/deepstream_tlt_apps/post_processor/libnvds_infercustomparser_tlt.so
parse-bbox-func-name=NvDsInferParseYolo
custom-lib-path=/opt/nvidia/deepstream/deepstream-6.0/sources/DeepStream-Yolo/nvdsinfer_custom_impl_Yolo/libnvdsinfer_custom_impl_Yolo.so
#parse-bbox-func-name=NvDsInferParseCustomYoloV4
#custom-lib-path=/opt/nvidia/deepstream/deepstream-6.0/sources/objectDetector_Yolo/nvdsinfer_custom_impl_Yolo/libnvdsinfer_custom_impl_Yolo.so
#custom-lib-path=/yolo_deepstream/deepstream_yolo/nvdsinfer_custom_impl_Yolo/libnvdsinfer_custom_impl_Yolo.so
model-color-format=0

[class-attrs-all]
topk=20
nms-iou-threshold=0.5
roi-top-offset=0
roi-bottom-offset=0
detected-min-w=0
detected-min-h=0
detected-max-w=0
detected-max-h=0

# from models.json

pre-cluster-threshold=0.7
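As an aside, the long net-scale-factor constant above is just 1/255 (uint8 pixel range [0,255] scaled to [0,1]), stored with single-precision rounding. A quick check:

```shell
# the constant agrees with 1/255 to well within float32 precision
python3 -c "print(abs(1/255 - 0.003921569790691137) < 1e-8)"
# -> True
```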

Thanks in advance!

Environment

TensorRT Version:
v8.0.1
GPU Type:
Jetson NX
Nvidia Driver Version:
CUDA Version:
10.2
CUDNN Version:
Operating System + Version:
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Relevant Files

Please attach or include links to any models, data, files, or scripts necessary to reproduce your issue. (Github repo, Google Drive, Dropbox, etc.)

Steps To Reproduce

Please include:

  • Exact steps/commands to build your repro
  • Exact steps/commands to run your repro
  • Full traceback of errors encountered

Moving this to Deepstream Forum
Thanks

  1. Please refer to the topic on performance analysis, and the topic on fps checking.
  2. To narrow down this issue, please set a fakesink to check whether the output can get high fps.
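A minimal sketch of the fakesink suggestion, adapted from the poster's own [sink0] group (type values per the comments already in the config):

```
[sink0]
enable=1
# Type - 1=FakeSink 2=EglSink 3=File 4=RTSPStreaming
type=1
sync=0
```

With the encoder and RTSP streaming removed from the pipeline, the reported fps isolates decode + inference performance.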

I’ve trialled the fps improvements:

Using enable-perf-measurement=1
**PERF: 10.45 (10.44) 10.45 (10.46) 10.45 (10.46) 10.45 (10.46)

Using export NVDS_ENABLE_LATENCY_MEASUREMENT=1, I see:
BATCH-NUM = 0**
Batch meta not found for buffer 0x7f24121c30
BATCH-NUM = 1**
Batch meta not found for buffer 0x7f1401ac10

I’m setting batched-push-timeout to 1/max_fps (33333)
Height and width in streammux are set to the input video’s height and width
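The batched-push-timeout arithmetic checks out: the property is specified in microseconds, so for 30 fps sources it is 1e6 / max_fps:

```shell
# batched-push-timeout (microseconds) = 1,000,000 / max_fps
echo $((1000000 / 30))
# -> 33333
```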

Looking at jtop, the GPU usage appears to sit at >99% the majority of the time, and drops down once every few seconds

Setting qos=0 in sink0 appears to make no difference

One more thing. The bounding boxes also don’t print the object class and ID, which they do when I’m using the ultralytics yolov8 model with a batch size of 1 - how can I get this working?

Using just a fakesink block, nothing seems to change, but the latency measurement gives accurate results:

BATCH-NUM = 133**
Source id = 0 Frame_num = 133 Frame latency = 1701173680300.191895 (ms)
Source id = 3 Frame_num = 133 Frame latency = 1891.279785 (ms)
Source id = 2 Frame_num = 133 Frame latency = 1888.423828 (ms)
Source id = 1 Frame_num = 133 Frame latency = 1894.659912 (ms)

BATCH-NUM = 134**
Source id = 3 Frame_num = 134 Frame latency = 1701173680308.073975 (ms)
Source id = 2 Frame_num = 134 Frame latency = 1890.655029 (ms)
Source id = 0 Frame_num = 134 Frame latency = 1884.128906 (ms)
Source id = 1 Frame_num = 134 Frame latency = 1886.737061 (ms)
**PERF: 10.54 (10.70) 10.54 (10.70) 10.54 (10.70) 10.54 (10.70)

Here’s the full component latency measurements:
BATCH-NUM = 34**
Comp name = nvosd0 in_system_timestamp = 1701173818236.645020 out_system_timestamp = 1701173818238.068115 component latency= 1.423096
Comp name = osd_conv in_system_timestamp = 1701173818232.974121 out_system_timestamp = 1701173818236.451904 component latency= 3.477783
Comp name = tiled_display_tiler in_system_timestamp = 1701173818224.996094 out_system_timestamp = 1701173818231.325928 component latency= 6.329834
Comp name = tracking_tracker in_system_timestamp = 1701173818201.530029 out_system_timestamp = 1701173818217.559082 component latency= 16.029053
Comp name = primary_gie in_system_timestamp = 1701173818200.653076 out_system_timestamp = 1701173818201.501953 component latency= 0.848877
Comp name = nvstreammux-src_bin_muxer source_id = 3 pad_index = 3 frame_num = 34 in_system_timestamp = 1701173817771.239990 out_system_timestamp = 1701173818200.542969 component_latency = 429.302979
Comp name = nvv4l2decoder3 in_system_timestamp = 1701173816769.830078 out_system_timestamp = 1701173817739.690918 component latency= 969.860840
Comp name = nvstreammux-src_bin_muxer source_id = 2 pad_index = 2 frame_num = 33 in_system_timestamp = 1701173817767.345947 out_system_timestamp = 1701173818200.541992 component_latency = 433.196045
Comp name = nvv4l2decoder1 in_system_timestamp = 1701173816768.810059 out_system_timestamp = 1701173817738.277100 component latency= 969.467041
Comp name = nvstreammux-src_bin_muxer source_id = 1 pad_index = 1 frame_num = 33 in_system_timestamp = 1701173817772.541992 out_system_timestamp = 1701173818200.541992 component_latency = 428.000000
Comp name = nvv4l2decoder2 in_system_timestamp = 1701173816768.469971 out_system_timestamp = 1701173817736.697998 component latency= 968.228027
Comp name = nvstreammux-src_bin_muxer source_id = 0 pad_index = 0 frame_num = 33 in_system_timestamp = 1701173817769.280029 out_system_timestamp = 1701173818200.541016 component_latency = 431.260986
Comp name = nvv4l2decoder0 in_system_timestamp = 1701173816767.169922 out_system_timestamp = 1701173817735.011963 component latency= 967.842041
Source id = 3 Frame_num = 34 Frame latency = 1701173818238.372070 (ms)
Source id = 2 Frame_num = 33 Frame latency = 1468.541992 (ms)
Source id = 1 Frame_num = 33 Frame latency = 1469.562012 (ms)
Source id = 0 Frame_num = 33 Frame latency = 1469.902100 (ms)

It looks like the PGIE block is the issue

It is a performance issue. Please execute the following commands to improve performance. Please refer to the doc.

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

could you share the result of “ll /data/labels/coco.txt” and the label file coco.txt?

I’ve run the nvpmodel and jetson_clocks commands now and retested, but I’m getting the same result (10-15 fps)

I did sudo chmod 777 -R /data/labels/coco.txt
-rwxrwxrwx 1 1000 1000 621 Nov 24 16:59 /data/labels/coco.txt*
and reran, but got the same result (no bounding box labels)

could you share the label file coco.txt? maybe nvinfer failed to parse that file.

labels.txt (621 Bytes)

I’m actually having the same issue with the sample resnet10 model using /opt/nvidia/deepstream/deepstream-6.0/samples/models/Primary_Detector/labels.txt and /opt/nvidia/deepstream/deepstream-6.0/samples/models/Primary_Detector/resnet10.caffemodel
But it works fine with ultralytics yolov8 using this same label file

labels.txt is fine. Can you provide the whole project (model, cfg)? I will have a try. nvinfer and the low-level lib are open source; you can add logs to check if interested.

Thanks, the configs are pasted above, and the model is here: https://github.com/onnx/models/blob/main/vision/object_detection_segmentation/yolov4/model/yolov4.onnx

Could you share libnvdsinfer_custom_impl_Yolo.so, which includes NvDsInferParseCustomYoloV4? You can use the forum private email: please click forum avatar -> email.

I’m using NvDsInferParseYolo from Deepstream-Yolo (GitHub - marcoslucianops/DeepStream-Yolo: NVIDIA DeepStream SDK 6.3 / 6.2 / 6.1.1 / 6.1 / 6.0.1 / 6.0 / 5.1 implementation for YOLO models). I’ll email you

I can’t see an email option for you. I’ve built libnvdsinfer_custom_impl_Yolo.so using the instructions here for DS6.0/CUDA10.2

Testing the yolov4.onnx model you shared in the DeepStream-Yolo project, I can’t get bboxes. Please help check whether the preprocessing parameters are correct.
config_infer_primary_yoloV4.txt (1.2 KB)
labels.txt (621 Bytes)

try net-scale-factor=0.003921569790691137?

Nonetheless, I think ONNX models converted from TF won’t work without a custom parser/conversion to the NCHW format, because of this: How to resolve the error: RGB/BGR input format specified but network input channels is not 3

So I’ll close this topic and open a new one for the labelled bounding boxes issue