Frame rate drops with more than 5 RTSP streams on a single GPU

I have been benchmarking several models in DeepStream on RTSP streams. The results indicate that I cannot run more than 5 real-time streams without a drop in frame rate, and the drop becomes significant as I add more streams.

The model I am using is ResNet-10, although the same behavior is observed with the custom YOLO implementation provided with DeepStream.

With ResNet-10, the frame rate drops from 30 FPS per stream (near real time) with 1 RTSP stream to 18 FPS per stream with 10 streams. With YOLO, it drops from 30 to 8 FPS per stream as the number of streams goes from 1 to 10.

Here is the DeepStream config file:
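For reference, the per-stream figures above can be turned into aggregate throughput. This is only a rough sketch using the numbers quoted in this post; the helper name `aggregate_fps` is made up for illustration. A live 30 FPS camera cannot deliver more than 30 FPS, so the single-stream figure probably reflects the source rate rather than what the GPU can do.

```python
# Per-stream FPS figures quoted in the post (model -> {streams: FPS per stream}).
reported = {
    "ResNet-10": {1: 30, 10: 18},
    "YOLO": {1: 30, 10: 8},
}

def aggregate_fps(per_stream_fps, num_streams):
    """Total frames per second processed across all streams."""
    return per_stream_fps * num_streams

for model, runs in reported.items():
    for streams, fps in sorted(runs.items()):
        total = aggregate_fps(fps, streams)
        print(f"{model}: {streams} stream(s) x {fps} FPS = {total} FPS total")
```

Read this way, the total still rises from 30 to 180 FPS for ResNet-10 and from 30 to 80 FPS for YOLO, even though each individual stream falls below real time.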

[application]
enable-perf-measurement=1
perf-measurement-interval-sec=2
kitti-track-output-dir=/nvme/test/metadata_fahad_rtsp_1

#gie-kitti-output-dir=streamscl

[tiled-display]
enable=0
rows=1
columns=1
width=1280
height=720
gpu-id=2
#(0): nvbuf-mem-default - Default memory allocated, specific to particular platform
#(1): nvbuf-mem-cuda-pinned - Allocate Pinned/Host cuda memory
#(2): nvbuf-mem-cuda-device - Allocate Device cuda memory
#(3): nvbuf-mem-cuda-unified - Allocate Unified cuda memory
#(4): nvbuf-mem-surface-array - Allocate Surface Array memory, applicable for Jetson
#(5): nvbuf-mem-handle - Allocate Surface Handle memory, applicable for Jetson
#(6): nvbuf-mem-system - Allocate Surface System memory, allocated using calloc
nvbuf-memory-type=0

[source0]
enable=1
#Type - 1=CameraV4L2 2=URI 3=MultiURI 4=RTSP
type=4
#uri=file:/dfs/AutomationWorkspace/EncodedVideos/20191016-150001/camera16/cam16Concat_28fps.mp4
uri=rtsp://153.64.131.17/stream
gpu-id=2
# (0): memtype_device   - Memory type Device
# (1): memtype_pinned   - Memory type Host Pinned
# (2): memtype_unified  - Memory type Unified
cudadec-memtype=0

[sink0]
enable=1
type=1
#1=mp4 2=mkv
#1=h264 2=h265 3=mpeg4
## only SW mpeg4 is supported right now.
qos=0
sync=0
gpu-id=2
iframeinterval=10
output-file=/software/Video_Output_Fahad/Out_RTSP_0.mp4
container=1
codec=3
source-id=0

#end

[osd]
enable=1
gpu-id=2
border-width=1
text-size=20
text-color=1;1;1;1;
text-bg-color=0.3;0.3;0.3;1
font=Arial
process-mode=1
show-clock=0
clock-x-offset=800
clock-y-offset=820
clock-text-size=12
clock-color=1;0;0;0
nvbuf-memory-type=0

[streammux]
gpu-id=2
##Boolean property to inform muxer that sources are live
live-source=1
batch-size=4
##time out in usec, to wait after the first buffer is available
##to push the batch even if the complete batch is not formed
batched-push-timeout=1000
## Set muxer output width and height
width=1280
height=720
#num-surfaces-per-frame=31
##Enable to maintain aspect ratio wrt source, and allow black borders, works
##along with width, height properties
enable-padding=0
nvbuf-memory-type=0

# config-file property is mandatory for any gie section.
# Other properties are optional and if set will override the properties set in
# the infer config file.
[primary-gie]
enable=1
gpu-id=2
#model-engine-file=model_b4_int8.engine
labelfile-path=labels.txt
batch-size=4
#Required by the app for OSD, not a plugin property
bbox-border-color0=1;0;0;1
bbox-border-color1=0;1;1;1
bbox-border-color2=0;0;1;1
bbox-border-color3=0;1;0;1
interval=1
gie-unique-id=1
nvbuf-memory-type=0
config-file=config_infer_primary_yoloV3_Fahad.txt

[tracker]
enable=0
tracker-width=320
tracker-height=180
#ll-lib-file=/usr/local/deepstream/libnvds_mot_iou.so
#ll-lib-file=/opt/nvidia/deepstream/deepstream-4.0/lib/libnvds_mot_klt.so
#ll-lib-file=/usr/local/deepstream/libnvds_mot_klt.so
#ll-lib-file=/usr/local/deepstream/libnvds_tracker.so
ll-lib-file=/opt/nvidia/deepstream/deepstream-4.0/lib/libnvds_nvdcf.so
#ll-config-file required for IOU only
ll-config-file=/root/deepstream_sdk_v4.0_x86_64/samples/configs/deepstream-app/tracker_config.yml
#ll-config-file=iou_config.txt
gpu-id=2
enable-batch-process=1


[tests]
file-loop=0
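The file above shows a single source with a batch size of 4. A multi-stream run presumably needs one [sourceN] section per stream, so a small generator can keep the sections and batch sizes in step. This is a minimal sketch, not the config actually used for the 10-stream runs: the function name and the example.com URIs are hypothetical, and matching both batch sizes to the stream count is an assumption rather than something confirmed in this thread.

```python
def multi_source_sections(uris, gpu_id=2):
    """Build [sourceN] sections and batch-size overrides for a list of RTSP URIs."""
    lines = []
    for i, uri in enumerate(uris):
        lines += [
            f"[source{i}]",
            "enable=1",
            "type=4",  # same source type as the posted config
            f"uri={uri}",
            f"gpu-id={gpu_id}",
            "cudadec-memtype=0",
            "",
        ]
    n = len(uris)
    # Assumption: muxer and primary GIE batch sizes follow the stream count.
    # Only the keys that change are emitted; the rest stays as posted.
    lines += ["[streammux]", f"batch-size={n}", "", "[primary-gie]", f"batch-size={n}"]
    return "\n".join(lines)

# Placeholder URIs for illustration only.
uris = [f"rtsp://camera{i}.example.com/stream" for i in range(10)]
print(multi_source_sections(uris))
```

The printed [streammux] and [primary-gie] fragments are meant to be merged into the existing sections by hand, not pasted in as complete sections.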

And here is the inference config file:

[property]
net-scale-factor=1
#0=RGB, 1=BGR
model-color-format=0
custom-network-config=/root/deepstream_sdk_v4.0_x86_64/sources/objectDetector_Yolo/yolov3.cfg
model-file=/root/deepstream_sdk_v4.0_x86_64/sources/objectDetector_Yolo/yolo-obj_20000.weights
#model-engine-file=model_b1_int8.engine
labelfile-path=labels.txt
#int8-calib-file=yolov3-calibration.table.trt5.1
## 0=FP32, 1=INT8, 2=FP16 mode
network-mode=2
num-detected-classes=80
gie-unique-id=1
is-classifier=0
maintain-aspect-ratio=1
parse-bbox-func-name=NvDsInferParseCustomYoloV3
custom-lib-path=/root/deepstream_sdk_v4.0_x86_64/sources/objectDetector_Yolo/nvdsinfer_custom_impl_Yolo/libnvdsinfer_custom_impl_Yolo.so

These numbers do not match the throughput claimed for DeepStream. Is there a problem with my config files?

Hi,
It looks like you are using an x86 PC with a dGPU. Please provide information about your machine: Tesla P4 or GeForce 1080,…

Also, does it happen when running 5 local video file sources?

What GPU are you running on?
And can you use INT8 instead of FP16?

Thanks!

We are using a Linux-based x86_64 system with a Tesla V100 GPU.
And yes, it also happens with local video file sources. Using ResNet-10, the frame rate drops from 70 FPS per stream to 40 when we increase the number of sources from 1 to 4.

INT8 leads to significant accuracy degradation, so we want to stick with FP16 for now.
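The figures in this thread can also be expressed as a rough real-time stream budget. The sketch below assumes that aggregate throughput stays roughly constant as streams are added, which nothing in this thread establishes; the function name is hypothetical.

```python
REAL_TIME_FPS = 30  # per-stream real-time target quoted in the original post

def realtime_stream_budget(per_stream_fps, num_streams, target_fps=REAL_TIME_FPS):
    """Number of streams the measured aggregate throughput could serve at target_fps."""
    aggregate = per_stream_fps * num_streams
    return aggregate // target_fps

# File sources, ResNet-10: 4 streams at 40 FPS each (160 FPS aggregate).
print(realtime_stream_budget(40, 4))
# RTSP sources, ResNet-10: 10 streams at 18 FPS each (180 FPS aggregate).
print(realtime_stream_budget(18, 10))
```

Under that assumption the two measurements give a budget of 5 and 6 streams respectively, which is in line with the reported limit of about 5 real-time streams.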

Regarding ResNet-10, are you referring to the resnet10 network at /opt/nvidia/deepstream/deepstream-4.0/samples/models/Primary_Detector/resnet10.prototxt?

Yes, exactly.

We don’t have a V100 on hand. Can you use the TensorRT tool trtexec to profile resnet10.prototxt on your device?
The command looks like this:

$ trtexec --deploy=resnet10.prototxt --output="conv2d_cov/Sigmoid" --batch=10 --fp16 --workspace=2048

You can find trtexec in the TensorRT package.
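To compare several batch sizes, the same command can be generated in a loop. This is a minimal sketch that only builds the command lines, using the flags from the command above; the function names are hypothetical, and it does not run trtexec or parse its output. It needs Python 3.8 or later for `shlex.join`.

```python
import shlex

def trtexec_command(batch, prototxt="resnet10.prototxt",
                    output="conv2d_cov/Sigmoid", workspace_mb=2048):
    """Return the argv list for the trtexec call above, for one batch size."""
    return [
        "trtexec",
        f"--deploy={prototxt}",
        f"--output={output}",
        f"--batch={batch}",
        "--fp16",
        f"--workspace={workspace_mb}",
    ]

def frames_per_second(batch, batch_latency_ms):
    """Frames per second implied by one batch completing every batch_latency_ms."""
    return batch * 1000.0 / batch_latency_ms

# Print one command per batch size to try.
for batch in (1, 4, 10):
    print(shlex.join(trtexec_command(batch)))
```

Once trtexec reports an average latency per batch, `frames_per_second` converts it to throughput. As a purely hypothetical illustration, a 50 ms average at batch 10 would correspond to 200 frames per second for inference alone, before decoding and the rest of the pipeline.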