I am having a performance issue running YoloV3 with DeepStream. When running my pipeline on a Tesla T4 I get about 1000 FPS with the model /opt/nvidia/deepstream/deepstream-6.0/samples/models/Primary_Detector/resnet10.caffemodel, while with YoloV3 I get only about 30 FPS. I know that YoloV3 is a bigger model and that I can't expect 1000 FPS, but 30 FPS still seems too low for a T4.
Looking at nvidia-smi dmon:
# gpu pwr gtemp mtemp sm mem enc dec mclk pclk
# Idx W C C % % % % MHz MHz
0 70 75 - 100 27 0 0 5000 780
0 69 75 - 100 37 0 0 5000 825
0 70 75 - 100 42 0 0 5000 810
0 70 75 - 100 39 0 1 5000 795
0 69 75 - 100 42 0 0 5000 675
0 71 75 - 100 45 0 2 5000 780
0 67 76 - 100 54 0 0 5000 795
0 65 76 - 100 44 0 3 5000 780
0 67 76 - 100 44 0 0 5000 855
0 67 76 - 100 32 0 9 5000 900
0 70 76 - 100 37 0 0 5000 840
0 67 76 - 100 18 0 26 5000 870
0 71 76 - 100 34 0 0 5000 855
Is sm the CUDA core utilization?
Here's how you can reproduce the issue. I am using the official DeepStream Docker container with the Python bindings installed.
Here's how I set up YoloV3 inside the container:
mkdir -p /src/models/yoloV3
cd /src/models/yoloV3
# Download data
wget https://raw.githubusercontent.com/pjreddie/darknet/master/cfg/yolov3.cfg
wget https://pjreddie.com/media/files/yolov3.weights
# Copy calibration from sample
cp /opt/nvidia/deepstream/deepstream-6.0/sources/objectDetector_Yolo/yolov3-calibration.table.trt7.0 /src/models/yoloV3
# Build plugin
cd /opt/nvidia/deepstream/deepstream-6.0/sources/objectDetector_Yolo/
wget https://forums.developer.nvidia.com/uploads/short-url/oezjVVUIuYdfJ8BdTJWNmIBVawl.patch -O DS6.0GA_objectDetector_Yolo_perf_regression.patch
patch -p1 < DS6.0GA_objectDetector_Yolo_perf_regression.patch
CUDA_VER=11.4 make -C nvdsinfer_custom_impl_Yolo
mkdir /src/models/yoloV3/nvdsinfer_custom_impl_Yolo
cp /opt/nvidia/deepstream/deepstream-6.0/sources/objectDetector_Yolo/nvdsinfer_custom_impl_Yolo/libnvdsinfer_custom_impl_Yolo.so /src/models/yoloV3/nvdsinfer_custom_impl_Yolo
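After the build I quickly sanity-check that the custom parser library exists and can actually be loaded, with a small helper of my own (the name `plugin_ready` is mine; the path matches the cp step above):

```python
import ctypes
import os

def plugin_ready(path):
    """Return True if the custom parser library exists and can be dlopen'ed."""
    if not os.path.isfile(path):
        return False
    try:
        # CDLL raises OSError if the library or one of its deps is missing
        ctypes.CDLL(path)
        return True
    except OSError:
        return False

print(plugin_ready(
    "/src/models/yoloV3/nvdsinfer_custom_impl_Yolo/libnvdsinfer_custom_impl_Yolo.so"))
```

Inside the container this prints True for me, which at least rules out a silently broken plugin build.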
My YoloV3 configuration is the following (copied from the DeepStream 6 container):
################################################################################
# Copyright (c) 2019-2021, NVIDIA CORPORATION. All rights reserved.
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.
################################################################################
# Following properties are mandatory when engine files are not specified:
# int8-calib-file(Only in INT8), model-file-format
# Caffemodel mandatory properties: model-file, proto-file, output-blob-names
# UFF: uff-file, input-dims, uff-input-blob-name, output-blob-names
# ONNX: onnx-file
#
# Mandatory properties for detectors:
# num-detected-classes
#
# Optional properties for detectors:
# cluster-mode(Default=Group Rectangles), interval(Primary mode only, Default=0)
# custom-lib-path
# parse-bbox-func-name
#
# Mandatory properties for classifiers:
# classifier-threshold, is-classifier
#
# Optional properties for classifiers:
# classifier-async-mode(Secondary mode only, Default=false)
#
# Optional properties in secondary mode:
# operate-on-gie-id(Default=0), operate-on-class-ids(Defaults to all classes),
# input-object-min-width, input-object-min-height, input-object-max-width,
# input-object-max-height
#
# Following properties are always recommended:
# batch-size(Default=1)
#
# Other optional properties:
# net-scale-factor(Default=1), network-mode(Default=0 i.e FP32),
# model-color-format(Default=0 i.e. RGB) model-engine-file, labelfile-path,
# mean-file, gie-unique-id(Default=0), offsets, process-mode (Default=1 i.e. primary),
# custom-lib-path, network-mode(Default=0 i.e FP32)
#
# The values in the config file are overridden by values set through GObject
# properties.
[property]
gpu-id=0
net-scale-factor=0.0039215697906911373
#0=RGB, 1=BGR
model-color-format=0
custom-network-config=/src/models/yoloV3/yolov3.cfg
model-file=/src/models/yoloV3/yolov3.weights
#model-engine-file=yolov3_b1_gpu0_int8.engine
labelfile-path=/src/src_deepstream/components/models/yolov3/labels.txt
int8-calib-file=/src/models/yoloV3/yolov3-calibration.table.trt7.0
## 0=FP32, 1=INT8, 2=FP16 mode
network-mode=1
num-detected-classes=80
gie-unique-id=1
network-type=0
is-classifier=0
## 1=DBSCAN, 2=NMS, 3= DBSCAN+NMS Hybrid, 4 = None(No clustering)
cluster-mode=2
maintain-aspect-ratio=1
parse-bbox-func-name=NvDsInferParseCustomYoloV3
custom-lib-path=/src/models/yoloV3/nvdsinfer_custom_impl_Yolo/libnvdsinfer_custom_impl_Yolo.so
engine-create-func-name=NvDsInferYoloCudaEngineGet
#scaling-filter=0
#scaling-compute-hw=0
[class-attrs-all]
nms-iou-threshold=0.3
threshold=0.7
Here's my pipeline. The code is inspired by the rtsp_in_rtsp_out example from the official Python examples. The function bus_call and the class GETFPS can be found in the common folder of the Python bindings; for your convenience I copied their code below.
The pipeline is basically video decoding + streammux + primary GPU inference engine. At the beginning of the code you can find a simple class to count FPS. Again, most of the code is taken from the official rtsp_in_rtsp_out example. As you can see, the batch size is set to the number of sources.
import os
import sys
sys.path.append("../")
import pyds
from common.bus_call import bus_call
import math
import gi
gi.require_version("Gst", "1.0")
gi.require_version("GstRtspServer", "1.0")
from gi.repository import GObject, Gst, GstRtspServer
import argparse
import configparser
from common.FPS import GETFPS
from collections import defaultdict
import time
import multiprocessing
import threading
# Counter to count FPS
class Counter:
lock = threading.Lock()
count = defaultdict(int)
started = multiprocessing.Value("d", time.time())
@classmethod
def print_fps_and_reset(cls):
with cls.lock:
# Compute fps
tot_fps = round(sum(cls.count.values()) / (time.time() - cls.started.value), 2)
avg_fps = round(tot_fps / 54) if cls.count else 0  # 54 = number of streams
streams_up = len(cls.count)
# Reset
cls.count = defaultdict(int)
cls.started.value = time.time()
print(f"TOT FPS {tot_fps} AVG FPS {avg_fps} STREAMS UP {streams_up}")
# Start monitoring process to print FPS
def monitor(Counter):
while True:
Counter.print_fps_and_reset()
time.sleep(5)
monitoring_thread = threading.Thread(target=monitor, daemon=True, args=(Counter,))
monitoring_thread.start()
# Manager for the Deepstream pipeline
class PipelineManager:
def __init__(self):
# Load cameras configuration
url = "your rtsp camera"
number_of_streams = 54
self.fps_streams = dict()
for index in range(number_of_streams):
self.fps_streams["stream{0}".format(index)] = GETFPS(url)
number_sources = len(self.fps_streams)
print(f"Number of streams {number_sources}")
# Standard GStreamer initialization
GObject.threads_init()
Gst.init(None)
# Create Pipeline element that will form a connection of other elements
print("Creating Pipeline \n")
pipeline = Gst.Pipeline()
if not pipeline:
sys.stderr.write(" Unable to create Pipeline \n")
# Create nvstreammux instance to form batches from one or more sources.
print("Creating nvstreammux \n")
streammux = Gst.ElementFactory.make("nvstreammux", "Stream-muxer")
if not streammux:
sys.stderr.write(" Unable to create NvStreamMux \n")
pipeline.add(streammux)
for index in range(number_of_streams):
# sourcebin (source pad) -> (sinkpad) streammux ->
# Source bin
print("Creating source_bin ", index, " \n ")
source_bin = self.create_source_bin(index, url)
if not source_bin:
sys.stderr.write("Unable to create source bin \n")
pipeline.add(source_bin)
# Sink pad
padname = "sink_%u" % index
sinkpad = streammux.get_request_pad(padname)
if not sinkpad:
sys.stderr.write("Unable to create sink pad bin \n")
# Source pad
srcpad = source_bin.get_static_pad("src")
if not srcpad:
sys.stderr.write("Unable to create src pad bin \n")
srcpad.link(sinkpad)
# Primary GPU Inference Engine (pgie)
print("Creating Pgie \n ")
pgie = Gst.ElementFactory.make("nvinfer", "primary-inference")
if not pgie:
sys.stderr.write(" Unable to create pgie \n")
# Fakesink
sink = Gst.ElementFactory.make("fakesink", "fakesink")
batch_size = number_sources
streammux.set_property("width", 1920)
streammux.set_property("height", 1080)
streammux.set_property("batch-size", batch_size)
streammux.set_property("batched-push-timeout", 4000000)
streammux.set_property("live-source", 1)  # sources are live RTSP streams
pgie.set_property("config-file-path", "src_deepstream/components/models/yolov3/config_infer_primary_yoloV3.txt")
pgie.set_property("batch-size", batch_size)
pipeline.add(pgie)
pipeline.add(sink)
streammux.link(pgie)
pgie.link(sink)
# create an event loop and feed gstreamer bus messages to it
loop = GObject.MainLoop()
bus = pipeline.get_bus()
bus.add_signal_watch()
bus.connect("message", bus_call, loop)
# Probe to get metadata
tiler_src_pad = pgie.get_static_pad("src")
if not tiler_src_pad:
sys.stderr.write(" Unable to get src pad \n")
else:
tiler_src_pad.add_probe(Gst.PadProbeType.BUFFER, self.tiler_src_pad_buffer_probe, 0)
# Start play back and listen to events
print("Starting pipeline \n")
pipeline.set_state(Gst.State.PLAYING)
try:
loop.run()
except KeyboardInterrupt:
pass
except Exception as e:
print("Exception", str(e))
# cleanup
pipeline.set_state(Gst.State.NULL)
def tiler_src_pad_buffer_probe(self, pad, info, u_data):
"""
tiler_sink_pad_buffer_probe will extract metadata received on OSD sink pad
and update params for drawing rectangle, object information etc.
"""
PGIE_CLASS_ID_VEHICLE = 0
PGIE_CLASS_ID_BICYCLE = 1
PGIE_CLASS_ID_PERSON = 2
PGIE_CLASS_ID_ROADSIGN = 3
frame_number = 0
num_rects = 0
gst_buffer = info.get_buffer()
if not gst_buffer:
print("Unable to get GstBuffer ")
return
# Retrieve batch metadata from the gst_buffer
# Note that pyds.gst_buffer_get_nvds_batch_meta() expects the
# C address of gst_buffer as input, which is obtained with hash(gst_buffer)
batch_meta = pyds.gst_buffer_get_nvds_batch_meta(hash(gst_buffer))
l_frame = batch_meta.frame_meta_list
while l_frame is not None:
try:
# Note that l_frame.data needs a cast to pyds.NvDsFrameMeta
# The casting is done by pyds.NvDsFrameMeta.cast()
# The casting also keeps ownership of the underlying memory
# in the C code, so the Python garbage collector will leave
# it alone.
frame_meta = pyds.NvDsFrameMeta.cast(l_frame.data)
except StopIteration:
break
with Counter.lock:
Counter.count[frame_meta.pad_index] += 1
frame_number = frame_meta.frame_num
l_obj = frame_meta.obj_meta_list
num_rects = frame_meta.num_obj_meta
obj_counter = {
PGIE_CLASS_ID_VEHICLE: 0,
PGIE_CLASS_ID_PERSON: 0,
PGIE_CLASS_ID_BICYCLE: 0,
PGIE_CLASS_ID_ROADSIGN: 0,
}
while l_obj is not None:
try:
# Casting l_obj.data to pyds.NvDsObjectMeta
obj_meta = pyds.NvDsObjectMeta.cast(l_obj.data)
except StopIteration:
break
obj_counter[obj_meta.class_id] += 1
try:
l_obj = l_obj.next
except StopIteration:
break
# print(
# "Stream Number=",
# frame_meta.pad_index,
# "Frame Number=",
# frame_number,
# "Number of Objects=",
# num_rects,
# "Vehicle_count=",
# obj_counter[self.PGIE_CLASS_ID_VEHICLE],
# "Person_count=",
# obj_counter[self.PGIE_CLASS_ID_PERSON],
# )
# Get frame rate through this probe
self.fps_streams["stream{0}".format(frame_meta.pad_index)].get_fps()
try:
l_frame = l_frame.next
except StopIteration:
break
return Gst.PadProbeReturn.OK
def cb_newpad(self, decodebin, decoder_src_pad, data):
print("In cb_newpad\n")
caps = decoder_src_pad.get_current_caps()
gststruct = caps.get_structure(0)
gstname = gststruct.get_name()
source_bin = data
features = caps.get_features(0)
# Need to check if the pad created by the decodebin is for video and not
# audio.
print("gstname=", gstname)
if gstname.find("video") != -1:
# Link the decodebin pad only if decodebin has picked nvidia
# decoder plugin nvdec_*. We do this by checking if the pad caps contain
# NVMM memory features.
print("features=", features)
if features.contains("memory:NVMM"):
# Get the source bin ghost pad
bin_ghost_pad = source_bin.get_static_pad("src")
if not bin_ghost_pad.set_target(decoder_src_pad):
sys.stderr.write(
"Failed to link decoder src pad to source bin ghost pad\n"
)
else:
sys.stderr.write(
" Error: Decodebin did not pick nvidia decoder plugin.\n")
def decodebin_child_added(self, child_proxy, Object, name, user_data):
print("Decodebin child added:", name, "\n")
if name.find("decodebin") != -1:
Object.connect("child-added", self.decodebin_child_added, user_data)
def create_source_bin(self, id, uri):
print("Creating source bin")
# Create a source GstBin to abstract this bin's content from the rest of the
# pipeline
bin_name = "source-bin-%02d" % id
print(bin_name)
nbin = Gst.Bin.new(bin_name)
if not nbin:
sys.stderr.write(" Unable to create source bin \n")
# Source element for reading from the uri.
# We will use decodebin and let it figure out the container format of the
# stream and the codec and plug the appropriate demux and decode plugins.
uri_decode_bin = Gst.ElementFactory.make("uridecodebin", "uri-decode-bin")
if not uri_decode_bin:
sys.stderr.write(" Unable to create uri decode bin \n")
# We set the input uri to the source element
uri_decode_bin.set_property("uri", uri)
# Connect to the "pad-added" signal of the decodebin which generates a
# callback once a new pad for raw data has been created by the decodebin
uri_decode_bin.connect("pad-added", self.cb_newpad, nbin)
uri_decode_bin.connect("child-added", self.decodebin_child_added, nbin)
# We need to create a ghost pad for the source bin which will act as a proxy
# for the video decoder src pad. The ghost pad will not have a target right
# now. Once the decode bin creates the video decoder and generates the
# cb_newpad callback, we will set the ghost pad target to the video decoder
# src pad.
Gst.Bin.add(nbin, uri_decode_bin)
bin_pad = nbin.add_pad(
Gst.GhostPad.new_no_target(
"src", Gst.PadDirection.SRC))
if not bin_pad:
sys.stderr.write(" Failed to add ghost pad in source bin \n")
return None
return nbin
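To rule out my own FPS accounting as the culprit, here is the same math Counter.print_fps_and_reset does, in isolation (my standalone sketch, with the hardcoded 54 streams made a parameter):

```python
import time
from collections import defaultdict

def compute_fps(count, started, n_streams, now=None):
    """Replicates Counter.print_fps_and_reset: total frames/second summed
    across all streams, plus the per-stream average."""
    now = now if now is not None else time.time()
    tot_fps = round(sum(count.values()) / (now - started), 2)
    avg_fps = round(tot_fps / n_streams) if count else 0
    return tot_fps, avg_fps

# 54 streams each delivering 150 frames over a 5 second window
frames = defaultdict(int, {i: 150 for i in range(54)})
print(compute_fps(frames, started=0.0, n_streams=54, now=5.0))  # → (1620.0, 30)
```

So an average of 30 FPS per stream corresponds to 1620 FPS total through the pipeline; the numbers I report above are the per-stream average.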
bus_call.py (copied from deepstream examples)
################################################################################
# SPDX-FileCopyrightText: Copyright (c) 2019-2021 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
################################################################################
import gi
import sys
gi.require_version('Gst', '1.0')
from gi.repository import GObject, Gst
# from utils.debug import get_attributes, get_methods  # local debug helpers, only used in the commented-out lines below
def bus_call(bus, message, loop):
t = message.type
if t == Gst.MessageType.EOS:
sys.stdout.write("End-of-stream\n")
loop.quit()
elif t == Gst.MessageType.WARNING:
err, debug = message.parse_warning()
sys.stderr.write("Warning: %s: %s\n" % (err, debug))
elif t == Gst.MessageType.ERROR:
err, debug = message.parse_error()
sys.stderr.write("Error: %s: %s\n" % (err, debug))
# print(type(err))
# get_attributes(message.src.get_name())
# get_methods(err)
# Camera disconnected
if str(err) == "gst-resource-error-quark: Could not read from resource. (9)":
# TODO: when a camera is disconnected do something
# Do not stop application
return True
loop.quit()
return True
FPS.py (copied from deepstream examples)
################################################################################
# (same Apache-2.0 SPDX header as in bus_call.py above)
################################################################################
import time
start_time=time.time()
frame_count=0
class GETFPS:
def __init__(self,stream_id):
global start_time
self.start_time=start_time
self.is_first=True
global frame_count
self.frame_count=frame_count
self.stream_id=stream_id
def get_fps(self):
end_time=time.time()
if(self.is_first):
self.start_time=end_time
self.is_first=False
if(end_time-self.start_time>5):
# print("**********************FPS*****************************************")
# print("Fps of stream",self.stream_id,"is ", float(self.frame_count)/5.0)
self.frame_count=0
self.start_time=end_time
else:
self.frame_count=self.frame_count+1
def print_data(self):
print('frame_count=',self.frame_count)
print('start_time=',self.start_time)
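One detail I noticed while reading GETFPS: the per-stream rate it would print is frame_count / 5.0, i.e. it divides by the nominal window length, not the actual elapsed time. A standalone version of the same logic (my own sketch) shows what it computes:

```python
def stream_fps(frame_count, window_start, now, window=5.0):
    """Per-stream FPS the way GETFPS computes it: frames counted in the
    window divided by the nominal window length, not the elapsed time."""
    if now - window_start > window:
        return float(frame_count) / window
    return None  # window not closed yet, GETFPS keeps counting

print(stream_fps(frame_count=150, window_start=0.0, now=5.2))  # → 30.0
```

Because it divides by a fixed 5 s even when 5.2 s have elapsed, it can slightly overestimate, but that cannot account for a 30-vs-1000 FPS gap.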
Note: using Yolov3-tiny or another model would not solve the issue, since I believe the problem is not related to Yolo itself but rather to some wrong usage/implementation, either on my side or in DeepStream.
You can reproduce the performance issue using the Python example deepstream_rtsp_in_rtsp_out.
I have two questions:
- Do you see anything wrong in the configuration? Any idea why I get such bad performance? The weird part is that the GPU utilization is really high. Am I computing the FPS the wrong way?
- I don't think this is related, but YoloV3 has an input height and width of 608x608. Should I set these values for streammux too? How are the height and width of streammux related to those of the model? Should they be equal?
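To make the second question concrete, here is the resize math as I understand it (my own sketch, not taken from the DeepStream source): with maintain-aspect-ratio=1, a 1920x1080 frame from streammux should be scaled to fit inside the 608x608 network input and padded.

```python
def letterbox_size(src_w, src_h, net_w, net_h):
    """Scale the source frame to fit inside the network input while keeping
    the aspect ratio; the remaining area is expected to be padded."""
    scale = min(net_w / src_w, net_h / src_h)
    return round(src_w * scale), round(src_h * scale)

print(letterbox_size(1920, 1080, 608, 608))  # → (608, 342)
```

So the question boils down to whether setting streammux to 1920x1080 forces an extra scaling step before nvinfer scales the frame again down to 608x608, and whether that costs anything.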