Unstable performance

Environment

TensorRT Version: 7.2.3
GPU Type: RTX 3090 24 GB
Nvidia Driver Version: 460.73.01
CUDA Version: 11.1
CUDNN Version: 8.1.1
Operating System + Version: Ubuntu 18.04

Description

I have two yolov4-csp models. The first is the official model from the darknet project, trained on 80 classes. The second I trained myself on a single class (person only). I successfully built FP16 engines with batch=8 from both of them.

I test both models on the following videos (partial ffprobe output):

Video 1:
Video: h264 (High), yuv420p, 1920x1080, 30 fps, 30 tbr, 1k tbn, 60 tbc (default)
Video 2:
Video: h264 (Main), yuvj420p(pc, bt709), 1920x1080 [SAR 1:1 DAR 16:9], 30.30 fps, 29.97 tbr, 1k tbn, 59.94 tbc (default)
Video 3:
Video: h264 (High) (avc1 / 0x31637661), yuv420p, 1280x720, 2499 kb/s, 30 fps, 30 tbr, 15360 tbn, 60 tbc (default)

The detection quality of my model (trained on 1 class) is very good for all videos. My issue is the performance of my model on video3: the FPS is very unstable, jumping from 40 FPS to 85 FPS to 33 FPS to 84 FPS:

**PERF:  FPS 0 (Avg)	FPS 1 (Avg)	FPS 2 (Avg)	FPS 3 (Avg)	FPS 4 (Avg)	FPS 5 (Avg)	FPS 6 (Avg)	FPS 7 (Avg)	
**PERF:  43.29 (43.20)	43.29 (43.20)	43.29 (43.20)	43.29 (43.20)	43.92 (43.82)	43.29 (43.20)	43.29 (43.20)	43.29 (43.20)	
**PERF:  38.37 (40.04)	38.37 (40.04)	38.37 (40.04)	38.37 (40.04)	38.37 (40.20)	38.37 (40.04)	38.37 (40.04)	38.37 (40.04)	
**PERF:  37.21 (38.91)	37.21 (38.91)	37.21 (38.91)	37.21 (38.91)	37.21 (39.00)	37.21 (38.91)	37.21 (38.91)	37.21 (38.91)	
**PERF:  55.03 (43.48)	55.03 (43.48)	55.03 (43.48)	55.03 (43.48)	55.03 (43.57)	55.03 (43.48)	55.03 (43.48)	55.03 (43.48)	
**PERF:  84.25 (52.49)	84.25 (52.49)	84.25 (52.49)	84.25 (52.49)	84.25 (52.60)	84.25 (52.49)	84.25 (52.49)	84.25 (52.49)	
**PERF:  85.92 (58.56)	85.92 (58.56)	85.92 (58.56)	85.92 (58.56)	85.92 (58.67)	85.92 (58.56)	85.92 (58.56)	85.92 (58.56)	
**PERF:  59.41 (58.69)	59.41 (58.69)	59.41 (58.69)	59.41 (58.69)	59.41 (58.79)	59.41 (58.69)	59.41 (58.69)	59.41 (58.69)	
**PERF:  45.51 (56.92)	45.51 (56.92)	45.51 (56.92)	45.51 (56.92)	45.51 (57.00)	45.51 (56.92)	45.51 (56.92)	45.51 (56.92)	
**PERF:  39.51 (54.89)	39.51 (54.89)	39.51 (54.89)	39.51 (54.89)	39.51 (54.95)	39.51 (54.89)	39.51 (54.89)	39.51 (54.89)	
**PERF:  33.49 (52.63)	33.49 (52.63)	33.49 (52.63)	33.49 (52.63)	33.49 (52.68)	33.49 (52.63)	33.49 (52.63)	33.49 (52.63)	
**PERF:  35.73 (51.03)	35.73 (51.03)	35.73 (51.03)	35.73 (51.03)	35.73 (51.07)	35.73 (51.03)	35.73 (51.03)	35.73 (51.03)	
**PERF:  48.86 (50.85)	48.86 (50.85)	48.86 (50.85)	48.86 (50.85)	48.86 (50.89)	48.86 (50.85)	48.86 (50.85)	48.86 (50.85)	
**PERF:  84.23 (53.52)	84.23 (53.52)	84.23 (53.52)	84.23 (53.52)	84.23 (53.56)	84.23 (53.52)	84.23 (53.52)	84.23 (53.52)	
**PERF:  83.24 (55.71)	83.24 (55.71)	83.24 (55.71)	83.24 (55.71)	83.24 (55.75)	83.24 (55.71)	83.24 (55.71)	83.24 (55.71)	
**PERF:  70.73 (56.75)	70.73 (56.75)	70.73 (56.75)	70.73 (56.75)	70.73 (56.79)	70.73 (56.75)	70.73 (56.75)	70.73 (56.75)	
**PERF:  51.72 (56.42)	51.72 (56.42)	51.72 (56.42)	51.72 (56.42)	51.72 (56.46)	51.72 (56.42)	51.72 (56.42)	51.72 (56.42)	
**PERF:  67.37 (57.08)	67.37 (57.08)	67.37 (57.08)	67.37 (57.08)	67.37 (57.12)	67.37 (57.08)	67.37 (57.08)	67.37 (57.08)	
**PERF:  58.60 (57.17)	58.60 (57.17)	58.60 (57.17)	58.60 (57.17)	58.60 (57.20)	58.60 (57.17)	58.60 (57.17)	58.60 (57.17)	
**PERF:  58.07 (57.22)	58.07 (57.22)	58.07 (57.22)	58.07 (57.22)	58.07 (57.26)	58.07 (57.22)	58.07 (57.22)	58.07 (57.22)	
**PERF:  79.39 (58.36)	79.39 (58.36)	79.39 (58.36)	79.39 (58.36)	79.39 (58.39)	79.39 (58.36)	79.39 (58.36)	79.39 (58.36)	

For video1 and video2, the FPS is much more stable. This is the result of my model for video1; the output for video2 is very similar:

**PERF:  FPS 0 (Avg)	FPS 1 (Avg)	FPS 2 (Avg)	FPS 3 (Avg)	FPS 4 (Avg)	FPS 5 (Avg)	FPS 6 (Avg)	FPS 7 (Avg)	
**PERF:  78.59 (78.07)	79.26 (78.72)	78.59 (78.07)	80.45 (79.90)	78.59 (78.07)	79.01 (78.49)	78.21 (77.70)	78.21 (77.70)	
**PERF:  81.93 (80.77)	81.93 (81.02)	81.93 (80.77)	81.93 (81.42)	81.93 (80.77)	81.93 (80.92)	81.93 (80.64)	81.93 (80.64)	
**PERF:  82.30 (81.33)	82.30 (81.49)	82.30 (81.33)	82.30 (81.73)	82.30 (81.33)	82.30 (81.42)	82.30 (81.25)	82.30 (81.25)	
**PERF:  81.11 (81.29)	81.11 (81.41)	81.11 (81.29)	81.11 (81.58)	81.11 (81.29)	81.11 (81.36)	81.11 (81.24)	81.11 (81.24)	
**PERF:  81.38 (81.27)	81.38 (81.36)	81.38 (81.27)	81.38 (81.49)	81.38 (81.27)	81.38 (81.32)	81.38 (81.22)	81.38 (81.22)	
**PERF:  78.73 (80.82)	78.73 (80.89)	78.73 (80.82)	78.73 (81.00)	78.73 (80.82)	78.73 (80.86)	78.73 (80.79)	78.73 (80.79)	
**PERF:  81.35 (80.91)	81.35 (80.97)	81.35 (80.91)	81.35 (81.06)	81.35 (80.91)	81.35 (80.95)	81.35 (80.88)	81.35 (80.88)	
**PERF:  77.70 (80.50)	77.70 (80.55)	77.70 (80.50)	77.70 (80.63)	77.70 (80.50)	77.70 (80.53)	77.70 (80.47)	77.70 (80.47)	
**PERF:  80.28 (80.46)	80.28 (80.50)	80.28 (80.46)	80.28 (80.58)	80.28 (80.46)	80.28 (80.49)	80.28 (80.44)	80.28 (80.44)	
**PERF:  78.61 (80.26)	78.21 (80.26)	78.61 (80.26)	78.61 (80.37)	78.61 (80.26)	78.61 (80.29)	78.61 (80.24)	78.61 (80.24)	
**PERF:  78.71 (80.12)	78.71 (80.12)	78.71 (80.12)	78.71 (80.22)	78.71 (80.12)	78.71 (80.15)	78.71 (80.11)	78.71 (80.11)	
**PERF:  76.29 (79.78)	76.49 (79.80)	76.49 (79.80)	76.49 (79.88)	76.49 (79.80)	76.49 (79.82)	76.49 (79.78)	76.49 (79.78)	
**PERF:  79.85 (79.80)	79.85 (79.81)	79.65 (79.80)	79.85 (79.89)	79.85 (79.82)	79.85 (79.83)	79.85 (79.80)	79.85 (79.80)	
**PERF:  77.77 (79.64)	77.77 (79.65)	77.77 (79.64)	77.77 (79.72)	77.77 (79.65)	77.77 (79.67)	77.77 (79.64)	77.77 (79.64)	
**PERF:  77.42 (79.50)	77.42 (79.51)	77.42 (79.50)	77.42 (79.58)	77.42 (79.51)	77.42 (79.52)	77.42 (79.50)	77.42 (79.50)	
**PERF:  76.37 (79.28)	76.37 (79.29)	76.37 (79.28)	76.37 (79.36)	76.17 (79.28)	76.37 (79.31)	76.17 (79.27)	76.37 (79.28)	
**PERF:  77.61 (79.18)	77.41 (79.18)	77.61 (79.18)	77.61 (79.25)	77.61 (79.18)	77.61 (79.21)	77.61 (79.17)	77.41 (79.17)	
**PERF:  76.94 (79.06)	77.14 (79.06)	77.14 (79.07)	77.14 (79.13)	77.14 (79.07)	77.14 (79.09)	77.14 (79.06)	77.14 (79.06)	
**PERF:  80.58 (79.14)	80.58 (79.15)	80.58 (79.15)	80.58 (79.21)	80.58 (79.15)	80.58 (79.17)	80.58 (79.14)	80.58 (79.14)	
**PERF:  80.15 (79.19)	80.15 (79.20)	80.15 (79.20)	80.15 (79.26)	80.15 (79.20)	79.95 (79.21)	80.15 (79.19)	80.15 (79.19)	

However, when I run the official model from darknet (trained on 80 classes), the performance is stable for all three videos, which is very strange.

Do you have any idea why my model is unstable on video3? The main difference between the videos is the resolution: video3 is only 1280x720.

deepstream_app_config:

[application]
enable-perf-measurement=1
perf-measurement-interval-sec=5

[source0]
enable=1
type=3
uri=file:///home/video3.mp4
num-sources=8
gpu-id=0
cudadec-memtype=0

[sink0]
enable=1
#1: Fakesink 2: EGL based windowed sink (nveglglessink)
type=1
sync=0
source-id=0
gpu-id=0
nvbuf-memory-type=0

[streammux]
gpu-id=0
live-source=0
batch-size=8
batched-push-timeout=33333
#video1 and video2
#width=1920
#height=1080
#video3
width=1280
height=720
enable-padding=0
nvbuf-memory-type=0

[primary-gie]
enable=1
gpu-id=0
gie-unique-id=1
nvbuf-memory-type=0
config-file=/opt/nvidia/deepstream/deepstream-5.1/sources/yolov4-csp/config_infer_primary.txt

[tests]
file-loop=0

Hi @fre_deric,

Based on the information you provided, this looks DeepStream related.
We recommend you post your concern on the DeepStream forum to get better help.

Thank you.

Thanks @spolisetty!

Hi @fre_deric,
Could you also share config_infer_primary.txt? Is there custom post-processing for your yolov4-csp?

I think the pipeline of your test is:
video decoding → streammux → nvinfer (→ post-processing) → fakesink

My guess is: video decoding, streammux and nvinfer (without post-processing) should have the same perf for all videos, so it may be caused by post-processing. For example, on video3 the number of detected objects may vary a lot compared with the other videos, which causes a varying post-processing load and therefore unstable FPS.

Hi @mchi,
thank you for your time!

The config file:

[property]
gpu-id=0
net-scale-factor=0.0039215697906911373
model-color-format=0
custom-network-config=/opt/nvidia/deepstream/deepstream-5.1/sources/yolov4/yolov4-csp/my_weights/yolov4-csp.cfg
model-file=/opt/nvidia/deepstream/deepstream-5.1/sources/yolov4/yolov4-csp/my_weights/yolov4-csp_best.weights
model-engine-file=/opt/nvidia/deepstream/deepstream-5.1/sources/yolov4/yolov4-csp/my_weights/model_b8_gpu0_fp16.engine
labelfile-path=/opt/nvidia/deepstream/deepstream-5.1/sources/yolov4/yolov4-csp/my_weights/obj.names
batch-size=8
# 0=FP32, 1=INT8, 2=FP16 mode
network-mode=2
num-detected-classes=1
interval=0
gie-unique-id=1
# Infer Processing Mode 1=Primary Mode 2=Secondary Mode
process-mode=1
# Integer 0: Detector 1: Classifier 2: Segmentation 3: Instance Segmentation
network-type=0
# Integer 0: OpenCV groupRectangles() 1: DBSCAN 2: Non Maximum Suppression 3: DBSCAN + NMS Hybrid 4: No clustering
cluster-mode=4
maintain-aspect-ratio=0
parse-bbox-func-name=NvDsInferParseYolo
custom-lib-path=/opt/nvidia/deepstream/deepstream-5.1/sources/yolov4/nvdsinfer_custom_impl_Yolo/libnvdsinfer_custom_impl_Yolo.so
engine-create-func-name=NvDsInferYoloCudaEngineGet

[class-attrs-all]
pre-cluster-threshold=0.25

What do you mean by “custom post-processing”? Do you mean the clustering algorithm?

My pipeline is:
video decoding → streammux → nvinfer → fakesink

My model detects many more objects in video3 compared to video1 and video2. So could the reason be that the clustering algorithm is slow? In general, the FPS of my model is lower for videos with many detected objects.
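
To illustrate why the object count matters: a classic greedy NMS compares every kept box against all remaining candidates, so its cost grows roughly quadratically with the number of detections. The sketch below is only a generic illustration of that loop, not the actual code in libnvdsinfer_custom_impl_Yolo.so:

// Generic greedy NMS sketch (illustration only, not the actual library code).
// Each kept box is compared against all remaining candidates, so the cost
// grows roughly O(n^2) with the number of detections per frame.
#include <algorithm>
#include <vector>

struct Box { float x1, y1, x2, y2, score; };

static float iou(const Box& a, const Box& b) {
    const float ix1 = std::max(a.x1, b.x1), iy1 = std::max(a.y1, b.y1);
    const float ix2 = std::min(a.x2, b.x2), iy2 = std::min(a.y2, b.y2);
    const float iw = std::max(0.0f, ix2 - ix1), ih = std::max(0.0f, iy2 - iy1);
    const float inter = iw * ih;
    const float areaA = (a.x2 - a.x1) * (a.y2 - a.y1);
    const float areaB = (b.x2 - b.x1) * (b.y2 - b.y1);
    return inter / (areaA + areaB - inter);
}

std::vector<Box> greedyNms(std::vector<Box> boxes, float iouThreshold) {
    // Consider highest-score boxes first.
    std::sort(boxes.begin(), boxes.end(),
              [](const Box& a, const Box& b) { return a.score > b.score; });
    std::vector<Box> kept;
    std::vector<bool> suppressed(boxes.size(), false);
    for (size_t i = 0; i < boxes.size(); ++i) {
        if (suppressed[i]) continue;
        kept.push_back(boxes[i]);
        // Suppress every remaining box that overlaps the kept one too much.
        for (size_t j = i + 1; j < boxes.size(); ++j) {
            if (!suppressed[j] && iou(boxes[i], boxes[j]) > iouThreshold)
                suppressed[j] = true;
        }
    }
    return kept;
}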

By “custom post-processing” I mean NvDsInferParseYolo in your case.
I think NvDsInferParseYolo's performance depends on the number of objects: with more objects, its processing time will be longer and cause lower FPS. You could add debug logging to check this, along the lines of the sketch below.
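
A minimal timing sketch, assuming the standard custom bbox-parser signature from nvdsinfer_custom_impl.h (the real body of NvDsInferParseYolo is elided here and only represented by a placeholder comment):

// Rough sketch: timing wrapper around the existing parser body.
// Assumption: the parser uses the standard custom bbox-parser signature
// declared in nvdsinfer_custom_impl.h.
#include <chrono>
#include <cstdio>
#include <vector>
#include "nvdsinfer_custom_impl.h"

extern "C" bool NvDsInferParseYolo(
    std::vector<NvDsInferLayerInfo> const& outputLayersInfo,
    NvDsInferNetworkInfo const& networkInfo,
    NvDsInferParseDetectionParams const& detectionParams,
    std::vector<NvDsInferParseObjectInfo>& objectList)
{
    const auto t0 = std::chrono::steady_clock::now();

    // ... existing decode + NMS code that fills objectList goes here ...

    const auto t1 = std::chrono::steady_clock::now();
    const double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    // Print the per-frame object count and post-processing time.
    std::printf("NvDsInferParseYolo: %zu objects, %.3f ms\n", objectList.size(), ms);
    return true;
}

If the printed time grows with the object count and tracks the FPS dips on video3, the post-processing is the bottleneck.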

Thanks!

Thank you @mchi, I will try.

There is one more interesting thing.
When I run the pipeline with my model on video1 and video2 (where the performance is stable around 80 FPS), the GPU utilization stays around 90% (per nvidia-smi) and the main deepstream-app process uses about 200% CPU (per htop) the whole time.

However, when I run the pipeline with my model on video3 (where the performance is unstable, from 30 FPS to 80 FPS and back to 30 FPS, and the video contains many objects), GPU utilization is low, around 30%, and interestingly the CPU utilization tracks the FPS: when the FPS is low (around 30 FPS) the main deepstream-app process uses about 120% CPU, and when the FPS is high (around 80 FPS) it uses about 160-200% CPU.

So why doesn't the deepstream-app process run at 200% CPU for the entire video in the case of video3? It seems to me that the CPU is not the bottleneck.

My CPU:
AMD Ryzen 5 2600, 6 cores, 12 threads

I think that’s possible; the reason is the number of objects detected.
You could look into the number of objects detected with the different models and videos.

For video3, the average number of bboxes per frame is 60, while for video1 and video2 it is 35.

What is the solution? A parallel implementation of the non-maximum suppression (NMS) algorithm? Or a more powerful CPU?

CPU-based NMS has to process the objects sequentially.
One solution could be replacing the CPU-based NMS with a GPU NMS, which can process the objects in batches.
You can refer to yolov4_deepstream/tensorrt_yolov4 at master · NVIDIA-AI-IOT/yolov4_deepstream · GitHub for how to add NMS into your model, so your post-processing can simply consume the data output from the NMS layer.
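
For reference, here is a rough sketch of what the bbox parsing could reduce to once the engine itself ends with a batched NMS layer. The output tensor order, box normalization and parser name are assumptions based on TensorRT's BatchedNMS_TRT plugin outputs (num_detections, nmsed_boxes, nmsed_scores, nmsed_classes), not the actual code from that repository; match them to your own model:

// Illustrative parser for an engine whose last layer already performs NMS.
// Assumed output order per batch item: 0=num_detections, 1=nmsed_boxes,
// 2=nmsed_scores, 3=nmsed_classes, with boxes normalized as [x1, y1, x2, y2].
#include <vector>
#include "nvdsinfer_custom_impl.h"

extern "C" bool NvDsInferParseYoloNms(
    std::vector<NvDsInferLayerInfo> const& outputLayersInfo,
    NvDsInferNetworkInfo const& networkInfo,
    NvDsInferParseDetectionParams const& detectionParams,
    std::vector<NvDsInferParseObjectInfo>& objectList)
{
    const int numDet     = *static_cast<const int*>(outputLayersInfo[0].buffer);
    const float* boxes   = static_cast<const float*>(outputLayersInfo[1].buffer);
    const float* scores  = static_cast<const float*>(outputLayersInfo[2].buffer);
    const float* classes = static_cast<const float*>(outputLayersInfo[3].buffer);

    for (int i = 0; i < numDet; ++i) {
        NvDsInferParseObjectInfo obj;
        // Scale normalized corners back to network input pixels.
        obj.left   = boxes[i * 4 + 0] * networkInfo.width;
        obj.top    = boxes[i * 4 + 1] * networkInfo.height;
        obj.width  = (boxes[i * 4 + 2] - boxes[i * 4 + 0]) * networkInfo.width;
        obj.height = (boxes[i * 4 + 3] - boxes[i * 4 + 1]) * networkInfo.height;
        obj.detectionConfidence = scores[i];
        obj.classId = static_cast<unsigned int>(classes[i]);
        objectList.push_back(obj);
    }
    return true;
}

With the NMS inside the engine, the CPU post-processing is just a copy of at most keepTopK boxes per frame, so it no longer scales with the number of raw candidate detections.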

Of course, you need to check whether the NMS in yolov4_deepstream/tensorrt_yolov4 at master · NVIDIA-AI-IOT/yolov4_deepstream · GitHub is exactly the same as what you require for your model.

This topic was automatically closed 2 days after the last reply. New replies are no longer allowed.