Slow real-time performance when running custom YOLOv3 app

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU): Jetson AGX Xavier
• DeepStream Version: 5.0
• JetPack Version (valid for Jetson only): 4.4
• TensorRT Version: 7.1.3
• Issue Type (questions, new requirements, bugs): Questions

How can I get faster real-time performance for an app running YOLOv3?

I have a custom app that takes input from a webcam, uses YOLOv3 for inference, and displays the result on the screen. Despite using a very low resolution (480x360), the output jitters a lot for any movement in front of the webcam (e.g., moving a hand in front of it). When I increase the resolution to 1280x720, the output almost freezes. I want to know which element(s) in the pipeline is the bottleneck, and I would appreciate any advice on improving real-time inference.

The .dot file of my pipeline is shown below.

The model I used is from the NVIDIA-AI-IOT / deepstream_tlt_apps repo.

The model config:

#YOLOV3 CONFIG
[property]
gpu-id=0
net-scale-factor=1.0
offsets=103.939;116.779;123.68
model-color-format=1
labelfile-path=./nvdsinfer_customparser_yolov3_tlt/yolov3_labels.txt
tlt-encoded-model=./models/yolov3/yolo_resnet18.etlt
tlt-model-key=nvidia_tlt
model-engine-file=./models/yolov3/yolo_resnet18.etlt_b1_gpu0_int8.engine
int8-calib-file=./models/yolov3/cal.bin
uff-input-dims=3;544;960;0
uff-input-blob-name=Input
batch-size=1
##0=FP32, 1=INT8, 2=FP16 mode
network-mode=1
num-detected-classes=4
interval=0
gie-unique-id=1
is-classifier=0
#network-type=0
#no cluster
cluster-mode=3
output-blob-names=BatchedNMS
parse-bbox-func-name=NvDsInferParseCustomYOLOV3TLT
custom-lib-path=./nvdsinfer_customparser_yolov3_tlt/libnvds_infercustomparser_yolov3_tlt.so
[class-attrs-all]
pre-cluster-threshold=0.5
roi-top-offset=0
roi-bottom-offset=0
detected-min-w=0
detected-min-h=0
detected-max-w=0
detected-max-h=0

Hi @hyperlight,
please use the commands below to check the formats your camera supports.

$ sudo apt-get install v4l-utils

$ v4l2-ctl -d /dev/video0 --list-formats-ext

And could you run
/usr/src/tensorrt/bin/trtexec --loadEngine=./models/yolov3/yolo_resnet18.etlt_b1_gpu0_int8.engine and check the inference performance of this model?

Hi @mchi,
Output of v4l2-ctl -d /dev/video0 --list-formats-ext:

ioctl: VIDIOC_ENUM_FMT
Index : 0
Type : Video Capture
Pixel Format: 'YUYV'
Name : YUYV 4:2:2
Size: Discrete 640x480
Interval: Discrete 0.033s (30.000 fps)
Interval: Discrete 0.040s (25.000 fps)
Interval: Discrete 0.050s (20.000 fps)
Interval: Discrete 0.067s (15.000 fps)
Interval: Discrete 0.100s (10.000 fps)
Interval: Discrete 0.200s (5.000 fps)
Size: Discrete 160x120
Interval: Discrete 0.033s (30.000 fps)
Interval: Discrete 0.040s (25.000 fps)
Interval: Discrete 0.050s (20.000 fps)
Interval: Discrete 0.067s (15.000 fps)
Interval: Discrete 0.100s (10.000 fps)
Interval: Discrete 0.200s (5.000 fps)
Size: Discrete 176x144
Interval: Discrete 0.033s (30.000 fps)
Interval: Discrete 0.040s (25.000 fps)
Interval: Discrete 0.050s (20.000 fps)
Interval: Discrete 0.067s (15.000 fps)
Interval: Discrete 0.100s (10.000 fps)
Interval: Discrete 0.200s (5.000 fps)
Size: Discrete 320x176
Interval: Discrete 0.033s (30.000 fps)
Interval: Discrete 0.040s (25.000 fps)
Interval: Discrete 0.050s (20.000 fps)
Interval: Discrete 0.067s (15.000 fps)
Interval: Discrete 0.100s (10.000 fps)
Interval: Discrete 0.200s (5.000 fps)
Size: Discrete 320x240
Interval: Discrete 0.033s (30.000 fps)
Interval: Discrete 0.040s (25.000 fps)
Interval: Discrete 0.050s (20.000 fps)
Interval: Discrete 0.067s (15.000 fps)
Interval: Discrete 0.100s (10.000 fps)
Interval: Discrete 0.200s (5.000 fps)
Size: Discrete 352x288
Interval: Discrete 0.033s (30.000 fps)
Interval: Discrete 0.040s (25.000 fps)
Interval: Discrete 0.050s (20.000 fps)
Interval: Discrete 0.067s (15.000 fps)
Interval: Discrete 0.100s (10.000 fps)
Interval: Discrete 0.200s (5.000 fps)
Size: Discrete 432x240
Interval: Discrete 0.033s (30.000 fps)
Interval: Discrete 0.040s (25.000 fps)
Interval: Discrete 0.050s (20.000 fps)
Interval: Discrete 0.067s (15.000 fps)
Interval: Discrete 0.100s (10.000 fps)
Interval: Discrete 0.200s (5.000 fps)
Size: Discrete 544x288
Interval: Discrete 0.033s (30.000 fps)
Interval: Discrete 0.040s (25.000 fps)
Interval: Discrete 0.050s (20.000 fps)
Interval: Discrete 0.067s (15.000 fps)
Interval: Discrete 0.100s (10.000 fps)
Interval: Discrete 0.200s (5.000 fps)
Size: Discrete 640x360
Interval: Discrete 0.033s (30.000 fps)
Interval: Discrete 0.040s (25.000 fps)
Interval: Discrete 0.050s (20.000 fps)
Interval: Discrete 0.067s (15.000 fps)
Interval: Discrete 0.100s (10.000 fps)
Interval: Discrete 0.200s (5.000 fps)
Size: Discrete 752x416
Interval: Discrete 0.040s (25.000 fps)
Interval: Discrete 0.050s (20.000 fps)
Interval: Discrete 0.067s (15.000 fps)
Interval: Discrete 0.100s (10.000 fps)
Interval: Discrete 0.200s (5.000 fps)
Size: Discrete 800x448
Interval: Discrete 0.050s (20.000 fps)
Interval: Discrete 0.067s (15.000 fps)
Interval: Discrete 0.100s (10.000 fps)
Interval: Discrete 0.200s (5.000 fps)
Size: Discrete 800x600
Interval: Discrete 0.050s (20.000 fps)
Interval: Discrete 0.067s (15.000 fps)
Interval: Discrete 0.100s (10.000 fps)
Interval: Discrete 0.200s (5.000 fps)
Size: Discrete 864x480
Interval: Discrete 0.050s (20.000 fps)
Interval: Discrete 0.067s (15.000 fps)
Interval: Discrete 0.100s (10.000 fps)
Interval: Discrete 0.200s (5.000 fps)
Size: Discrete 960x544
Interval: Discrete 0.067s (15.000 fps)
Interval: Discrete 0.100s (10.000 fps)
Interval: Discrete 0.200s (5.000 fps)
Size: Discrete 960x720
Interval: Discrete 0.100s (10.000 fps)
Interval: Discrete 0.200s (5.000 fps)
Size: Discrete 1024x576
Interval: Discrete 0.100s (10.000 fps)
Interval: Discrete 0.200s (5.000 fps)
Size: Discrete 1184x656
Interval: Discrete 0.100s (10.000 fps)
Interval: Discrete 0.200s (5.000 fps)
Size: Discrete 1280x720
Interval: Discrete 0.133s (7.500 fps)
Interval: Discrete 0.200s (5.000 fps)
Size: Discrete 1280x960
Interval: Discrete 0.133s (7.500 fps)
Interval: Discrete 0.200s (5.000 fps)
Index : 1
Type : Video Capture
Pixel Format: 'MJPG' (compressed)
Name : Motion-JPEG
Size: Discrete 640x480
Interval: Discrete 0.033s (30.000 fps)
Interval: Discrete 0.040s (25.000 fps)
Interval: Discrete 0.050s (20.000 fps)
Interval: Discrete 0.067s (15.000 fps)
Interval: Discrete 0.100s (10.000 fps)
Interval: Discrete 0.200s (5.000 fps)
Size: Discrete 160x120
Interval: Discrete 0.033s (30.000 fps)
Interval: Discrete 0.040s (25.000 fps)
Interval: Discrete 0.050s (20.000 fps)
Interval: Discrete 0.067s (15.000 fps)
Interval: Discrete 0.100s (10.000 fps)
Interval: Discrete 0.200s (5.000 fps)
Size: Discrete 176x144
Interval: Discrete 0.033s (30.000 fps)
Interval: Discrete 0.040s (25.000 fps)
Interval: Discrete 0.050s (20.000 fps)
Interval: Discrete 0.067s (15.000 fps)
Interval: Discrete 0.100s (10.000 fps)
Interval: Discrete 0.200s (5.000 fps)
Size: Discrete 320x176
Interval: Discrete 0.033s (30.000 fps)
Interval: Discrete 0.040s (25.000 fps)
Interval: Discrete 0.050s (20.000 fps)
Interval: Discrete 0.067s (15.000 fps)
Interval: Discrete 0.100s (10.000 fps)
Interval: Discrete 0.200s (5.000 fps)
Size: Discrete 320x240
Interval: Discrete 0.033s (30.000 fps)
Interval: Discrete 0.040s (25.000 fps)
Interval: Discrete 0.050s (20.000 fps)
Interval: Discrete 0.067s (15.000 fps)
Interval: Discrete 0.100s (10.000 fps)
Interval: Discrete 0.200s (5.000 fps)
Size: Discrete 352x288
Interval: Discrete 0.033s (30.000 fps)
Interval: Discrete 0.040s (25.000 fps)
Interval: Discrete 0.050s (20.000 fps)
Interval: Discrete 0.067s (15.000 fps)
Interval: Discrete 0.100s (10.000 fps)
Interval: Discrete 0.200s (5.000 fps)
Size: Discrete 432x240
Interval: Discrete 0.033s (30.000 fps)
Interval: Discrete 0.040s (25.000 fps)
Interval: Discrete 0.050s (20.000 fps)
Interval: Discrete 0.067s (15.000 fps)
Interval: Discrete 0.100s (10.000 fps)
Interval: Discrete 0.200s (5.000 fps)
Size: Discrete 544x288
Interval: Discrete 0.033s (30.000 fps)
Interval: Discrete 0.040s (25.000 fps)
Interval: Discrete 0.050s (20.000 fps)
Interval: Discrete 0.067s (15.000 fps)
Interval: Discrete 0.100s (10.000 fps)
Interval: Discrete 0.200s (5.000 fps)
Size: Discrete 640x360
Interval: Discrete 0.033s (30.000 fps)
Interval: Discrete 0.040s (25.000 fps)
Interval: Discrete 0.050s (20.000 fps)
Interval: Discrete 0.067s (15.000 fps)
Interval: Discrete 0.100s (10.000 fps)
Interval: Discrete 0.200s (5.000 fps)
Size: Discrete 752x416
Interval: Discrete 0.033s (30.000 fps)
Interval: Discrete 0.040s (25.000 fps)
Interval: Discrete 0.050s (20.000 fps)
Interval: Discrete 0.067s (15.000 fps)
Interval: Discrete 0.100s (10.000 fps)
Interval: Discrete 0.200s (5.000 fps)
Size: Discrete 800x448
Interval: Discrete 0.033s (30.000 fps)
Interval: Discrete 0.040s (25.000 fps)
Interval: Discrete 0.050s (20.000 fps)
Interval: Discrete 0.067s (15.000 fps)
Interval: Discrete 0.100s (10.000 fps)
Interval: Discrete 0.200s (5.000 fps)
Size: Discrete 800x600
Interval: Discrete 0.033s (30.000 fps)
Interval: Discrete 0.040s (25.000 fps)
Interval: Discrete 0.050s (20.000 fps)
Interval: Discrete 0.067s (15.000 fps)
Interval: Discrete 0.100s (10.000 fps)
Interval: Discrete 0.200s (5.000 fps)
Size: Discrete 864x480
Interval: Discrete 0.033s (30.000 fps)
Interval: Discrete 0.040s (25.000 fps)
Interval: Discrete 0.050s (20.000 fps)
Interval: Discrete 0.067s (15.000 fps)
Interval: Discrete 0.100s (10.000 fps)
Interval: Discrete 0.200s (5.000 fps)
Size: Discrete 960x544
Interval: Discrete 0.033s (30.000 fps)
Interval: Discrete 0.040s (25.000 fps)
Interval: Discrete 0.050s (20.000 fps)
Interval: Discrete 0.067s (15.000 fps)
Interval: Discrete 0.100s (10.000 fps)
Interval: Discrete 0.200s (5.000 fps)
Size: Discrete 960x720
Interval: Discrete 0.033s (30.000 fps)
Interval: Discrete 0.040s (25.000 fps)
Interval: Discrete 0.050s (20.000 fps)
Interval: Discrete 0.067s (15.000 fps)
Interval: Discrete 0.100s (10.000 fps)
Interval: Discrete 0.200s (5.000 fps)
Size: Discrete 1024x576
Interval: Discrete 0.033s (30.000 fps)
Interval: Discrete 0.040s (25.000 fps)
Interval: Discrete 0.050s (20.000 fps)
Interval: Discrete 0.067s (15.000 fps)
Interval: Discrete 0.100s (10.000 fps)
Interval: Discrete 0.200s (5.000 fps)
Size: Discrete 1184x656
Interval: Discrete 0.033s (30.000 fps)
Interval: Discrete 0.040s (25.000 fps)
Interval: Discrete 0.050s (20.000 fps)
Interval: Discrete 0.067s (15.000 fps)
Interval: Discrete 0.100s (10.000 fps)
Interval: Discrete 0.200s (5.000 fps)
Size: Discrete 1280x720
Interval: Discrete 0.033s (30.000 fps)
Interval: Discrete 0.040s (25.000 fps)
Interval: Discrete 0.050s (20.000 fps)
Interval: Discrete 0.067s (15.000 fps)
Interval: Discrete 0.100s (10.000 fps)
Interval: Discrete 0.200s (5.000 fps)
Size: Discrete 1280x960
Interval: Discrete 0.033s (30.000 fps)
Interval: Discrete 0.040s (25.000 fps)
Interval: Discrete 0.050s (20.000 fps)
Interval: Discrete 0.067s (15.000 fps)
Interval: Discrete 0.100s (10.000 fps)
Interval: Discrete 0.200s (5.000 fps)

Output of /usr/src/tensorrt/bin/trtexec --loadEngine=./models/yolov3/yolo_resnet18.etlt_b1_gpu0_int8.engine:

&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --loadEngine=./models/yolov3/yolo_resnet18.etlt_b1_gpu0_int8.engine
[10/06/2020-09:17:52] [I] === Model Options ===
[10/06/2020-09:17:52] [I] Format: *
[10/06/2020-09:17:52] [I] Model:
[10/06/2020-09:17:52] [I] Output:
[10/06/2020-09:17:52] [I] === Build Options ===
[10/06/2020-09:17:52] [I] Max batch: 1
[10/06/2020-09:17:52] [I] Workspace: 16 MB
[10/06/2020-09:17:52] [I] minTiming: 1
[10/06/2020-09:17:52] [I] avgTiming: 8
[10/06/2020-09:17:52] [I] Precision: FP32
[10/06/2020-09:17:52] [I] Calibration:
[10/06/2020-09:17:52] [I] Safe mode: Disabled
[10/06/2020-09:17:52] [I] Save engine:
[10/06/2020-09:17:52] [I] Load engine: ./models/yolov3/yolo_resnet18.etlt_b1_gpu0_int8.engine
[10/06/2020-09:17:52] [I] Builder Cache: Enabled
[10/06/2020-09:17:52] [I] NVTX verbosity: 0
[10/06/2020-09:17:52] [I] Inputs format: fp32:CHW
[10/06/2020-09:17:52] [I] Outputs format: fp32:CHW
[10/06/2020-09:17:52] [I] Input build shapes: model
[10/06/2020-09:17:52] [I] Input calibration shapes: model
[10/06/2020-09:17:52] [I] === System Options ===
[10/06/2020-09:17:52] [I] Device: 0
[10/06/2020-09:17:52] [I] DLACore:
[10/06/2020-09:17:52] [I] Plugins:
[10/06/2020-09:17:52] [I] === Inference Options ===
[10/06/2020-09:17:52] [I] Batch: 1
[10/06/2020-09:17:52] [I] Input inference shapes: model
[10/06/2020-09:17:52] [I] Iterations: 10
[10/06/2020-09:17:52] [I] Duration: 3s (+ 200ms warm up)
[10/06/2020-09:17:52] [I] Sleep time: 0ms
[10/06/2020-09:17:52] [I] Streams: 1
[10/06/2020-09:17:52] [I] ExposeDMA: Disabled
[10/06/2020-09:17:52] [I] Spin-wait: Disabled
[10/06/2020-09:17:52] [I] Multithreading: Disabled
[10/06/2020-09:17:52] [I] CUDA Graph: Disabled
[10/06/2020-09:17:52] [I] Skip inference: Disabled
[10/06/2020-09:17:52] [I] Inputs:
[10/06/2020-09:17:52] [I] === Reporting Options ===
[10/06/2020-09:17:52] [I] Verbose: Disabled
[10/06/2020-09:17:52] [I] Averages: 10 inferences
[10/06/2020-09:17:52] [I] Percentile: 99
[10/06/2020-09:17:52] [I] Dump output: Disabled
[10/06/2020-09:17:52] [I] Profile: Disabled
[10/06/2020-09:17:52] [I] Export timing to JSON file:
[10/06/2020-09:17:52] [I] Export output to JSON file:
[10/06/2020-09:17:52] [I] Export profile to JSON file:
[10/06/2020-09:17:52] [I]
[10/06/2020-09:17:55] [I] Starting inference threads
[10/06/2020-09:17:58] [I] Warmup completed 10 queries over 200 ms
[10/06/2020-09:17:58] [I] Timing trace has 239 queries over 3.00813 s
[10/06/2020-09:17:58] [I] Trace averages of 10 runs:
[10/06/2020-09:17:58] [I] Average on 10 runs - GPU latency: 19.9815 ms - Host latency: 20.6992 ms (end to end 20.7134 ms, enqueue 2.43267 ms)
[10/06/2020-09:17:58] [I] Average on 10 runs - GPU latency: 19.8745 ms - Host latency: 20.5911 ms (end to end 20.6048 ms, enqueue 2.26504 ms)
[10/06/2020-09:17:58] [I] Average on 10 runs - GPU latency: 19.873 ms - Host latency: 20.5902 ms (end to end 20.6048 ms, enqueue 2.15845 ms)
[10/06/2020-09:17:58] [I] Average on 10 runs - GPU latency: 19.8841 ms - Host latency: 20.6014 ms (end to end 20.6149 ms, enqueue 2.10379 ms)
[10/06/2020-09:17:58] [I] Average on 10 runs - GPU latency: 19.8131 ms - Host latency: 20.5298 ms (end to end 20.544 ms, enqueue 2.14235 ms)
[10/06/2020-09:17:58] [I] Average on 10 runs - GPU latency: 14.2874 ms - Host latency: 14.8297 ms (end to end 14.8427 ms, enqueue 2.32069 ms)
[10/06/2020-09:17:58] [I] Average on 10 runs - GPU latency: 10.4989 ms - Host latency: 10.8667 ms (end to end 10.8779 ms, enqueue 2.23696 ms)
[10/06/2020-09:17:58] [I] Average on 10 runs - GPU latency: 10.1784 ms - Host latency: 10.5355 ms (end to end 10.5477 ms, enqueue 2.139 ms)
[10/06/2020-09:17:58] [I] Average on 10 runs - GPU latency: 9.79843 ms - Host latency: 10.1391 ms (end to end 10.1503 ms, enqueue 2.07015 ms)
[10/06/2020-09:17:58] [I] Average on 10 runs - GPU latency: 9.80042 ms - Host latency: 10.1412 ms (end to end 10.1512 ms, enqueue 2.08679 ms)
[10/06/2020-09:17:58] [I] Average on 10 runs - GPU latency: 9.79639 ms - Host latency: 10.1371 ms (end to end 10.1487 ms, enqueue 2.09757 ms)
[10/06/2020-09:17:58] [I] Average on 10 runs - GPU latency: 9.79784 ms - Host latency: 10.1386 ms (end to end 10.1481 ms, enqueue 2.06501 ms)
[10/06/2020-09:17:58] [I] Average on 10 runs - GPU latency: 9.80624 ms - Host latency: 10.1468 ms (end to end 10.1585 ms, enqueue 2.47946 ms)
[10/06/2020-09:17:58] [I] Average on 10 runs - GPU latency: 9.81094 ms - Host latency: 10.1514 ms (end to end 10.1627 ms, enqueue 2.12319 ms)
[10/06/2020-09:17:58] [I] Average on 10 runs - GPU latency: 9.80247 ms - Host latency: 10.1431 ms (end to end 10.155 ms, enqueue 2.20193 ms)
[10/06/2020-09:17:58] [I] Average on 10 runs - GPU latency: 9.7968 ms - Host latency: 10.1376 ms (end to end 10.1474 ms, enqueue 2.174 ms)
[10/06/2020-09:17:58] [I] Average on 10 runs - GPU latency: 9.80544 ms - Host latency: 10.1461 ms (end to end 10.1584 ms, enqueue 2.10437 ms)
[10/06/2020-09:17:58] [I] Average on 10 runs - GPU latency: 9.81025 ms - Host latency: 10.151 ms (end to end 10.1618 ms, enqueue 2.07773 ms)
[10/06/2020-09:17:58] [I] Average on 10 runs - GPU latency: 9.80071 ms - Host latency: 10.1417 ms (end to end 10.1525 ms, enqueue 2.06943 ms)
[10/06/2020-09:17:58] [I] Average on 10 runs - GPU latency: 9.80681 ms - Host latency: 10.148 ms (end to end 10.1592 ms, enqueue 2.08638 ms)
[10/06/2020-09:17:58] [I] Average on 10 runs - GPU latency: 9.80215 ms - Host latency: 10.1429 ms (end to end 10.1546 ms, enqueue 2.07375 ms)
[10/06/2020-09:17:58] [I] Average on 10 runs - GPU latency: 9.8 ms - Host latency: 10.1409 ms (end to end 10.1509 ms, enqueue 2.08784 ms)
[10/06/2020-09:17:58] [I] Average on 10 runs - GPU latency: 9.79844 ms - Host latency: 10.1393 ms (end to end 10.1501 ms, enqueue 2.08569 ms)
[10/06/2020-09:17:58] [I] Host Latency
[10/06/2020-09:17:58] [I] min: 10.113 ms (end to end 10.1213 ms)
[10/06/2020-09:17:58] [I] max: 20.9977 ms (end to end 21.0118 ms)
[10/06/2020-09:17:58] [I] mean: 12.5744 ms (end to end 12.5862 ms)
[10/06/2020-09:17:58] [I] median: 10.1527 ms (end to end 10.1633 ms)
[10/06/2020-09:17:58] [I] percentile: 20.9602 ms at 99% (end to end 20.9805 ms at 99%)
[10/06/2020-09:17:58] [I] throughput: 79.4514 qps
[10/06/2020-09:17:58] [I] walltime: 3.00813 s
[10/06/2020-09:17:58] [I] Enqueue Time
[10/06/2020-09:17:58] [I] min: 1.92896 ms
[10/06/2020-09:17:58] [I] max: 3.30591 ms
[10/06/2020-09:17:58] [I] median: 2.1178 ms
[10/06/2020-09:17:58] [I] GPU Compute
[10/06/2020-09:17:58] [I] min: 9.77295 ms
[10/06/2020-09:17:58] [I] max: 20.2773 ms
[10/06/2020-09:17:58] [I] mean: 12.1444 ms
[10/06/2020-09:17:58] [I] median: 9.81201 ms
[10/06/2020-09:17:58] [I] percentile: 20.2404 ms at 99%
[10/06/2020-09:17:58] [I] total compute time: 2.90252 s
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --loadEngine=./models/yolov3/yolo_resnet18.etlt_b1_gpu0_int8.engine

I looked through the trtexec documentation but couldn’t find any information on how to interpret the output shown above. Could you provide some guidance to help solve my issue?

Your camera supports two formats. Because there is no MJPEG decoding component in your pipeline, you must be using YUYV 4:2:2 at 1280x720, where the frame rate is low: 7.5 or 5 fps.
According to your trtexec log, the frame rate your model can support is
fps = 1 second * batch / (host latency) = 1 s x 1 batch / ~10.2 ms ≈ 98 fps
(the reported throughput of 79.45 qps is lower because the first ~50 runs in the trace took about 20 ms each).

Did you boost the CPU/GPU/EMC clocks of the Xavier? If not, please run the commands below and measure the trtexec performance again.

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Index : 0
Type : Video Capture
Pixel Format: 'YUYV'
Name : YUYV 4:2:2
..
Size: Discrete 1280x720
Interval: Discrete 0.133s (7.500 fps)
Interval: Discrete 0.200s (5.000 fps)
..
Index : 1
Type : Video Capture
Pixel Format: 'MJPG' (compressed)
Name : Motion-JPEG
...
Size: Discrete 1280x720
Interval: Discrete 0.033s (30.000 fps)
Interval: Discrete 0.040s (25.000 fps)
Interval: Discrete 0.050s (20.000 fps)
Interval: Discrete 0.067s (15.000 fps)
Interval: Discrete 0.100s (10.000 fps)
Interval: Discrete 0.200s (5.000 fps)

Hi @mchi,

Thank you for your reply. After boosting the CPU/GPU/EMC clocks, the current settings are shown below:

SOC family:tegra194 Machine:Jetson-AGX
Online CPUs: 0-7
CPU Cluster Switching: Disabled
cpu0: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=2265600 CurrentFreq=1190400 IdleStates: C1=1 c6=1
cpu1: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=2265600 CurrentFreq=1190400 IdleStates: C1=1 c6=1
cpu2: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=2265600 CurrentFreq=1190400 IdleStates: C1=1 c6=1
cpu3: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=2265600 CurrentFreq=1190400 IdleStates: C1=1 c6=1
cpu4: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=2265600 CurrentFreq=1267200 IdleStates: C1=1 c6=1
cpu5: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=2265600 CurrentFreq=1190400 IdleStates: C1=1 c6=1
cpu6: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=2265600 CurrentFreq=1190400 IdleStates: C1=1 c6=1
cpu7: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=2265600 CurrentFreq=1190400 IdleStates: C1=1 c6=1
GPU MinFreq=318750000 MaxFreq=1377000000 CurrentFreq=318750000
EMC MinFreq=204000000 MaxFreq=2133000000 CurrentFreq=408000000 FreqOverride=0
Fan: speed=0
NV Power Mode: MAXN

Re-running the trtexec perf:


[10/07/2020-11:56:51] [I] === Inference Options ===
[10/07/2020-11:56:51] [I] Batch: 1
[10/07/2020-11:56:51] [I] Input inference shapes: model
[10/07/2020-11:56:51] [I] Iterations: 10
[10/07/2020-11:56:51] [I] Duration: 3s (+ 200ms warm up)
[10/07/2020-11:56:51] [I] Sleep time: 0ms
[10/07/2020-11:56:51] [I] Streams: 1
[10/07/2020-11:56:51] [I] ExposeDMA: Disabled
[10/07/2020-11:56:51] [I] Spin-wait: Disabled
[10/07/2020-11:56:51] [I] Multithreading: Disabled
[10/07/2020-11:56:51] [I] CUDA Graph: Disabled
[10/07/2020-11:56:51] [I] Skip inference: Disabled
[10/07/2020-11:56:51] [I] Inputs:

[10/07/2020-11:56:57] [I] Host Latency
[10/07/2020-11:56:57] [I] min: 5.26544 ms (end to end 5.2749 ms)
[10/07/2020-11:56:57] [I] max: 20.643 ms (end to end 20.6511 ms)
[10/07/2020-11:56:57] [I] mean: 6.47423 ms (end to end 6.48417 ms)
[10/07/2020-11:56:57] [I] median: 5.41748 ms (end to end 5.4281 ms)
[10/07/2020-11:56:57] [I] percentile: 20.6161 ms at 99% (end to end 20.6348 ms at 99%)
[10/07/2020-11:56:57] [I] throughput: 154.22 qps
[10/07/2020-11:56:57] [I] walltime: 3.0022 s
[10/07/2020-11:56:57] [I] Enqueue Time
[10/07/2020-11:56:57] [I] min: 1.69434 ms
[10/07/2020-11:56:57] [I] max: 4.0396 ms
[10/07/2020-11:56:57] [I] median: 2.62793 ms
[10/07/2020-11:56:57] [I] GPU Compute
[10/07/2020-11:56:57] [I] min: 5.09668 ms
[10/07/2020-11:56:57] [I] max: 19.9271 ms
[10/07/2020-11:56:57] [I] mean: 6.25913 ms
[10/07/2020-11:56:57] [I] median: 5.23999 ms
[10/07/2020-11:56:57] [I] percentile: 19.9014 ms at 99%
[10/07/2020-11:56:57] [I] total compute time: 2.89798 s
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --loadEngine=/home/minh/myapps/test_apps/yolov3_app/models/yolov3/yolo_resnet18.etlt_b1_gpu0_int8.engine

Based on your example calculation, the new fps of the model is: 1 * 1 / ~6.5 ms ≈ 154 fps, which matches the reported throughput of 154.22 qps.

I re-ran the app after boosting the clocks, but it is still very slow. I have a capsfilter that restricts the stream to a lower resolution (640x360) at 30 fps. Since the stream passes through it, does that indicate that my source actually records at 30 fps? I tried to reduce the resolution at the source element, but v4l2src doesn’t have a property for that. Do you know of a way to measure the performance of the pipeline in terms of fps?

You can refer to the change below to measure the fps.

diff --git a/deepstream_test2_app.c b/deepstream_test2_app.c
index 949219f..e8ed8f7 100644
--- a/deepstream_test2_app.c
+++ b/deepstream_test2_app.c
@@ -80,6 +80,12 @@ guint sgie1_unique_id = 2;
 guint sgie2_unique_id = 3;
 guint sgie3_unique_id = 4;
 
+typedef struct _perf_measure{
+    GstClockTime pre_time;
+    GstClockTime total_time;
+    guint count;
+}perf_measure;
+
 /* This is the buffer probe function that we have registered on the sink pad
  * of the OSD element. All the infer elements in the pipeline shall attach
  * their metadata to the GstBuffer, here we will iterate & process the metadata
@@ -96,9 +102,27 @@ osd_sink_pad_buffer_probe (GstPad * pad, GstPadProbeInfo * info,
     NvDsMetaList * l_frame = NULL;
     NvDsMetaList * l_obj = NULL;
     NvDsDisplayMeta *display_meta = NULL;
+    GstClockTime now;
+    perf_measure * perf = (perf_measure *)(u_data);
 
     NvDsBatchMeta *batch_meta = gst_buffer_get_nvds_batch_meta (buf);
 
+    now = g_get_monotonic_time();
+
+    if (perf->pre_time == GST_CLOCK_TIME_NONE) {
+        perf->pre_time = now;
+        perf->total_time = GST_CLOCK_TIME_NONE;
+    } else {
+        if (perf->total_time == GST_CLOCK_TIME_NONE) {
+            perf->total_time = (now - perf->pre_time);
+        }
+        else {
+            perf->total_time += (now - perf->pre_time);
+        }
+        perf->pre_time = now;
+        perf->count++;
+    }
+
     for (l_frame = batch_meta->frame_meta_list; l_frame != NULL;
       l_frame = l_frame->next) {
         NvDsFrameMeta *frame_meta = (NvDsFrameMeta *) (l_frame->data);
@@ -476,6 +501,13 @@ main (int argc, char *argv[])
   }
 #endif
 
+  perf_measure perf_measure;
+  int src_cnt = 1;  // the source number, set to 1 temporarily 
+
+  perf_measure.pre_time = GST_CLOCK_TIME_NONE;
+  perf_measure.total_time = GST_CLOCK_TIME_NONE;
+  perf_measure.count = 0;
+
   /* Lets add probe to get informed of the meta data generated, we add probe to
    * the sink pad of the osd element, since by that time, the buffer would have
    * had got all the metadata. */
@@ -484,7 +516,7 @@ main (int argc, char *argv[])
     g_print ("Unable to get sink pad\n");
   else
     gst_pad_add_probe (osd_sink_pad, GST_PAD_PROBE_TYPE_BUFFER,
-        osd_sink_pad_buffer_probe, NULL, NULL);
+        osd_sink_pad_buffer_probe, &perf_measure, NULL);
   gst_object_unref (osd_sink_pad);
 
   /* Set the pipeline to "playing" state */
@@ -499,6 +531,7 @@ main (int argc, char *argv[])
   g_print ("Returned, stopping playback\n");
   gst_element_set_state (pipeline, GST_STATE_NULL);
   g_print ("Deleting pipeline\n");
+  g_print ("Average fps %f\n",(perf_measure.count*src_cnt*1000000.0)/perf_measure.total_time);
   gst_object_unref (GST_OBJECT (pipeline));
   g_source_remove (bus_watch_id);
   g_main_loop_unref (loop);

Hi @mchi, thank you for the suggested code, I will check it out and report back.

Hi, is there any sample that measures the frame rate of each individual stream, based on deepstream-test3?

Hi wdw0908,

Please open a new topic for your issue. Thanks.