How can I get faster real-time performance for an app running YOLOv3?
I have a custom app that takes input from a webcam, uses YOLOv3 for inference, and displays the result on the screen. Despite using a very low resolution (480x360), the output jitters a lot for any movement in front of the webcam (e.g., moving a hand in front of it). When I increase the resolution to 1280x720, the output almost freezes. I want to know which element(s) in the pipeline are the bottleneck, and I would appreciate any advice on improving real-time inference.
Hi @hyperlight,
please use the commands below to check the formats your camera supports.
$ sudo apt-get install v4l-utils
$ v4l2-ctl -d /dev/video0 --list-formats-ext
And could you run
$ /usr/src/tensorrt/bin/trtexec --loadEngine=./models/yolov3/yolo_resnet18.etlt_b1_gpu0_int8.engine
and check the inference performance of this model?
Output of /usr/src/tensorrt/bin/trtexec --loadEngine=./models/yolov3/yolo_resnet18.etlt_b1_gpu0_int8.engine:
&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --loadEngine=./models/yolov3/yolo_resnet18.etlt_b1_gpu0_int8.engine
[10/06/2020-09:17:52] [I] === Model Options ===
[10/06/2020-09:17:52] [I] Format: *
[10/06/2020-09:17:52] [I] Model:
[10/06/2020-09:17:52] [I] Output:
[10/06/2020-09:17:52] [I] === Build Options ===
[10/06/2020-09:17:52] [I] Max batch: 1
[10/06/2020-09:17:52] [I] Workspace: 16 MB
[10/06/2020-09:17:52] [I] minTiming: 1
[10/06/2020-09:17:52] [I] avgTiming: 8
[10/06/2020-09:17:52] [I] Precision: FP32
[10/06/2020-09:17:52] [I] Calibration:
[10/06/2020-09:17:52] [I] Safe mode: Disabled
[10/06/2020-09:17:52] [I] Save engine:
[10/06/2020-09:17:52] [I] Load engine: ./models/yolov3/yolo_resnet18.etlt_b1_gpu0_int8.engine
[10/06/2020-09:17:52] [I] Builder Cache: Enabled
[10/06/2020-09:17:52] [I] NVTX verbosity: 0
[10/06/2020-09:17:52] [I] Inputs format: fp32:CHW
[10/06/2020-09:17:52] [I] Outputs format: fp32:CHW
[10/06/2020-09:17:52] [I] Input build shapes: model
[10/06/2020-09:17:52] [I] Input calibration shapes: model
[10/06/2020-09:17:52] [I] === System Options ===
[10/06/2020-09:17:52] [I] Device: 0
[10/06/2020-09:17:52] [I] DLACore:
[10/06/2020-09:17:52] [I] Plugins:
[10/06/2020-09:17:52] [I] === Inference Options ===
[10/06/2020-09:17:52] [I] Batch: 1
[10/06/2020-09:17:52] [I] Input inference shapes: model
[10/06/2020-09:17:52] [I] Iterations: 10
[10/06/2020-09:17:52] [I] Duration: 3s (+ 200ms warm up)
[10/06/2020-09:17:52] [I] Sleep time: 0ms
[10/06/2020-09:17:52] [I] Streams: 1
[10/06/2020-09:17:52] [I] ExposeDMA: Disabled
[10/06/2020-09:17:52] [I] Spin-wait: Disabled
[10/06/2020-09:17:52] [I] Multithreading: Disabled
[10/06/2020-09:17:52] [I] CUDA Graph: Disabled
[10/06/2020-09:17:52] [I] Skip inference: Disabled
[10/06/2020-09:17:52] [I] Inputs:
[10/06/2020-09:17:52] [I] === Reporting Options ===
[10/06/2020-09:17:52] [I] Verbose: Disabled
[10/06/2020-09:17:52] [I] Averages: 10 inferences
[10/06/2020-09:17:52] [I] Percentile: 99
[10/06/2020-09:17:52] [I] Dump output: Disabled
[10/06/2020-09:17:52] [I] Profile: Disabled
[10/06/2020-09:17:52] [I] Export timing to JSON file:
[10/06/2020-09:17:52] [I] Export output to JSON file:
[10/06/2020-09:17:52] [I] Export profile to JSON file:
[10/06/2020-09:17:52] [I]
[10/06/2020-09:17:55] [I] Starting inference threads
[10/06/2020-09:17:58] [I] Warmup completed 10 queries over 200 ms
[10/06/2020-09:17:58] [I] Timing trace has 239 queries over 3.00813 s
[10/06/2020-09:17:58] [I] Trace averages of 10 runs:
[10/06/2020-09:17:58] [I] Average on 10 runs - GPU latency: 19.9815 ms - Host latency: 20.6992 ms (end to end 20.7134 ms, enqueue 2.43267 ms)
[10/06/2020-09:17:58] [I] Average on 10 runs - GPU latency: 19.8745 ms - Host latency: 20.5911 ms (end to end 20.6048 ms, enqueue 2.26504 ms)
[10/06/2020-09:17:58] [I] Average on 10 runs - GPU latency: 19.873 ms - Host latency: 20.5902 ms (end to end 20.6048 ms, enqueue 2.15845 ms)
[10/06/2020-09:17:58] [I] Average on 10 runs - GPU latency: 19.8841 ms - Host latency: 20.6014 ms (end to end 20.6149 ms, enqueue 2.10379 ms)
[10/06/2020-09:17:58] [I] Average on 10 runs - GPU latency: 19.8131 ms - Host latency: 20.5298 ms (end to end 20.544 ms, enqueue 2.14235 ms)
[10/06/2020-09:17:58] [I] Average on 10 runs - GPU latency: 14.2874 ms - Host latency: 14.8297 ms (end to end 14.8427 ms, enqueue 2.32069 ms)
[10/06/2020-09:17:58] [I] Average on 10 runs - GPU latency: 10.4989 ms - Host latency: 10.8667 ms (end to end 10.8779 ms, enqueue 2.23696 ms)
[10/06/2020-09:17:58] [I] Average on 10 runs - GPU latency: 10.1784 ms - Host latency: 10.5355 ms (end to end 10.5477 ms, enqueue 2.139 ms)
[10/06/2020-09:17:58] [I] Average on 10 runs - GPU latency: 9.79843 ms - Host latency: 10.1391 ms (end to end 10.1503 ms, enqueue 2.07015 ms)
[10/06/2020-09:17:58] [I] Average on 10 runs - GPU latency: 9.80042 ms - Host latency: 10.1412 ms (end to end 10.1512 ms, enqueue 2.08679 ms)
[10/06/2020-09:17:58] [I] Average on 10 runs - GPU latency: 9.79639 ms - Host latency: 10.1371 ms (end to end 10.1487 ms, enqueue 2.09757 ms)
[10/06/2020-09:17:58] [I] Average on 10 runs - GPU latency: 9.79784 ms - Host latency: 10.1386 ms (end to end 10.1481 ms, enqueue 2.06501 ms)
[10/06/2020-09:17:58] [I] Average on 10 runs - GPU latency: 9.80624 ms - Host latency: 10.1468 ms (end to end 10.1585 ms, enqueue 2.47946 ms)
[10/06/2020-09:17:58] [I] Average on 10 runs - GPU latency: 9.81094 ms - Host latency: 10.1514 ms (end to end 10.1627 ms, enqueue 2.12319 ms)
[10/06/2020-09:17:58] [I] Average on 10 runs - GPU latency: 9.80247 ms - Host latency: 10.1431 ms (end to end 10.155 ms, enqueue 2.20193 ms)
[10/06/2020-09:17:58] [I] Average on 10 runs - GPU latency: 9.7968 ms - Host latency: 10.1376 ms (end to end 10.1474 ms, enqueue 2.174 ms)
[10/06/2020-09:17:58] [I] Average on 10 runs - GPU latency: 9.80544 ms - Host latency: 10.1461 ms (end to end 10.1584 ms, enqueue 2.10437 ms)
[10/06/2020-09:17:58] [I] Average on 10 runs - GPU latency: 9.81025 ms - Host latency: 10.151 ms (end to end 10.1618 ms, enqueue 2.07773 ms)
[10/06/2020-09:17:58] [I] Average on 10 runs - GPU latency: 9.80071 ms - Host latency: 10.1417 ms (end to end 10.1525 ms, enqueue 2.06943 ms)
[10/06/2020-09:17:58] [I] Average on 10 runs - GPU latency: 9.80681 ms - Host latency: 10.148 ms (end to end 10.1592 ms, enqueue 2.08638 ms)
[10/06/2020-09:17:58] [I] Average on 10 runs - GPU latency: 9.80215 ms - Host latency: 10.1429 ms (end to end 10.1546 ms, enqueue 2.07375 ms)
[10/06/2020-09:17:58] [I] Average on 10 runs - GPU latency: 9.8 ms - Host latency: 10.1409 ms (end to end 10.1509 ms, enqueue 2.08784 ms)
[10/06/2020-09:17:58] [I] Average on 10 runs - GPU latency: 9.79844 ms - Host latency: 10.1393 ms (end to end 10.1501 ms, enqueue 2.08569 ms)
[10/06/2020-09:17:58] [I] Host Latency
[10/06/2020-09:17:58] [I] min: 10.113 ms (end to end 10.1213 ms)
[10/06/2020-09:17:58] [I] max: 20.9977 ms (end to end 21.0118 ms)
[10/06/2020-09:17:58] [I] mean: 12.5744 ms (end to end 12.5862 ms)
[10/06/2020-09:17:58] [I] median: 10.1527 ms (end to end 10.1633 ms)
[10/06/2020-09:17:58] [I] percentile: 20.9602 ms at 99% (end to end 20.9805 ms at 99%)
[10/06/2020-09:17:58] [I] throughput: 79.4514 qps
[10/06/2020-09:17:58] [I] walltime: 3.00813 s
[10/06/2020-09:17:58] [I] Enqueue Time
[10/06/2020-09:17:58] [I] min: 1.92896 ms
[10/06/2020-09:17:58] [I] max: 3.30591 ms
[10/06/2020-09:17:58] [I] median: 2.1178 ms
[10/06/2020-09:17:58] [I] GPU Compute
[10/06/2020-09:17:58] [I] min: 9.77295 ms
[10/06/2020-09:17:58] [I] max: 20.2773 ms
[10/06/2020-09:17:58] [I] mean: 12.1444 ms
[10/06/2020-09:17:58] [I] median: 9.81201 ms
[10/06/2020-09:17:58] [I] percentile: 20.2404 ms at 99%
[10/06/2020-09:17:58] [I] total compute time: 2.90252 s
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --loadEngine=./models/yolov3/yolo_resnet18.etlt_b1_gpu0_int8.engine
I tried looking through the trtexec documentation but couldn't find any info on how to interpret the output shown above. Could you provide some guidance to help solve my issue?
Your camera supports two formats. Because there is no MJPEG decoding component in your pipeline, you must be using YUYV 4:2:2 @ 1280x720, whose frame rate is low: 7.5 or 5 fps.
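If the app is GStreamer-based, one way to reach the camera's 30 fps MJPG mode is to request image/jpeg caps from v4l2src and decode them before inference. This is only a sketch under that assumption: jpegdec is the software decoder from gst-plugins-good, and on Jetson a hardware decoder (e.g. nvv4l2decoder with mjpeg=1) may be available instead; adjust the device path and sink to match the actual pipeline.

```shell
# Request the camera's compressed MJPG format at 30 fps, then decode it.
gst-launch-1.0 v4l2src device=/dev/video0 \
  ! "image/jpeg,width=1280,height=720,framerate=30/1" \
  ! jpegdec ! videoconvert ! autovideosink
```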
And according to your trtexec output log, the fps your model can support is
fps = 1 second * batch / (host latency) = 1 second x 1 batch / ~10.2 ms ≈ 98 fps
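As a sanity check, the formula above can be applied to the logged latencies. This is a minimal sketch; max_fps is a hypothetical helper, not part of trtexec.

```python
# Hypothetical helper: upper bound on frames per second an engine can
# sustain, given the host latency reported by trtexec.
def max_fps(host_latency_ms: float, batch: int = 1) -> float:
    # One query takes host_latency_ms; batch frames are produced per query.
    return 1000.0 * batch / host_latency_ms

# Median host latency from the log above was ~10.2 ms at batch 1:
print(round(max_fps(10.2)))  # -> 98
```

This is only the inference ceiling; capture, preprocessing, and rendering in the rest of the pipeline can only lower the effective fps.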
Did you boost the CPU/GPU/EMC clocks of the Xavier? If not, please run the commands below and measure the trtexec performance again.
$ sudo nvpmodel -m 0
$ sudo jetson_clocks
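To confirm the clocks were actually pinned, the same tools can report the current state (a sketch; both commands ship with JetPack on Jetson devices):

```shell
# Print the current clock configuration after running jetson_clocks.
sudo jetson_clocks --show
# Show live CPU/GPU/EMC utilisation while the app runs (Ctrl-C to stop).
sudo tegrastats
```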
Index : 0
Type : Video Capture
Pixel Format: 'YUYV'
Name : YUYV 4:2:2
..
Size: Discrete 1280x720
Interval: Discrete 0.133s (7.500 fps)
Interval: Discrete 0.200s (5.000 fps)
..
Index : 1
Type : Video Capture
Pixel Format: 'MJPG' (compressed)
Name : Motion-JPEG
...
Size: Discrete 1280x720
Interval: Discrete 0.033s (30.000 fps)
Interval: Discrete 0.040s (25.000 fps)
Interval: Discrete 0.050s (20.000 fps)
Interval: Discrete 0.067s (15.000 fps)
Interval: Discrete 0.100s (10.000 fps)
Interval: Discrete 0.200s (5.000 fps)
…
[10/07/2020-11:56:51] [I] === Inference Options ===
[10/07/2020-11:56:51] [I] Batch: 1
[10/07/2020-11:56:51] [I] Input inference shapes: model
[10/07/2020-11:56:51] [I] Iterations: 10
[10/07/2020-11:56:51] [I] Duration: 3s (+ 200ms warm up)
[10/07/2020-11:56:51] [I] Sleep time: 0ms
[10/07/2020-11:56:51] [I] Streams: 1
[10/07/2020-11:56:51] [I] ExposeDMA: Disabled
[10/07/2020-11:56:51] [I] Spin-wait: Disabled
[10/07/2020-11:56:51] [I] Multithreading: Disabled
[10/07/2020-11:56:51] [I] CUDA Graph: Disabled
[10/07/2020-11:56:51] [I] Skip inference: Disabled
[10/07/2020-11:56:51] [I] Inputs:
…
[10/07/2020-11:56:57] [I] Host Latency
[10/07/2020-11:56:57] [I] min: 5.26544 ms (end to end 5.2749 ms)
[10/07/2020-11:56:57] [I] max: 20.643 ms (end to end 20.6511 ms)
[10/07/2020-11:56:57] [I] mean: 6.47423 ms (end to end 6.48417 ms)
[10/07/2020-11:56:57] [I] median: 5.41748 ms (end to end 5.4281 ms)
[10/07/2020-11:56:57] [I] percentile: 20.6161 ms at 99% (end to end 20.6348 ms at 99%)
[10/07/2020-11:56:57] [I] throughput: 154.22 qps
[10/07/2020-11:56:57] [I] walltime: 3.0022 s
[10/07/2020-11:56:57] [I] Enqueue Time
[10/07/2020-11:56:57] [I] min: 1.69434 ms
[10/07/2020-11:56:57] [I] max: 4.0396 ms
[10/07/2020-11:56:57] [I] median: 2.62793 ms
[10/07/2020-11:56:57] [I] GPU Compute
[10/07/2020-11:56:57] [I] min: 5.09668 ms
[10/07/2020-11:56:57] [I] max: 19.9271 ms
[10/07/2020-11:56:57] [I] mean: 6.25913 ms
[10/07/2020-11:56:57] [I] median: 5.23999 ms
[10/07/2020-11:56:57] [I] percentile: 19.9014 ms at 99%
[10/07/2020-11:56:57] [I] total compute time: 2.89798 s
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --loadEngine=/home/minh/myapps/test_apps/yolov3_app/models/yolov3/yolo_resnet18.etlt_b1_gpu0_int8.engine
Based on your example calculation, the new fps the model can support is: 1 second x 1 batch / ~6.5 ms ≈ 154 fps
I re-ran after boosting the clocks, but it's still very slow. I have a capsfilter that filters the stream to a lower resolution (640x360) at 30 fps; since the stream goes through, does that indicate that my source actually records at 30 fps? I tried to reduce the resolution at the source element, but v4l2src doesn't have a property for that. Do you know of a way to measure the performance of the pipeline in terms of fps?
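One common way to measure end-to-end pipeline fps in GStreamer is fpsdisplaysink from gst-plugins-bad. This is a sketch assuming a v4l2 source; fakesink is used as the inner sink so rendering cost is excluded, and -v makes the measured current/average fps print to stdout:

```shell
gst-launch-1.0 -v v4l2src device=/dev/video0 \
  ! "video/x-raw,width=640,height=360,framerate=30/1" \
  ! videoconvert \
  ! fpsdisplaysink video-sink=fakesink text-overlay=false sync=false
```

Inserting this in place of the real sink (or after any suspect element) helps narrow down which stage limits throughput.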