Custom YOLOv4 Model Performance

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU): Jetson Xavier NX
• DeepStream Version: 6.0.1
• JetPack Version (valid for Jetson only): 4.6.1
• TensorRT Version: 8.2.3
• Issue Type (questions, new requirements, bugs): performance

Hello, I am currently running a custom YOLOv4 model via DeepStream. I have implemented a custom bounding-box parsing function and have confirmed that the model produces bounding boxes on a video stream from my USB camera. However, when running the pipeline with my custom model, performance is quite poor: the video stream drops from 30 fps to ~8 fps, the GPU usage reported by tegrastats shoots up to 99% and stays there, and even on the highest power mode I get a system warning that the device is drawing overcurrent.

Here is my GStreamer pipeline:

gst-launch-1.0 nvarguscamerasrc num-buffers=600 bufapi-version=1 ! \
  "video/x-raw(memory:NVMM), format=(string)NV12,width=1920,height=1080" ! queue ! mux.sink_0 \
  nvstreammux name=mux width=1920 height=1080 batched-push-timeout=40000 batch-size=1 ! queue ! \
  nvinfer config-file-path=config_infer_primary_yoloV4.txt batch-size=1 ! \
  nvvideoconvert ! nvdsosd ! nvvideoconvert ! nvoverlaysink sync=false

and the contents of config_infer_primary_yoloV4.txt:

[property]
gpu-id=0
net-scale-factor=0.0039215697906911373
#0=RGB, 1=BGR
model-color-format=0
model-engine-file=<my-engine-file>
#model-engine-file=<my-engine-file>
labelfile-path=labels.txt
batch-size=1
## 0=FP32, 1=INT8, 2=FP16 mode
network-mode=0
num-detected-classes=14
gie-unique-id=1
network-type=0
## 0=Group Rectangles, 1=DBSCAN, 2=NMS, 3= DBSCAN+NMS Hybrid, 4 = None(No clustering)
cluster-mode=2
maintain-aspect-ratio=1
parse-bbox-func-name=NvDsInferParseCustomYoloV4
custom-lib-path=nvdsinfer_custom_impl_Yolo/libnvdsinfer_custom_impl_Yolo.so
#scaling-filter=0
#scaling-compute-hw=0

[class-attrs-all]
nms-iou-threshold=0.3
pre-cluster-threshold=0.7

Here are the benchmark details from generating the engine file from my ONNX model with trtexec:
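For reference, the engine is built with a trtexec command along these lines (a minimal sketch only; <my-model>.onnx and <my-engine-file> are placeholders for my actual paths, and with no precision flag trtexec builds in its default FP32 mode):

trtexec --onnx=<my-model>.onnx --saveEngine=<my-engine-file>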

[03/24/2022-16:02:08] [I] === Trace details ===
[03/24/2022-16:02:08] [I] Trace averages of 10 runs:
[03/24/2022-16:02:08] [I] Average on 10 runs - GPU latency: 161.096 ms - Host latency: 161.321 ms (end to end 161.329 ms, enqueue 6.72987 ms)
[03/24/2022-16:02:08] [I] Average on 10 runs - GPU latency: 159.599 ms - Host latency: 159.823 ms (end to end 159.833 ms, enqueue 7.01724 ms)
[03/24/2022-16:02:08] [I] 
[03/24/2022-16:02:08] [I] === Performance summary ===
[03/24/2022-16:02:08] [I] Throughput: 6.23187 qps
[03/24/2022-16:02:08] [I] Latency: min = 158.026 ms, max = 171.872 ms, mean = 160.455 ms, median = 158.344 ms, percentile(99%) = 171.872 ms
[03/24/2022-16:02:08] [I] End-to-End Host Latency: min = 158.036 ms, max = 171.882 ms, mean = 160.464 ms, median = 158.353 ms, percentile(99%) = 171.882 ms
[03/24/2022-16:02:08] [I] Enqueue Time: min = 5.49029 ms, max = 8.14746 ms, mean = 6.838 ms, median = 6.90735 ms, percentile(99%) = 8.14746 ms
[03/24/2022-16:02:08] [I] H2D Latency: min = 0.150635 ms, max = 0.151855 ms, mean = 0.151147 ms, median = 0.151108 ms, percentile(99%) = 0.151855 ms
[03/24/2022-16:02:08] [I] GPU Compute Time: min = 157.801 ms, max = 171.648 ms, mean = 160.231 ms, median = 158.118 ms, percentile(99%) = 171.648 ms
[03/24/2022-16:02:08] [I] D2H Latency: min = 0.0666504 ms, max = 0.0751953 ms, mean = 0.073309 ms, median = 0.0737305 ms, percentile(99%) = 0.0751953 ms
[03/24/2022-16:02:08] [I] Total Host Walltime: 3.36978 s
[03/24/2022-16:02:08] [I] Total GPU Compute Time: 3.36484 s

I have also tried generating an engine with INT8 precision to see whether that would improve performance. Here is the benchmark for that engine:
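The INT8 engine is built with a command roughly like this (again only a sketch with placeholder paths; as far as I understand, when no calibration cache is passed via --calib, trtexec substitutes placeholder dynamic ranges, so the numbers below reflect speed only, not accuracy):

trtexec --onnx=<my-model>.onnx --int8 --saveEngine=<my-int8-engine-file>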

[03/24/2022-14:20:02] [I] === Trace details ===
[03/24/2022-14:20:02] [I] Trace averages of 10 runs:
[03/24/2022-14:20:02] [I] Average on 10 runs - GPU latency: 29.6069 ms - Host latency: 29.8325 ms (end to end 29.8428 ms, enqueue 4.81874 ms)
[03/24/2022-14:20:02] [I] Average on 10 runs - GPU latency: 30.0944 ms - Host latency: 30.32 ms (end to end 30.3322 ms, enqueue 4.77361 ms)
[03/24/2022-14:20:02] [I] Average on 10 runs - GPU latency: 31.7687 ms - Host latency: 31.9942 ms (end to end 32.005 ms, enqueue 4.92628 ms)
[03/24/2022-14:20:02] [I] Average on 10 runs - GPU latency: 34.1578 ms - Host latency: 34.383 ms (end to end 34.3928 ms, enqueue 5.40447 ms)
[03/24/2022-14:20:02] [I] Average on 10 runs - GPU latency: 35.1031 ms - Host latency: 35.3282 ms (end to end 35.3396 ms, enqueue 5.06415 ms)
[03/24/2022-14:20:02] [I] Average on 10 runs - GPU latency: 38.1269 ms - Host latency: 38.3525 ms (end to end 38.4138 ms, enqueue 5.07999 ms)
[03/24/2022-14:20:02] [I] Average on 10 runs - GPU latency: 36.2894 ms - Host latency: 36.5153 ms (end to end 36.7761 ms, enqueue 5.03879 ms)
[03/24/2022-14:20:02] [I] Average on 10 runs - GPU latency: 30.9542 ms - Host latency: 31.1796 ms (end to end 31.1892 ms, enqueue 3.86367 ms)
[03/24/2022-14:20:02] [I] Average on 10 runs - GPU latency: 29.6091 ms - Host latency: 29.8348 ms (end to end 29.8465 ms, enqueue 3.34136 ms)
[03/24/2022-14:20:02] [I] 
[03/24/2022-14:20:02] [I] === Performance summary ===
[03/24/2022-14:20:02] [I] Throughput: 30.2517 qps
[03/24/2022-14:20:02] [I] Latency: min = 29.7716 ms, max = 42.6958 ms, mean = 33.0112 ms, median = 29.867 ms, percentile(99%) = 42.6958 ms
[03/24/2022-14:20:02] [I] End-to-End Host Latency: min = 29.7776 ms, max = 42.7061 ms, mean = 33.0546 ms, median = 29.8798 ms, percentile(99%) = 42.7061 ms
[03/24/2022-14:20:02] [I] Enqueue Time: min = 3.22852 ms, max = 6.39673 ms, mean = 4.67438 ms, median = 4.81081 ms, percentile(99%) = 6.39673 ms
[03/24/2022-14:20:02] [I] H2D Latency: min = 0.150024 ms, max = 0.165527 ms, mean = 0.151203 ms, median = 0.150879 ms, percentile(99%) = 0.165527 ms
[03/24/2022-14:20:02] [I] GPU Compute Time: min = 29.5451 ms, max = 42.4719 ms, mean = 32.7856 ms, median = 29.6414 ms, percentile(99%) = 42.4719 ms
[03/24/2022-14:20:02] [I] D2H Latency: min = 0.0681152 ms, max = 0.0783691 ms, mean = 0.0742972 ms, median = 0.0744324 ms, percentile(99%) = 0.0783691 ms
[03/24/2022-14:20:02] [I] Total Host Walltime: 3.04115 s
[03/24/2022-14:20:02] [I] Total GPU Compute Time: 3.01628 s

When I run my pipeline with this engine file (and network-mode in the configuration file changed to 1), performance is better: I am able to hit 30 fps again, and GPU utilization hovers between 70% and 90% rather than staying at 99%. However, I don’t see any bounding boxes drawn. Is there something else I need to change or configure to run the model with INT8 precision? I can print the class scores for the YOLO bounding boxes from the custom bounding-box parser, and they appear to be roughly a factor of 100 lower with the INT8 model, so I am not sure whether there is something else I need to change.

In general, I am looking for advice on how to narrow down where the performance issues are coming from and how we might mitigate them. I can provide more details, but unfortunately I cannot share the model outside of my organization.

Per the above settings, you are running in FP32 mode, which is much slower than INT8.
The easiest fix is to change it to network-mode=2 to run FP16.
If you want to run INT8, you also need an INT8 calibration table in addition to setting network-mode=1; generating the calibration table takes some extra effort.
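For example, the relevant lines in config_infer_primary_yoloV4.txt would look roughly like this (int8-calib-file is the nvinfer property for the calibration cache; the path shown is just a placeholder for a table you generate yourself):

[property]
## 0=FP32, 1=INT8, 2=FP16 mode
network-mode=2
## For INT8 instead, set network-mode=1 and point to your calibration cache:
#network-mode=1
#int8-calib-file=<path-to-calibration-table>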

