Please provide complete information as applicable to your setup.
• Hardware Platform (Jetson / GPU): Jetson Xavier NX
• DeepStream Version: 6.0.1
• JetPack Version (valid for Jetson only): 4.6.1
• TensorRT Version: 8.2.3
• Issue Type (questions, new requirements, bugs): Performance
Hello, I am currently running a custom YOLOv4 model via DeepStream. I have implemented a custom bounding-box parsing function and the model produces bounding boxes on a video stream from my USB camera. However, with my custom model in the pipeline the performance is quite poor: the video stream drops from 30 fps to ~8 fps, the GPU usage reported by tegrastats shoots up to 99% and stays there, and even in the highest power mode I get a system warning that the device is drawing overcurrent.
Here is my GStreamer pipeline:
gst-launch-1.0 nvarguscamerasrc num-buffers=600 bufapi-version=1 ! "video/x-raw(memory:NVMM), format=(string)NV12,width=1920,height=1080" ! queue ! mux.sink_0 nvstreammux name=mux width=1920 height=1080 batched-push-timeout=40000 batch-size=1 ! queue ! nvinfer config-file-path=config_infer_primary_yoloV4.txt batch-size=1 ! nvvideoconvert ! nvdsosd ! nvvideoconvert ! nvoverlaysink sync=false
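As a sanity check, here is a stripped-down version of the same pipeline with the inference and OSD elements removed (untested exactly as written); if this alone holds 30 fps, the slowdown is coming from nvinfer rather than from the capture/display path:
gst-launch-1.0 nvarguscamerasrc num-buffers=600 bufapi-version=1 ! "video/x-raw(memory:NVMM), format=(string)NV12,width=1920,height=1080" ! queue ! mux.sink_0 nvstreammux name=mux width=1920 height=1080 batched-push-timeout=40000 batch-size=1 ! nvvideoconvert ! nvoverlaysink sync=false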
And here are the contents of config_infer_primary_yoloV4.txt:
[property]
gpu-id=0
net-scale-factor=0.0039215697906911373
#0=RGB, 1=BGR
model-color-format=0
model-engine-file=<my-engine-file>
#model-engine-file=<my-engine-file>
labelfile-path=labels.txt
batch-size=1
## 0=FP32, 1=INT8, 2=FP16 mode
network-mode=0
num-detected-classes=14
gie-unique-id=1
network-type=0
## 0=Group Rectangles, 1=DBSCAN, 2=NMS, 3= DBSCAN+NMS Hybrid, 4 = None(No clustering)
cluster-mode=2
maintain-aspect-ratio=1
parse-bbox-func-name=NvDsInferParseCustomYoloV4
custom-lib-path=nvdsinfer_custom_impl_Yolo/libnvdsinfer_custom_impl_Yolo.so
#scaling-filter=0
#scaling-compute-hw=0
[class-attrs-all]
nms-iou-threshold=0.3
pre-cluster-threshold=0.7
Here are the benchmark details when I generate the engine file from my ONNX model using trtexec:
[03/24/2022-16:02:08] [I] === Trace details ===
[03/24/2022-16:02:08] [I] Trace averages of 10 runs:
[03/24/2022-16:02:08] [I] Average on 10 runs - GPU latency: 161.096 ms - Host latency: 161.321 ms (end to end 161.329 ms, enqueue 6.72987 ms)
[03/24/2022-16:02:08] [I] Average on 10 runs - GPU latency: 159.599 ms - Host latency: 159.823 ms (end to end 159.833 ms, enqueue 7.01724 ms)
[03/24/2022-16:02:08] [I]
[03/24/2022-16:02:08] [I] === Performance summary ===
[03/24/2022-16:02:08] [I] Throughput: 6.23187 qps
[03/24/2022-16:02:08] [I] Latency: min = 158.026 ms, max = 171.872 ms, mean = 160.455 ms, median = 158.344 ms, percentile(99%) = 171.872 ms
[03/24/2022-16:02:08] [I] End-to-End Host Latency: min = 158.036 ms, max = 171.882 ms, mean = 160.464 ms, median = 158.353 ms, percentile(99%) = 171.882 ms
[03/24/2022-16:02:08] [I] Enqueue Time: min = 5.49029 ms, max = 8.14746 ms, mean = 6.838 ms, median = 6.90735 ms, percentile(99%) = 8.14746 ms
[03/24/2022-16:02:08] [I] H2D Latency: min = 0.150635 ms, max = 0.151855 ms, mean = 0.151147 ms, median = 0.151108 ms, percentile(99%) = 0.151855 ms
[03/24/2022-16:02:08] [I] GPU Compute Time: min = 157.801 ms, max = 171.648 ms, mean = 160.231 ms, median = 158.118 ms, percentile(99%) = 171.648 ms
[03/24/2022-16:02:08] [I] D2H Latency: min = 0.0666504 ms, max = 0.0751953 ms, mean = 0.073309 ms, median = 0.0737305 ms, percentile(99%) = 0.0751953 ms
[03/24/2022-16:02:08] [I] Total Host Walltime: 3.36978 s
[03/24/2022-16:02:08] [I] Total GPU Compute Time: 3.36484 s
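For reference, that engine was generated with a trtexec invocation along these lines (placeholders instead of my actual file names):
trtexec --onnx=<my-onnx-file> --saveEngine=<my-engine-file>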
I have also generated an engine file with INT8 precision to see whether that would improve performance. Here is the benchmark for that engine:
[03/24/2022-14:20:02] [I] === Trace details ===
[03/24/2022-14:20:02] [I] Trace averages of 10 runs:
[03/24/2022-14:20:02] [I] Average on 10 runs - GPU latency: 29.6069 ms - Host latency: 29.8325 ms (end to end 29.8428 ms, enqueue 4.81874 ms)
[03/24/2022-14:20:02] [I] Average on 10 runs - GPU latency: 30.0944 ms - Host latency: 30.32 ms (end to end 30.3322 ms, enqueue 4.77361 ms)
[03/24/2022-14:20:02] [I] Average on 10 runs - GPU latency: 31.7687 ms - Host latency: 31.9942 ms (end to end 32.005 ms, enqueue 4.92628 ms)
[03/24/2022-14:20:02] [I] Average on 10 runs - GPU latency: 34.1578 ms - Host latency: 34.383 ms (end to end 34.3928 ms, enqueue 5.40447 ms)
[03/24/2022-14:20:02] [I] Average on 10 runs - GPU latency: 35.1031 ms - Host latency: 35.3282 ms (end to end 35.3396 ms, enqueue 5.06415 ms)
[03/24/2022-14:20:02] [I] Average on 10 runs - GPU latency: 38.1269 ms - Host latency: 38.3525 ms (end to end 38.4138 ms, enqueue 5.07999 ms)
[03/24/2022-14:20:02] [I] Average on 10 runs - GPU latency: 36.2894 ms - Host latency: 36.5153 ms (end to end 36.7761 ms, enqueue 5.03879 ms)
[03/24/2022-14:20:02] [I] Average on 10 runs - GPU latency: 30.9542 ms - Host latency: 31.1796 ms (end to end 31.1892 ms, enqueue 3.86367 ms)
[03/24/2022-14:20:02] [I] Average on 10 runs - GPU latency: 29.6091 ms - Host latency: 29.8348 ms (end to end 29.8465 ms, enqueue 3.34136 ms)
[03/24/2022-14:20:02] [I]
[03/24/2022-14:20:02] [I] === Performance summary ===
[03/24/2022-14:20:02] [I] Throughput: 30.2517 qps
[03/24/2022-14:20:02] [I] Latency: min = 29.7716 ms, max = 42.6958 ms, mean = 33.0112 ms, median = 29.867 ms, percentile(99%) = 42.6958 ms
[03/24/2022-14:20:02] [I] End-to-End Host Latency: min = 29.7776 ms, max = 42.7061 ms, mean = 33.0546 ms, median = 29.8798 ms, percentile(99%) = 42.7061 ms
[03/24/2022-14:20:02] [I] Enqueue Time: min = 3.22852 ms, max = 6.39673 ms, mean = 4.67438 ms, median = 4.81081 ms, percentile(99%) = 6.39673 ms
[03/24/2022-14:20:02] [I] H2D Latency: min = 0.150024 ms, max = 0.165527 ms, mean = 0.151203 ms, median = 0.150879 ms, percentile(99%) = 0.165527 ms
[03/24/2022-14:20:02] [I] GPU Compute Time: min = 29.5451 ms, max = 42.4719 ms, mean = 32.7856 ms, median = 29.6414 ms, percentile(99%) = 42.4719 ms
[03/24/2022-14:20:02] [I] D2H Latency: min = 0.0681152 ms, max = 0.0783691 ms, mean = 0.0742972 ms, median = 0.0744324 ms, percentile(99%) = 0.0783691 ms
[03/24/2022-14:20:02] [I] Total Host Walltime: 3.04115 s
[03/24/2022-14:20:02] [I] Total GPU Compute Time: 3.01628 s
When I run my pipeline with this engine file (and network-mode in the configuration file changed to 1), the performance is better: I am able to hit 30 fps again and the GPU usage hovers between 70% and 90% rather than staying at 99%. However, I don't see any bounding boxes drawn. Is there something else I need to change or configure to run the model with INT8 precision? Printing the class scores from my custom bounding-box parser shows they are lower by roughly a factor of 100 for the INT8 model, so I am not sure what else needs to change.
In general, I am looking for advice on how to proceed with figuring out where the performance issues are coming from and how we can potentially mitigate them. I can provide more details but unfortunately cannot share the model outside of my organization.
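For completeness, these are the standard JetPack utilities I am using (or could use) to watch the device while the pipeline runs:
sudo nvpmodel -q     # query the current power mode
sudo jetson_clocks   # lock clocks to their maximum for the current power mode
tegrastats           # monitor GPU load, memory, and temperatures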