• Hardware Platform: Quadro RTX 3000 (Mobile)
• DeepStream Version: 5.1
• TensorRT: 7.2 (nvcr.io/nvidia/tensorrt:21.03-py3)
• NVIDIA GPU Driver Version: 460.39
I’m trying to implement a video segmentation pipeline in DeepStream. Currently, instead of using a video source, I simply read the grayscale images I want to segment with multifilesrc. After decoding, nvstreammux assembles batches of size 1 before passing them to nvinfer (setting batch-size=4 on the mux stalls the pipeline). However, since the engine is optimised for a batch size of 4, the nvinfer plugin is configured to process batches of 4 images.
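For reference, my understanding is that the legacy nvstreammux takes at most one frame per source per batch, so a single multifilesrc can never fill a 4-frame batch, and batched-push-timeout=-1 then blocks indefinitely. A sketch that should actually fill 4-frame batches (hypothetical part0…part3 directories holding the images split across four sources, and an arbitrary 40 ms push timeout) would be:

gst-launch-1.0 nvstreammux name=mux batch-size=4 batched-push-timeout=40000 width=768 height=288 ! \
  nvinfer config-file-path=/opt/nvidia/deepstream/deepstream-5.1/deepstream-data/model_config.txt ! fakesink sync=false \
  multifilesrc location=part0/%04d.jpg ! jpegparse ! nvv4l2decoder ! mux.sink_0 \
  multifilesrc location=part1/%04d.jpg ! jpegparse ! nvv4l2decoder ! mux.sink_1 \
  multifilesrc location=part2/%04d.jpg ! jpegparse ! nvv4l2decoder ! mux.sink_2 \
  multifilesrc location=part3/%04d.jpg ! jpegparse ! nvv4l2decoder ! mux.sink_3

(The caveat being that frames then arrive interleaved across the four sources rather than in sequence.)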
The pipeline is extremely slow: processing the 180 images takes about 24 seconds, i.e. around 7 frames per second, with only 10% GPU utilisation (as reported by nvidia-smi). The pipeline is assembled and executed with gst-launch-1.0 directly, in conjunction with the nvinfer config file, and the output of nvinfer is sent to a fakesink. According to TensorRT, the model normally has a GPU latency of 20 ms per 4-frame batch, which corresponds to roughly 200 fps, so I would assume the nvinfer plugin is what slows the pipeline down?
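To narrow down which element is costing the time, GStreamer’s built-in latency tracer can log per-buffer pipeline latency to the debug log (a generic sketch; “…” stands for the full pipeline below):

GST_DEBUG="GST_TRACER:7" GST_TRACERS=latency gst-launch-1.0 ... 2> latency.log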
Pipeline
gst-launch-1.0 multifilesrc location=/opt/nvidia/deepstream/deepstream-5.1/deepstream-data/copied_images/%04d.jpg num-buffers=180 ! \
  jpegparse ! nvv4l2decoder ! mux.sink_0 \
  nvstreammux name=mux batch-size=1 batched-push-timeout=-1 width=768 height=288 attach-sys-ts=1 ! \
  nvinfer interval=0 config-file-path=/opt/nvidia/deepstream/deepstream-5.1/deepstream-data/model_config.txt ! \
  fakesink sync=false
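As a quick throughput check, the plain fakesink can be swapped for the standard fpsdisplaysink element, which prints rendered/dropped/current-fps updates when the pipeline is run with gst-launch-1.0 -v (with batch-size=1 each buffer is one frame, so the numbers are directly in fps):

  ... ! nvinfer interval=0 config-file-path=/opt/nvidia/deepstream/deepstream-5.1/deepstream-data/model_config.txt ! \
  fpsdisplaysink video-sink=fakesink text-overlay=false sync=false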
Model config file
[property]
model-engine-file=/opt/nvidia/deepstream/deepstream-5.1/deepstream-data/model2.engine
# network-type=2 selects a segmentation network
network-type=2
# network-mode=2 selects FP16, matching the --fp16 engine build
network-mode=2
segmentation-output-order=0
workspace-size=4000
gie-unique-id=1
# batch size the TensorRT engine was optimised for
batch-size=4
segmentation-threshold=0.0
# 1-channel (grayscale) input
infer-dims=1;288;768
num-detected-classes=18
# model-color-format=2 selects grayscale
model-color-format=2
process-mode=1
[class-attrs-all]
roi-top-offset=0
roi-bottom-offset=0
detected-min-w=0
detected-min-h=0
detected-max-w=0
detected-max-h=0
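As a sanity check that this exact engine file still delivers the latency measured below on this machine, trtexec can re-time an already-built engine without rebuilding it:

trtexec --loadEngine=/opt/nvidia/deepstream/deepstream-5.1/deepstream-data/model2.engine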
TensorRT engine
The segmentation model (ONNX) follows the U-Net architecture and was optimised with the TensorRT tool trtexec:
trtexec --explicitBatch --onnx=model.onnx --saveEngine=model.engine --workspace=3500 --fp16 --shapes=input:4x1x288x768
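Since the mux currently delivers single-frame batches, one alternative would be a dynamic-batch engine, so nvinfer could run batch 1 without a shape mismatch. A sketch (it assumes the ONNX model was exported with a dynamic batch dimension and that the input binding is named input):

trtexec --onnx=model.onnx --saveEngine=model_dynamic.engine --workspace=3500 --fp16 \
  --minShapes=input:1x1x288x768 --optShapes=input:4x1x288x768 --maxShapes=input:4x1x288x768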
After building the TensorRT engine, trtexec reported a GPU latency of around 20 ms for the full 4x1x288x768 input, i.e. about 5 ms per frame.
[04/13/2021-12:55:16] [I] Starting inference
[04/13/2021-12:55:19] [I] Warmup completed 0 queries over 200 ms
[04/13/2021-12:55:19] [I] Timing trace has 0 queries over 3.05046 s
[04/13/2021-12:55:19] [I] Trace averages of 10 runs:
[04/13/2021-12:55:19] [I] Average on 10 runs - GPU latency: 19.4454 ms - Host latency: 24.6738 ms (end to end 38.2628 ms, enqueue 0.971362 ms)
[04/13/2021-12:55:19] [I] Average on 10 runs - GPU latency: 19.5819 ms - Host latency: 24.7881 ms (end to end 38.6303 ms, enqueue 0.759543 ms)
[04/13/2021-12:55:19] [I] Average on 10 runs - GPU latency: 19.5878 ms - Host latency: 24.7861 ms (end to end 38.7644 ms, enqueue 1.10361 ms)
[04/13/2021-12:55:19] [I] Average on 10 runs - GPU latency: 20.2285 ms - Host latency: 25.5441 ms (end to end 39.8898 ms, enqueue 0.862378 ms)
[04/13/2021-12:55:19] [I] Average on 10 runs - GPU latency: 19.4353 ms - Host latency: 24.6219 ms (end to end 38.2575 ms, enqueue 0.664331 ms)
[04/13/2021-12:55:19] [I] Average on 10 runs - GPU latency: 19.3747 ms - Host latency: 24.5787 ms (end to end 38.3651 ms, enqueue 0.906921 ms)
[04/13/2021-12:55:19] [I] Average on 10 runs - GPU latency: 20.4997 ms - Host latency: 25.7129 ms (end to end 40.4226 ms, enqueue 1.10041 ms)
[04/13/2021-12:55:19] [I] Average on 10 runs - GPU latency: 19.07 ms - Host latency: 24.3139 ms (end to end 37.8092 ms, enqueue 1.05979 ms)
[04/13/2021-12:55:19] [I] Average on 10 runs - GPU latency: 19.7233 ms - Host latency: 24.9192 ms (end to end 38.8653 ms, enqueue 1.10626 ms)
[04/13/2021-12:55:19] [I] Average on 10 runs - GPU latency: 21.8489 ms - Host latency: 27.036 ms (end to end 43.0392 ms, enqueue 1.10441 ms)
[04/13/2021-12:55:19] [I] Average on 10 runs - GPU latency: 21.7892 ms - Host latency: 27.0006 ms (end to end 43.0625 ms, enqueue 1.05925 ms)
[04/13/2021-12:55:19] [I] Average on 10 runs - GPU latency: 21.642 ms - Host latency: 26.8948 ms (end to end 42.9504 ms, enqueue 0.800122 ms)
[04/13/2021-12:55:19] [I] Average on 10 runs - GPU latency: 21.6922 ms - Host latency: 26.9041 ms (end to end 42.937 ms, enqueue 1.06628 ms)
[04/13/2021-12:55:19] [I] Average on 10 runs - GPU latency: 19.8508 ms - Host latency: 25.0652 ms (end to end 39.3866 ms, enqueue 1.15254 ms)
[04/13/2021-12:55:19] [I] Average on 10 runs - GPU latency: 19.3599 ms - Host latency: 24.5655 ms (end to end 38.2248 ms, enqueue 0.75398 ms)
[04/13/2021-12:55:19] [I] Host Latency
[04/13/2021-12:55:19] [I] min: 23.9841 ms (end to end 37.0608 ms)
[04/13/2021-12:55:19] [I] max: 29.1921 ms (end to end 45.4465 ms)
[04/13/2021-12:55:19] [I] mean: 25.427 ms (end to end 39.9245 ms)
[04/13/2021-12:55:19] [I] median: 24.9196 ms (end to end 38.98 ms)
[04/13/2021-12:55:19] [I] percentile: 28.6995 ms at 99% (end to end 44.957 ms at 99%)
[04/13/2021-12:55:19] [I] throughput: 0 qps
[04/13/2021-12:55:19] [I] walltime: 3.05046 s
[04/13/2021-12:55:19] [I] Enqueue Time
[04/13/2021-12:55:19] [I] min: 0.293945 ms
[04/13/2021-12:55:19] [I] max: 1.70859 ms
[04/13/2021-12:55:19] [I] median: 1.05591 ms
[04/13/2021-12:55:19] [I] GPU Compute
[04/13/2021-12:55:19] [I] min: 18.7598 ms
[04/13/2021-12:55:19] [I] max: 23.9597 ms
[04/13/2021-12:55:19] [I] mean: 20.2087 ms
[04/13/2021-12:55:19] [I] median: 19.7258 ms
[04/13/2021-12:55:19] [I] percentile: 23.554 ms at 99%
[04/13/2021-12:55:19] [I] total compute time: 3.0313 s
&&&& PASSED TensorRT.trtexec # trtexec --explicitBatch --onnx=model.onnx --saveEngine=model.engine --workspace=3500 --fp16 --shapes=input:4x1x288x768