DeepStream: Slow framerate (7 FPS) for TensorRT segmentation engine (20 ms GPU latency)

• Hardware Platform: Quadro RTX 3000 (Mobile)
• DeepStream Version: 5.1
• TensorRT: 7.2 nvcr.io/nvidia/tensorrt:21.03-py3
• NVIDIA GPU Driver Version: 460.39

I’m trying to implement a video segmentation pipeline in DeepStream. Currently, instead of using a video source, I’m just reading in the grayscale images I want to segment using multifilesrc. After decoding, nvstreammux assembles batches of size 1 before passing them to nvinfer (setting batch-size=4 on nvstreammux stalls the pipeline). However, since the engine was optimised for a batch size of 4, the nvinfer plugin is configured to process batches of 4 images.

The pipeline is extremely slow: processing 180 images takes about 24 seconds, i.e. around 7 frames per second, with only 10% GPU utilisation (as shown by nvidia-smi). The pipeline is assembled and executed directly with gst-launch-1.0 in conjunction with the nvinfer config file, and the output of the nvinfer plugin is sent to a fakesink. Normally, the model has a GPU latency of about 20 ms per batch (according to TensorRT), so I would assume the nvinfer plugin is what slows down the pipeline?

Pipeline

gst-launch-1.0 multifilesrc location=/opt/nvidia/deepstream/deepstream-5.1/deepstream-data/copied_images/%04d.jpg num-buffers=180 ! jpegparse ! nvv4l2decoder ! mux.sink_0 nvstreammux name=mux batch-size=1 batched-push-timeout=-1 width=768 height=288 attach-sys-ts=1 ! nvinfer interval=0 config-file-path=/opt/nvidia/deepstream/deepstream-5.1/deepstream-data/model_config.txt ! fakesink sync=false

Model config file

[property]
model-engine-file=/opt/nvidia/deepstream/deepstream-5.1/deepstream-data/model2.engine
# segmentation network
network-type=2
# FP16, since the TensorRT engine was built with --fp16 enabled
network-mode=2
segmentation-output-order=0
workspace-size=4000
gie-unique-id=1
# the TensorRT engine was optimised for a batch size of 4
batch-size=4
segmentation-threshold=0.0
# input dims for the grayscale images
infer-dims=1;288;768
num-detected-classes=18
# grayscale input
model-color-format=2
process-mode=1

[class-attrs-all]
roi-top-offset=0
roi-bottom-offset=0
detected-min-w=0
detected-min-h=0
detected-max-w=0
detected-max-h=0

TensorRT engine
The segmentation model (ONNX) follows the U-Net architecture and is optimised with the TensorRT tool trtexec.

trtexec --explicitBatch --onnx=model.onnx --saveEngine=model.engine --workspace=3500 --fp16 --shapes=input:4x1x288x768

After building the TensorRT engine, trtexec reported a mean GPU latency of about 20 ms for the input shape 4x1x288x768 (batch size 4).

[04/13/2021-12:55:16] [I] Starting inference
[04/13/2021-12:55:19] [I] Warmup completed 0 queries over 200 ms
[04/13/2021-12:55:19] [I] Timing trace has 0 queries over 3.05046 s
[04/13/2021-12:55:19] [I] Trace averages of 10 runs:
[04/13/2021-12:55:19] [I] Average on 10 runs - GPU latency: 19.4454 ms - Host latency: 24.6738 ms (end to end 38.2628 ms, enqueue 0.971362 ms)
[04/13/2021-12:55:19] [I] Average on 10 runs - GPU latency: 19.5819 ms - Host latency: 24.7881 ms (end to end 38.6303 ms, enqueue 0.759543 ms)
[04/13/2021-12:55:19] [I] Average on 10 runs - GPU latency: 19.5878 ms - Host latency: 24.7861 ms (end to end 38.7644 ms, enqueue 1.10361 ms)
[04/13/2021-12:55:19] [I] Average on 10 runs - GPU latency: 20.2285 ms - Host latency: 25.5441 ms (end to end 39.8898 ms, enqueue 0.862378 ms)
[04/13/2021-12:55:19] [I] Average on 10 runs - GPU latency: 19.4353 ms - Host latency: 24.6219 ms (end to end 38.2575 ms, enqueue 0.664331 ms)
[04/13/2021-12:55:19] [I] Average on 10 runs - GPU latency: 19.3747 ms - Host latency: 24.5787 ms (end to end 38.3651 ms, enqueue 0.906921 ms)
[04/13/2021-12:55:19] [I] Average on 10 runs - GPU latency: 20.4997 ms - Host latency: 25.7129 ms (end to end 40.4226 ms, enqueue 1.10041 ms)
[04/13/2021-12:55:19] [I] Average on 10 runs - GPU latency: 19.07 ms - Host latency: 24.3139 ms (end to end 37.8092 ms, enqueue 1.05979 ms)
[04/13/2021-12:55:19] [I] Average on 10 runs - GPU latency: 19.7233 ms - Host latency: 24.9192 ms (end to end 38.8653 ms, enqueue 1.10626 ms)
[04/13/2021-12:55:19] [I] Average on 10 runs - GPU latency: 21.8489 ms - Host latency: 27.036 ms (end to end 43.0392 ms, enqueue 1.10441 ms)
[04/13/2021-12:55:19] [I] Average on 10 runs - GPU latency: 21.7892 ms - Host latency: 27.0006 ms (end to end 43.0625 ms, enqueue 1.05925 ms)
[04/13/2021-12:55:19] [I] Average on 10 runs - GPU latency: 21.642 ms - Host latency: 26.8948 ms (end to end 42.9504 ms, enqueue 0.800122 ms)
[04/13/2021-12:55:19] [I] Average on 10 runs - GPU latency: 21.6922 ms - Host latency: 26.9041 ms (end to end 42.937 ms, enqueue 1.06628 ms)
[04/13/2021-12:55:19] [I] Average on 10 runs - GPU latency: 19.8508 ms - Host latency: 25.0652 ms (end to end 39.3866 ms, enqueue 1.15254 ms)
[04/13/2021-12:55:19] [I] Average on 10 runs - GPU latency: 19.3599 ms - Host latency: 24.5655 ms (end to end 38.2248 ms, enqueue 0.75398 ms)
[04/13/2021-12:55:19] [I] Host Latency
[04/13/2021-12:55:19] [I] min: 23.9841 ms (end to end 37.0608 ms)
[04/13/2021-12:55:19] [I] max: 29.1921 ms (end to end 45.4465 ms)
[04/13/2021-12:55:19] [I] mean: 25.427 ms (end to end 39.9245 ms)
[04/13/2021-12:55:19] [I] median: 24.9196 ms (end to end 38.98 ms)
[04/13/2021-12:55:19] [I] percentile: 28.6995 ms at 99% (end to end 44.957 ms at 99%)
[04/13/2021-12:55:19] [I] throughput: 0 qps
[04/13/2021-12:55:19] [I] walltime: 3.05046 s
[04/13/2021-12:55:19] [I] Enqueue Time
[04/13/2021-12:55:19] [I] min: 0.293945 ms
[04/13/2021-12:55:19] [I] max: 1.70859 ms
[04/13/2021-12:55:19] [I] median: 1.05591 ms
[04/13/2021-12:55:19] [I] GPU Compute
[04/13/2021-12:55:19] [I] min: 18.7598 ms
[04/13/2021-12:55:19] [I] max: 23.9597 ms
[04/13/2021-12:55:19] [I] mean: 20.2087 ms
[04/13/2021-12:55:19] [I] median: 19.7258 ms
[04/13/2021-12:55:19] [I] percentile: 23.554 ms at 99%
[04/13/2021-12:55:19] [I] total compute time: 3.0313 s
&&&& PASSED TensorRT.trtexec # trtexec --explicitBatch --onnx=model.onnx --saveEngine=model.engine --workspace=3500 --fp16 --shapes=input:4x1x288x768

You can try setting “batched-push-timeout=2000 live-source=1” on nvstreammux.
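
For example, adapting the nvstreammux element from the pipeline above (a sketch; only the nvstreammux properties change, everything else stays the same):

... ! nvv4l2decoder ! mux.sink_0 nvstreammux name=mux batch-size=1 batched-push-timeout=2000 live-source=1 width=768 height=288 attach-sys-ts=1 ! nvinfer ...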

Thanks for your reply. I tried it out, but the framerate is the same (~25 seconds to process the 180 images). Actually, I’m not sure why I should set it as a live source. Also, do I need the batched-push-timeout since I have a batch size of 1 in nvstreammux?

You are using still images but have not set an FPS with multifilesrc, so the pipeline may run at an arbitrary FPS. See the multifilesrc documentation.

If you have set a proper FPS with multifilesrc, you don’t need to set “live-source=1” on nvstreammux. It is better to set “batched-push-timeout=2000” even when the batch size is 1.
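
One common way to give multifilesrc a frame rate is through its caps property, so that the buffers carry timestamps (a sketch assuming 30 FPS; the exact caps may need adjusting for your JPEG input):

multifilesrc location=/opt/nvidia/deepstream/deepstream-5.1/deepstream-data/copied_images/%04d.jpg caps="image/jpeg,framerate=30/1" num-buffers=180 ! jpegparse ! nvv4l2decoder ! ...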

You need to understand the plugin well if you want to use it.

Thanks for the advice. I thought it would process the pictures as fast as possible. However, with live-source enabled and the batched-push-timeout set, inference is still far too slow, with only 15% GPU utilisation. Should I investigate the pipeline elements before nvinfer further, or do you think nvinfer is the problem?

You may need to check the speed of the pipeline without nvinfer.
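
For example, something along these lines should show the throughput of everything except inference (a sketch based on the original pipeline; fpsdisplaysink is a standard GStreamer element that simply reports the measured FPS in place of nvinfer):

gst-launch-1.0 -v multifilesrc location=/opt/nvidia/deepstream/deepstream-5.1/deepstream-data/copied_images/%04d.jpg num-buffers=180 ! jpegparse ! nvv4l2decoder ! mux.sink_0 nvstreammux name=mux batch-size=1 batched-push-timeout=2000 width=768 height=288 ! fpsdisplaysink text-overlay=false video-sink=fakesink sync=false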

I inspected the pipeline with Nsight. I noticed that between the inference of two batches there is a delay of around 90 ms. I assume this is caused by dequeueOutputAndAttachMeta. I don’t see any GPU activity for that part, so I assume it is done on the CPU? If I change the network type to classification (just to see the difference), there is no such delay between two batches.

Nsight trace (segmentation setting)

Nsight trace (classification setting, same network)
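
In case someone wants to reproduce the trace: one way to profile a gst-launch pipeline with Nsight Systems is a command along these lines (the output name and trace flags here are just examples):

nsys profile -o deepstream_trace --trace=cuda,nvtx,osrt gst-launch-1.0 <pipeline from above>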

I’m not sure yet what the post-processing for the segmentation network involves, but it seems incredibly slow given that I only want to do an argmax. Maybe it would be better to add the argmax to my TensorRT engine, but I still want the output mask, so I guess I need to keep the segmentation setting for the nvinfer plugin? Or should I change it to classification and attach the raw output tensor as metadata? This might be a good solution, but I don’t know if the “classification setting” can handle a 2D output map.
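
If the raw-tensor route is the way to go, I assume the config change would look roughly like this (untested sketch; as far as I understand, output-tensor-meta makes gst-nvinfer attach the raw output layers as NvDsInferTensorMeta, which I would then have to argmax myself in a pad probe):

# switch from segmentation (2) to classification (1) post-processing
network-type=1
# attach the raw output tensor as user metadata instead of relying on built-in parsing
output-tensor-meta=1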

All of the post-processing you mentioned is open source. You can find it in /opt/nvidia/deepstream/deepstream/sources/libs/nvdsinfer on your device.

Hello! I’m also experiencing a huge performance drop when I switch my model from classification to segmentation type. I see two possible reasons why this happens:

  1. The segmentation output is much larger than a typical classification output, in my case a 720x1280-pixel mask, so there is more overhead in transferring such a tensor from GPU to CPU.
  2. The built-in DeepStream post-processing function that translates the segmentation output layer into a segmentation mask, i.e. takes the argmax for each pixel of the output, is slow.

Can you please let us know whether one of these options is the real cause of the problem, or whether it is something else?

Hi Vlad.Vinogradov,

Please open a new topic for this issue. Thanks.