PGIE component latency

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU) GPU
• DeepStream Version Docker 6.3
• JetPack Version (valid for Jetson only)
• TensorRT Version
• NVIDIA GPU Driver Version (valid for GPU only)
• Issue Type( questions, new requirements, bugs)
• How to reproduce the issue ? (This is for bugs. Including which sample app is using, the configuration files content, the command line used and other details for reproducing)
• Requirement details( This is for new requirement. Including the module name-for which plugin or for which sample application, the function description)

I’m running deepstream-app -c deepstream_app_config.txt, and when I increase the batch-size in config_infer_primary_yoloV8.txt (provided by this repo), the primary_gie component latency doesn’t change, which confuses me. Shouldn’t the component latency decrease, in theory, as the batch-size increases?
I would therefore like to know how DeepStream uses the batch during inference. Does GstNvstreammux feed the batched frames to GstNvinfer serially or in parallel?

deepstream_app_config.txt (3.3 KB)
config_infer_primary_yoloV8.txt (741 Bytes)
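For context, the two batch sizes in question live in two different files. This is a sketch assuming the standard deepstream-app config layout; the values are hypothetical and the attached files may differ:

```ini
# deepstream_app_config.txt (excerpt, hypothetical values)
[streammux]
batch-size=8

[primary-gie]
config-file=config_infer_primary_yoloV8.txt
```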

That is not true. The nvinfer batch-size only impacts the TensorRT inferencing time.
The nvinfer component is a GStreamer plugin; the pre-processing, inferencing, and post-processing are all done inside nvinfer, asynchronously in different threads. DeepStream SDK FAQ - Intelligent Video Analytics / DeepStream SDK - NVIDIA Developer Forums
For YOLOv7 and YOLOv8, the post-processing is relatively complicated, and if it is done on the CPU it takes much longer than the TensorRT inferencing. So the nvinfer latency is mainly determined by the post-processing, not by the TensorRT inferencing, and the nvinfer batch-size only affects the TensorRT inferencing.
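A toy latency model may make this concrete. This is not DeepStream code, and the millisecond figures are made up for illustration: if CPU post-processing scales per frame while TensorRT runs the whole batch in one call, the per-frame latency barely improves as the batch grows.

```python
# Illustrative model (not DeepStream code): per-batch nvinfer latency when
# CPU post-processing dominates. All timing numbers are made up.
def component_latency(batch_size, infer_per_batch_ms=4.0, postproc_per_frame_ms=10.0):
    # TensorRT processes the whole batch in one call; CPU post-processing
    # still walks every frame's detections one by one.
    return infer_per_batch_ms + batch_size * postproc_per_frame_ms

print(component_latency(1))      # 14.0 ms per batch of 1
print(component_latency(8))      # 84.0 ms per batch of 8
print(component_latency(8) / 8)  # 10.5 ms per frame: barely better than 14.0
```

Under these assumptions, batching cuts the inference share of the latency but leaves the dominant post-processing term untouched, which matches the behavior described above.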

It depends on both the nvinfer batch-size and the nvstreammux batch-size. For example, if you batch the frames with an nvstreammux batch-size of 4 while inferencing with an nvinfer batch-size of 1, the frames in the batch will be inferenced one by one.
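The serialization described above can be sketched as simple arithmetic (an illustration with made-up batch sizes, not DeepStream code): nvinfer splits each muxed batch into chunks of its own batch-size.

```python
import math

# How many TensorRT inference calls does one muxed batch require?
def trt_calls(mux_batch_size, infer_batch_size):
    # nvinfer consumes the muxed batch in chunks of its own batch-size
    return math.ceil(mux_batch_size / infer_batch_size)

print(trt_calls(4, 1))  # 4 separate calls: frames inferenced one by one
print(trt_calls(8, 8))  # 1 call for the whole batch
```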

gst-nvinfer is open source, please refer to the source code for more details.

Thanks for your response.
As for the nvinfer batch-size, where can I find it? In the PGIE config-file?

Yes. With 8 sources, I set the nvstreammux batch-size to 8 and then changed the batch-size in the PGIE config-file from 1 to 8; at batch-size = 8 the component latency is almost unchanged. Am I right to understand that the primary_gie component latency is not exactly YOLOv8n’s inference time for the 8 images batched upstream by nvstreammux?
Also, since the PGIE component latency is composed of pre-processing, inferencing, and post-processing as you said, the pre-processing and post-processing time should not change much if the nvstreammux batch-size stays the same. When the batch-size in the config-file is larger, the model’s inference time should be shorter, so shouldn’t the overall PGIE component latency also be lower? I’m not sure whether I’m understanding this correctly, so if there is a problem I hope you can give me a more detailed explanation.
Meanwhile, if I just want to measure the model’s pure inference time, how can I do that?
Thanks again; I’m sincerely looking forward to your reply.

Yes. It can also be configured via the gst-nvinfer batch-size property. Gst-nvinfer — DeepStream 6.3 Release documentation
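For reference, the setting in the PGIE config-file looks like this (an excerpt with a hypothetical value, assuming the standard nvinfer config layout):

```ini
# config_infer_primary_yoloV8.txt (excerpt, hypothetical value)
[property]
batch-size=8
```

The same value can also be set as a property on the element itself, e.g. `nvinfer batch-size=8` in a gst-launch-1.0 pipeline; a property set on the element overrides the config-file value.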

I have explained what the primary_gie component latency means in PGIE component latency - #3 by Fiona.Chen

Thanks. If I want to measure the pre-processing, inference, and post-processing times separately, how should I go about it?

gst-nvinfer is open source; you can measure these by adding timing code to the source.

Thanks. I’m wondering why the cudaEventSynchronize call in dequeueOutputBatch takes so long. It is supposed to just be waiting for the inference results to be copied from the GPU to the CPU.

sources = 8, batch (nvstreammux) = 8, batch (nvinfer) = 8
cudaEventSynchronize time = 13.2 ms

I’m very curious about the time it takes.

Yes. That is why we try to avoid such operations. For example, we use CUDA to do the post-processing for YOLOv7 in this sample: yolo_deepstream/deepstream_yolo at main · NVIDIA-AI-IOT/yolo_deepstream

Thanks, I will try it later.

I ran the CLI as the repo suggested, but the result doesn’t look right.
The video shows the results before post-processing, and the Frame Latency and dequeueOutputBatch time look worse than before; it doesn’t seem to have the desired effect.