Latency Issue in DeepStream 6.3 when doing batched inference

Hello,

I am currently working on a DeepStream 6.3 C++ project that processes two RTSP streams as input. My application is similar to the DeepStream reference application, where each stream is decoded and converted individually before being muxed together and propagated through various networks.
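
For context, here is a rough gst-launch-1.0 sketch of the topology (the element names are the standard DeepStream ones; the RTSP URLs and the nvinfer config path are placeholders, and my actual C++ pipeline differs in details):

gst-launch-1.0 \
  nvstreammux name=mux batch-size=2 width=1280 height=720 live-source=1 ! \
  nvinfer config-file-path=<pgie_config.txt> ! fakesink \
  rtspsrc location=rtsp://<camera_1> ! rtph264depay ! h264parse ! nvv4l2decoder ! nvvideoconvert ! mux.sink_0 \
  rtspsrc location=rtsp://<camera_2> ! rtph264depay ! h264parse ! nvv4l2decoder ! nvvideoconvert ! mux.sink_1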

I have been experimenting with batch sizes and noticed the following:

  1. When batching the two streams together, the latency for my first neural network is 10ms.
  2. When not batching the streams, the latency per stream is 4.5ms. The total latency from the start of inference for the first stream to the finish for the slower stream is 7ms.

In both cases I used the same engine file, which was built with a batch size of 2.

These results seem counterintuitive to me: a single batched inference over both frames takes 10ms, whereas running the frames separately costs at most 2 × 4.5ms = 9ms of inference time and only 7ms of wall-clock time. Could you provide an explanation for this discrepancy, or is there something I might be doing wrong?

Thank you in advance for your assistance.

Best regards,
David

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU): Jetson Orin NX
• DeepStream Version: 6.3
• JetPack Version (valid for Jetson only): 5.1.2
• TensorRT Version: 8.5.2.2
• NVIDIA GPU Driver Version (valid for GPU only): 11.4.315
• Issue Type (questions, new requirements, bugs): questions/bug
• How to reproduce the issue? (This is for bugs. Include which sample app is used, the configuration file content, the command line used, and other details for reproducing.)
• Requirement details (This is for new requirements. Include the module name, i.e. which plugin or sample application, and the function description.)

Please refer to DeepStream SDK FAQ - Intelligent Video Analytics / DeepStream SDK - NVIDIA Developer Forums to make sure the nvstreammux is configured correctly.

Please provide the complete pipeline and configurations you are using.

Hello @Fiona.Chen,

Thanks for your reply. I read through the FAQ and checked whether my nvstreammux is configured correctly; everything was fine with my configuration.

Regarding reproducibility, I have recreated my scenario inside the DeepStream reference application. There I observed the same phenomenon, although in a far less drastic way: the batched version was only slightly slower.

The steps to reproduce are:

  1. Re-encode the sample video to have no B-frames:
cp /opt/nvidia/deepstream/deepstream/samples/streams/sample_720p.mp4 .
ffmpeg -i sample_720p.mp4 -c:v libx264 -profile:v main -bf 0 -an sample_720p_new.mp4
  2. Stream the new video:
#!/bin/bash

# Start the rtsp-simple-server in the background
./rtsp-simple-server rtsp-simple-server.yml &

# Give the server a few seconds to start up
sleep 5

ffmpeg -re -stream_loop -1 -i sample_720p_new.mp4 -r 30 -c copy  -f rtsp rtsp://localhost:8554/teststream1 &

# Wait for all background processes to complete
wait
  3. Run the DeepStream reference application, once with batching and once without (set batch-size=1 inside the streammux):
cd /opt/nvidia/deepstream/deepstream-6.3/sources/apps/sample_apps/deepstream-app
sudo NVDS_ENABLE_LATENCY_MEASUREMENT=1 NVDS_ENABLE_COMPONENT_LATENCY_MEASUREMENT=1 ./deepstream-app -c /opt/nvidia/deepstream/deepstream/samples/configs/deepstream-app/batching_test.txt > ~/performance_with_batching_rtsp.txt
-> then set batch-size=1 in the [streammux] section of batching_test.txt and rerun:
sudo NVDS_ENABLE_LATENCY_MEASUREMENT=1 NVDS_ENABLE_COMPONENT_LATENCY_MEASUREMENT=1 ./deepstream-app -c /opt/nvidia/deepstream/deepstream/samples/configs/deepstream-app/batching_test.txt > ~/performance_without_batching_rtsp.txt
  4. Use a Python script to collect the timings:
$ python performance_calculator.py performance_with_batching_rtsp.txt
Average time difference: 8.693900 ms
Median time difference: 8.223145 ms
Quantiles time difference: [6.7068359375, 6.72021484375, 6.7470703125, 7.22158203125, 8.22314453125, 9.25849609375, 10.255859375, 11.316259765625, 11.43271484375] ms
$ python performance_calculator.py performance_without_batching_rtsp.txt
Average time difference: 7.884207 ms
Median time difference: 7.291992 ms
Quantiles time difference: [6.01806640625, 6.028076171875, 6.0439453125, 6.27353515625, 7.2919921875, 8.215673828125, 9.295703125, 10.253515625, 10.760009765625] ms

Note that this script heavily favors the batched version, since it calculates the per-frame time as follows:
time of frame x = max(out_time_pgie_source_0, out_time_pgie_source_1) - min(in_time_pgie_source_0, in_time_pgie_source_1)
With this calculation, the time the non-batched version takes for both input sources per frame is therefore overestimated.
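
For reference, the core of the script is essentially the following (a minimal sketch; parsing the NVDS latency log lines into per-frame, per-source pgie timestamps is omitted here):

import statistics

def frame_time_ms(in_times_ms, out_times_ms):
    # pgie in/out system timestamps (ms) for one frame, one entry per source.
    # This is the max(out) - min(in) formula described above.
    return max(out_times_ms) - min(in_times_ms)

def summarize(per_frame_ms):
    print(f"Average time difference: {statistics.mean(per_frame_ms):.6f} ms")
    print(f"Median time difference: {statistics.median(per_frame_ms):.6f} ms")
    # n=10 yields the nine decile cut points printed above.
    print(f"Quantiles time difference: {statistics.quantiles(per_frame_ms, n=10)} ms")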

All files used are attached:
performance_with_batching_rtsp.txt (3.2 MB)
performance_without_batching_rtsp.txt (3.2 MB)
performance_without_batching_rtsp_err.txt (5.2 KB)
performance_with_batching_rtsp_err.txt (5.2 KB)
performance_calculator_py.txt (2.6 KB)
batching_test.txt (4.3 KB)

I suspect that the problem lies within my RTSP stream, since the original sample stream works, while with the no-B-frame stream I get the error NVDEC_COMMON: NvDecGetSurfPinHandle : Surface not registered, as can be seen in the performance_without_batching_rtsp_err.txt and performance_with_batching_rtsp_err.txt files. However, this is just a guess.

Do you have an idea of how I could fix this error, and why the non-batched version runs faster overall than the batched version?

Thank you in advance.

What did you change for the “non-batching” version?

I just modified the batch-size parameter of the streammux from 2 to 1 inside the batching_test.txt config file.
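
For reference, the [streammux] section of batching_test.txt looks roughly like this (all other values are unchanged from the sample config):

[streammux]
gpu-id=0
# set to 1 for the non-batched run
batch-size=2
batched-push-timeout=40000
width=1280
height=720
live-source=1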

For your case, it seems the model is fast enough, so batch-size=1 will be better for a live stream, since nvstreammux will not wait to fill the batch.

Thank you for your reply. I was already aware of the waiting time and began timing only after the streammux had completed, specifically measuring from the start to the end of nvinfer. Even with this adjustment, the non-batched version still outperformed the batched one, which shouldn’t be happening.

Interestingly, this issue only occurs when I remove the B-frames from my video; otherwise it works as expected and the batched version is faster.

Hi @Fiona.Chen, just checking in. Any updates on this?

Appreciate your help!

Can you measure with the local mp4 file?
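
E.g. by switching the sources in batching_test.txt to the local file, something like (the path is a placeholder for wherever your re-encoded file is):

[source0]
enable=1
type=2
uri=file:///<path-to>/sample_720p_new.mp4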

The ffmpeg transcoding command does not generate any B-frames.
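
This can be verified with ffprobe, e.g.:

ffprobe -v error -select_streams v:0 -show_entries frame=pict_type -of csv sample_720p_new.mp4 | sort | uniq -c

If no frame,B lines show up, the stream contains no B-frames.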

There has been no update from you for a while, so we are assuming this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.