How to maximize inferences/sec in a deepstream pipeline

• Hardware Platform: Jetson Xavier NX
• DeepStream Version: 5.0
• JetPack Version: 4.4
• TensorRT Version: 7.1

I want to use deepstream on jetson to run inference on many IP cameras with high-quality object detection models. I have followed the instructions for objectdetector_Yolo and updated the dstest3_pgie_config.txt file for the deepstream-test3-app to use it. It works well for a single camera.

Now, I would like to do as many inferences as I can on the jetson for all the IP cameras, without needing to hand-tune parameters such as “interval” for every deployment. I want deepstream to do something like this pseudo-code:

while true {
  for each camera {
    frame = get_most_recent_frame(camera)
    run_inference(frame)
  }
}

Based on reading the documentation, it sounded like setting sync=0 and qos=true would achieve this, e.g., by adding the following to deepstream-test3-app:

g_object_set (G_OBJECT (sink), "sync", 0, "qos", TRUE, NULL);

However, when I compile and run the app, it runs very slowly. It seems it is processing every frame instead of dropping them as necessary, as I thought setting qos=TRUE would do. I can tell this because the timestamps of the IP cameras shown on the display fall way behind real time.

How can I modify deepstream-test3-app to do as many inferences as possible without falling behind real time?

DeepStream does not work this way; it does not process the camera inputs one by one.

DeepStream handles the inputs in batch mode, like below:

camera input 1 --> decoding --> |
… (other streams) …             | --> nvstreammux (batches the frames) --> inference --> tracking --> …
camera input n --> decoding --> |
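In deepstream-app terms, this batching stage is configured through the [streammux] group of the config file. A minimal illustrative fragment (the values here are examples to show the relevant keys, not recommendations) might look like:

```
[streammux]
# batch-size should normally match the number of input streams
batch-size=6
# microseconds to wait for a full batch before pushing a partial one
batched-push-timeout=40000
width=1920
height=1080
```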

Normally, inference is the bottleneck. You could refer to https://github.com/NVIDIA-AI-IOT/deepstream_tlt_apps#measure-the-inference-perf to measure the inference performance and see how many fps your model can run at.

Thanks!

Thanks for the clarification.

Doesn’t that mean, though, that the GPU must be significantly under-utilized in order to have smooth output?

How do I get smooth output if I am using a model that requires, say, 500ms to run inference? Even if I run only 1 FPS of inference on a 15FPS camera, inference is holding up the whole pipeline for 500ms, and meanwhile frames are not being rendered. At least, that is my observation using deepstream-app with 6 RTSP cameras at 15fps, and using yolov3 in FP16 mode. The GPU utilization shows spikes of 100% and periods of 0%, and the video output is very jumpy.
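To put rough numbers on that observation (a back-of-envelope sketch, not a measurement): with a 15 fps camera and a 500 ms synchronous inference call, several frames pile up behind every inference:

```python
# Rough arithmetic for a pipeline blocked by synchronous inference.
CAMERA_FPS = 15
INFERENCE_MS = 500

frame_interval_ms = 1000 / CAMERA_FPS  # ~66.7 ms between frames
# Frames that arrive (per camera) while one inference call blocks the pipeline.
frames_queued = int(INFERENCE_MS // frame_interval_ms)

print(f"frame interval: {frame_interval_ms:.1f} ms")
print(f"frames queued per camera during one inference: {frames_queued}")
```

Those ~7 queued frames per camera are then rendered in a burst once inference returns, which matches the jumpy output described above.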

I have set batched-push-timeout=150000. Otherwise I noticed that only the first one or two cameras have inference run, and the rest seem to be aborted so that the pipeline could keep up. But this is not desirable, since I’m not getting inferences on many cameras.

Is there any sort of buffering I can do to smooth this out? I’d rather have the video delayed by 1 second and have smooth output rather than have jumpy output due to a heavy inference cost, and have inference run on all cameras rather than just the first few.

Doesn’t that mean, though, that the GPU must be significantly under-utilized in order to have smooth output?

If the output depends on the GPU inference output, and GPU inference takes much longer than the expected render interval, then yes, it will.
But in this situation, the processing mode below will run into the same problem, and may be even worse, since batch processing is more efficient than processing the frames one by one.

while true {
  for each camera {
    frame = get_most_recent_frame(camera)
    run_inference(frame)
  }
}

For the yolov3 issue, I think you could refer to https://github.com/NVIDIA-AI-IOT/deepstream_tlt_apps#measure-the-inference-perf to evaluate how many fps yolov3 can run at on your GPU/system, and then decide how many camera streams, and at what fps, your DS-based application can handle.
Also, you could skip some frames for inference; that is, you can render every frame but don't need to run inference on every frame of each stream.

Hi mchi, thank you for your responses.

Regarding your suggestion to follow https://github.com/NVIDIA-AI-IOT/deepstream_tlt_apps#measure-the-inference-perf , I was unable to do so. Perhaps this is not surprising, given that it is a long list of steps. Given that Jetson is a hardware platform owned and controlled by NVIDIA, why is this so complicated? Why not simply provide a docker image or binary? Why does it take so many complex steps to do something as simple as measure inference time? Can I measure it using deepstream-app instead?

That aside, I think I may have confused the issue for you. Let me restate my question in very simple terms:

  1. Suppose I have a model that requires 500ms to perform inference on a batch size of 6.
  2. Suppose I have 6 RTSP cameras operating at 15fps
  3. For now, I can accept performance as low as 1 inference per second, i.e., I am only asking the GPU to perform 500ms of work every second, such that it is 50% loaded.

Is it at all possible to get smooth output from deepstream under these conditions? Or does the deepstream architecture require that the entire pipeline be held for 500ms while inference happens, and only then unblocked so that the remaining frames can be rendered? Is there a buffering step available post-inference to smooth out the jitter?
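For concreteness, here is the arithmetic behind that scenario (an illustrative sketch using the numbers above):

```python
# The scenario: 6 cameras at 15 fps, one 500 ms batch inference per second.
NUM_CAMERAS = 6
CAMERA_FPS = 15
BATCH_INFERENCE_MS = 500
INFERENCES_PER_SEC = 1

# Fraction of each second the GPU spends on inference.
gpu_duty_cycle = BATCH_INFERENCE_MS * INFERENCES_PER_SEC / 1000
# Frames each camera produces between consecutive inferences.
frames_per_camera_between_inferences = CAMERA_FPS // INFERENCES_PER_SEC
# Total frames the display must render per second across all tiles.
frames_to_render_per_sec = NUM_CAMERAS * CAMERA_FPS

print(f"GPU duty cycle: {gpu_duty_cycle:.0%}")
print(f"frames per camera between inferences: {frames_per_camera_between_inferences}")
print(f"frames to render per second: {frames_to_render_per_sec}")
```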

Yes, there is a docker image:
https://docs.nvidia.com/metropolis/deepstream/plugin-manual/index.html#page/DeepStream%20Plugins%20Development%20Guide/deepstream_plugin_docker.html#wwpID0EIHA

I can accept performance as low as 1 inference per second

You can use the “interval” property https://docs.nvidia.com/metropolis/deepstream/plugin-manual/index.html#page/DeepStream%20Plugins%20Development%20Guide/deepstream_plugin_details.3.01.html#wwpID0E0OFB0HA to skip batches, do 1 inference/second, and still keep the output smooth.
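In the nvinfer config file the same property can be set directly; an illustrative fragment (the right interval value depends on your camera fps and target inference rate):

```
[property]
# number of consecutive batches to skip between inference calls;
# interval=14 on a 15 fps source gives roughly 1 inference per second
interval=14
```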

I have set interval=14 for my 15fps cameras, so that only one inference per second happens.

Using deepstream-app, I see all videos in the tiled display pause for about half a second while inference happens, then play the frames back very quickly to catch up. The display is very jerky. How can it be made smooth?

Okay, so it seems this jerkiness is unavoidable, because the rendering of the inferred frame has to wait 500ms, which blocks the rendering of the following frames.
To make the display render smoothly, no component in the pipeline can take longer than the display interval, e.g., 33ms for a 30fps display.

That is very unfortunate to hear. And also very surprising, because it makes deepstream completely unusable for what I would have thought was a very common use-case for the design of the Jetson platform, e.g. as an edge device with display capability that also has the GPU capacity to run high-quality models.

Knowing this, I will abandon deepstream and go back to building custom software that can do this better.

Thanks for the replies to my questions, even though the answer is disappointing.

You can probably modify the nvinfer plugin to run inference asynchronously, from a separate thread. In parallel, you would add an internal queue that delays the batch buffers by a configurable amount. You would set the queue delay to at least the maximum inference time.

So then when a buffer enters the queue, the inference starts in a different thread, and by the time the buffer exits the queue, the inference results would be ready to attach as metadata.
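As a language-neutral sketch of that idea (Python here rather than an actual nvinfer patch; `run_inference`, the frame strings, and the 0.5 s delay are all stand-ins): buffers sit in a delay queue while inference runs in a worker thread, so the result is ready by the time the buffer is released downstream:

```python
import threading
import time
from collections import deque

QUEUE_DELAY_S = 0.5  # set to the maximum expected inference time

def run_inference(frame):
    """Stand-in for the real model; pretend it takes ~0.4 s."""
    time.sleep(0.4)
    return {"objects": ["car"], "frame": frame}

class AsyncInferQueue:
    """Delay each buffer by QUEUE_DELAY_S while inference runs in parallel."""
    def __init__(self):
        self.pending = deque()

    def push(self, frame):
        # Start inference immediately in a worker thread.
        result_slot = {}
        done = threading.Event()
        def worker():
            result_slot["meta"] = run_inference(frame)
            done.set()
        threading.Thread(target=worker, daemon=True).start()
        release_at = time.monotonic() + QUEUE_DELAY_S
        self.pending.append((release_at, frame, result_slot, done))

    def pop(self):
        # Release the oldest buffer once its delay has elapsed;
        # by then the inference result should already be attached.
        release_at, frame, result_slot, done = self.pending.popleft()
        time.sleep(max(0.0, release_at - time.monotonic()))
        done.wait()  # worst case: wait out an unusually slow inference
        return frame, result_slot["meta"]

q = AsyncInferQueue()
q.push("frame-0")
frame, meta = q.pop()
print(frame, meta["objects"])
```

The pipeline's render rate then depends only on the fixed queue delay, not on inference variance, which is the smoothing effect described above.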

This would achieve exactly what you want. I don’t know why this isn’t the default operation mode for nvinfer yet, as I agree with you, without this functionality the Jetson platforms are quite inefficient economically.

Edit: this feature would also make the interval setting obsolete, because the scheme described above would adjust automatically to the hardware's capability. I hope this will become standard in deepstream 6.


Agree, that seems like a possible approach. Or maybe a buffered sink right before display, which wouldn’t require doing anything async. Your async mode would get the most out of the GPU, a simple buffer sink would at least make things smooth, though still leave GPU cycles on the table.

I wonder, does “interval” exist just for headless operation? Same for the whole concept of a tracker. Neither idea seems useful for doing display if the architecture expects you to have inference run at full camera fps for a smooth display.

Another idea that seems not possible in the deepstream architecture is to do inference on a rolling basis across cameras (which is what I was getting at in my initial post). For my case of 6+ cameras, if I can do a single inference in < 100ms vs a batch of 6 in 500ms, I’d rather do that once per second per camera, since the 100ms slowdown for a single camera would be barely noticeable for a 15fps stream.

This should already be possible by setting a very small value for batched-push-timeout on nvstreammux, so that the batch size is always 1 and inference happens on one image at a time. Combine this with the nvinfer interval property and you can roughly obtain what you want without any custom code.
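Under that approach, the relevant settings might look like this (illustrative values in deepstream-app config style; tune them per deployment):

```
[streammux]
batch-size=1
# a very small timeout (microseconds) pushes each frame as its own batch
batched-push-timeout=1

[primary-gie]
# skip batches so each camera is inferred on roughly once per second at 15 fps
interval=14
```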