RTSP decoding artifacts and jitter in DeepStream when the pipeline cannot process all frames (P/B frame dependency issue)

Please provide the following information when requesting support.

  • Hardware Platform (Jetson / GPU): Jetson
  • DeepStream Version: 6.3 / 7.1 (native and containerized)
  • JetPack Version: 5.1 / 6.2
  • Hardware tested:
    • NVIDIA Jetson Xavier NX
    • NVIDIA Jetson Orin NX

Hello,

I noticed that many developers seem to experience RTSP decoding quality issues when using NVIDIA DeepStream SDK on NVIDIA Jetson devices.

Several discussions seem related to this topic:

I conducted my own investigation and would like to share some observations.

Test conditions

To rule out external factors, we tested multiple configurations:

  • different network setups

  • a dedicated isolated network with only one camera and the Jetson

  • both H264 and H265 streams

  • multiple DeepStream versions (6.3 and 7.1)

The behavior remained identical in all cases.

Observed behavior

The issue appears when the pipeline cannot process all incoming frames from the RTSP stream.

This typically happens when the pipeline becomes GPU bottlenecked, for example when inference is enabled.

When this occurs, the decoded video sometimes shows:

  • jitter

  • visual artifacts

  • temporal instability

Experiments

1 — Stream containing only I-frames

If the RTSP stream is encoded only with I-frames, decoding works perfectly.

The output stream remains stable and artifact-free.


2 — Stream containing I / P / B frames (normal GOP)

When using a normal GOP structure with P and B frames, artifacts and jitter appear when the pipeline cannot process all frames.


3 — Decoding only I-frames

If the decoder is configured to drop P/B frames and decode only I-frames, the output becomes stable again.

However:

  • effective FPS decreases

  • temporal resolution is reduced


4 — Large GOP experiment

Another experiment produced an interesting result.

I configured the encoder with one I-frame followed by the maximum number of P-frames (255).

When the pipeline is able to process all frames using a fast model, the output stream has no artifacts or jitter, even with this very large GOP.

This suggests that the issue is strongly correlated with frame dropping during decoding rather than the GOP structure itself.
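The contrast between the four experiments is consistent with how inter-frame prediction works: a P-frame can only be reconstructed from earlier pictures, so a single missing frame affects everything up to the next I-frame. The toy model below illustrates this. It is a simplified sketch and not DeepStream code: `decodable_frames` is a hypothetical helper, B-frames are left out, and each P-frame is assumed to reference only the frame immediately before it.

```python
def decodable_frames(gop, lost):
    """Return the indices of frames that decode without artifacts.

    gop  : string of frame types, e.g. "IPPPIPPP" (B-frames omitted for simplicity)
    lost : set of frame indices that never reached the decoder
    """
    clean = []
    chain_ok = False  # True while every reference since the last I-frame is present
    for index, frame_type in enumerate(gop):
        if index in lost:
            chain_ok = False   # later P-frames now reference a missing picture
            continue
        if frame_type == "I":
            chain_ok = True    # an I-frame needs no reference and resets the chain
        if chain_ok:
            clean.append(index)
    return clean


gop = "IPPPPPPPIPPPPPPP"
print(decodable_frames(gop, lost=set()))        # nothing lost: every frame is clean
print(decodable_frames(gop, lost={2}))          # one lost P-frame damages the rest of its GOP
print(decodable_frames("I" * 16, lost={2}))     # I-frame-only stream: only the lost frame is missing
```

In this model the GOP length only determines how long the damage lasts after a loss. With no loss at all, even a very long GOP decodes cleanly, which matches the result of experiment 4.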

Additional tests performed

I tested many parameters available in DeepStream and GStreamer, including:

Despite these tests, the behavior remained the same.

Core question

In real deployments, it is very common that a pipeline cannot guarantee processing every frame, especially when GPU load varies.

Therefore my question is:

What is the recommended approach to guarantee stable RTSP decoding quality when the pipeline cannot process all incoming frames?

The pipeline can process every frame. The key problem is that the pipeline cannot guarantee that frames are handled in time, so delayed frames need extra buffering. No DeepStream pipeline can provide unlimited buffering, so the RTSP client may drop packets once its buffers overflow. This is not a decoder problem; it is packet loss caused by delay.

The key is to avoid RTSP packet loss. Either make the downstream part of the pipeline fast enough to keep up with the sources (e.g. avoid overloading the GPU), or make the RTSP source slower (reduce the sending FPS on the server side).
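A rough back-of-the-envelope model of this point: if the source delivers frames faster than the pipeline consumes them, any finite buffer fills after a predictable time. The sketch below is only arithmetic. `seconds_until_overflow` is a hypothetical helper, and the capacity of 100 frames is an illustrative assumption, not a DeepStream default.

```python
def seconds_until_overflow(capacity_frames, source_fps, pipeline_fps):
    """Time until a buffer of `capacity_frames` is full when input outpaces output."""
    if pipeline_fps >= source_fps:
        return float("inf")  # the pipeline keeps up, so the buffer never fills
    # The backlog grows by (source_fps - pipeline_fps) frames every second.
    return capacity_frames / (source_fps - pipeline_fps)


# 25 FPS camera, pipeline sustaining 15 FPS, room for 100 buffered frames (assumed)
print(seconds_until_overflow(100, 25, 15))  # 10.0
```

A larger buffer only postpones the overflow. As long as the rates differ, something eventually has to be discarded, and the open question is where in the pipeline that happens.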

Hi,

Thanks for your explanation, I understand the point regarding buffering and delays.

I would like to clarify that I already tested an approach where frames are dropped after decoding in dec_que%d, with the idea that:

  • the decoder always receives a complete frame sequence (no missing references)

  • frames are dropped only later in the pipeline, by a leaky=2 queue, when processing cannot keep up

In theory, this should prevent artifacts since decoding would remain valid.
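The intended behavior of a queue that leaks downstream (leaky=2 discards the oldest queued buffer when the queue is full) can be modeled in a few lines. This is a pure-Python sketch of the idea and not GStreamer code. `run_leaky_queue` and its parameters are hypothetical, and the model assumes the producer is never blocked, which is exactly the assumption that fails in the observations that follow.

```python
from collections import deque


def run_leaky_queue(frames, max_size, consume_every):
    """Feed frames into a bounded queue that drops its oldest entry when full.

    The consumer takes one frame after every `consume_every` arrivals.
    """
    queue = deque(maxlen=max_size)  # a full deque silently discards the oldest item
    delivered = []
    for count, frame in enumerate(frames, start=1):
        queue.append(frame)                       # the producer never waits
        if count % consume_every == 0:
            delivered.append(queue.popleft())     # the slow consumer takes the oldest survivor
    return delivered


# 12 decoded frames, queue of 2, consumer three times slower than the producer
print(run_leaky_queue(range(12), max_size=2, consume_every=3))  # [1, 4, 7, 10]
```

In this idealized model every frame reaches the producer side, so a decoder placed before the queue would always see a complete sequence.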

However, in practice, this approach does not work.

Even with leaky queues placed after the decoder, I still observe:

  • jitter

  • visual artifacts

  • temporal instability

This suggests that the pipeline remains coupled, and that backpressure propagates upstream despite the presence of queues.

In other words:
👉 it seems we cannot guarantee that frame dropping happens strictly after decoding
👉 and therefore we cannot prevent frame loss from affecting the decoder state

Can you confirm?

This aligns with what I observe:

  • when the pipeline runs slower than the input FPS (e.g. 25 → 15 FPS)

  • buffers eventually fill up

  • and frame dropping still impacts decoding consistency

Also, to clarify my constraints:

  • I cannot reduce the RTSP input FPS (camera-side constraint)

  • I cannot significantly speed up the pipeline (GPU/inference constraint)

Given this, I am wondering if a more robust approach would be to fully decouple decoding from inference, for example by splitting into two pipelines:

  • one pipeline dedicated to RTSP → decode (running at full FPS)

  • another pipeline consuming frames independently for inference

This could potentially avoid backpressure affecting the decoder.

Do you think this kind of architecture is recommended or supported in DeepStream?

I performed additional debugging by monitoring all queues in the pipeline over time.

What I observe is that queues never actually fill up:
most of them stay very low (typically below ~10%), and some remain at 0%.

At the same time, I see frequent queue underruns across multiple stages.

This suggests that the pipeline is not locally congested (no queue saturation), but rather globally slowed down, with backpressure propagating early and preventing buffers from accumulating.

So even with non-leaky queues, it seems that stages are not effectively decoupled, and the throughput is regulated across the whole pipeline instead of building up in buffers.

This seems to confirm that frame dropping or slowdown happens upstream before queues can absorb the difference.

RTSP packet loss is not the same as frame dropping.

I think you are talking about the decoding artifacts in the original post. Frame dropping is a separate topic.

Assuming there is no network issue, avoiding RTSP packet loss requires the downstream elements to consume the received packets as soon as possible.

Suppose your inference pipeline can only handle 10 FPS while your decoding pipeline always produces 30 FPS. How will you handle the extra 20 frames per second?

Measuring the queues is not meaningful. In DeepStream, most components share video frame buffers instead of copying video frames between components. The video frame buffers are managed by an internal buffer pool, which means only a limited number of buffers are allocated and they are reused cyclically. Even if the queue plugin is allowed to queue an unlimited number of buffers, there are no extra buffers to queue.

Your inference pipeline is already GPU overloaded; copying buffers between components and using an unlimited buffer pool would make the workload even heavier.
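The effect of a fixed buffer pool can be illustrated with a small simulation. It is a toy model under stated assumptions, not DeepStream's actual allocator: `simulate`, the pool size of 4, and the tick-based pacing are all hypothetical. The queue itself is unbounded, yet it can never hold more buffers than the pool owns, and once the pool is empty the producer has to wait.

```python
from collections import deque


def simulate(pool_size, source_fps, sink_fps, seconds):
    """Return (max queue depth, producer stalls) for a fixed-size buffer pool."""
    free = pool_size      # buffers currently available in the pool
    queue = deque()       # unbounded queue between producer and consumer
    credit = 0            # integer accumulator that paces the consumer
    max_depth = 0
    stalls = 0
    for _ in range(source_fps * seconds):      # one iteration per source frame period
        if free > 0:
            free -= 1                          # take a buffer from the pool
            queue.append(object())
            max_depth = max(max_depth, len(queue))
        else:
            stalls += 1                        # no free buffer: the producer must wait
        credit += sink_fps
        while credit >= source_fps:            # consumer finishes one frame
            credit -= source_fps
            if queue:
                queue.popleft()
                free += 1                      # buffer goes back to the pool
    return max_depth, stalls


max_depth, stalls = simulate(pool_size=4, source_fps=30, sink_fps=10, seconds=1)
print(max_depth, stalls)
```

In this model the queue depth is capped by the pool size rather than by any queue limit, and the slowdown shows up as producer stalls instead of a growing queue.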

Thank you for these explanations, they really help clarify things.

I now fully understand the concept. Thanks for making it clear.

My goal is for the pipeline to handle the case where the camera sends a 30 FPS stream, even though the pipeline can only process 10 FPS. I don’t want to have to adjust the camera parameters depending on the pipeline’s current capacity, so ideally we should drop the extra frames.

However, from our previous discussions, I understand that it doesn’t seem possible to have part of the pipeline running at 30 FPS (decoding and producing raw frames) while the AI inference part runs at 10 FPS. From my experiments, the entire pipeline slows down once inference becomes the bottleneck.

One idea is to measure the actual FPS during runtime and dynamically adapt the frame interval or drop rate on the fly to match the pipeline’s current load on the Jetson. For example, if I launch multiple applications on the GPU and the pipeline now runs at 5 FPS instead of 10, I can increase frame dropping to still consume all RTSP packets without losing any.
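As a minimal sketch of that idea, the keep-one-in-N interval can be derived from the source rate and the throughput the inference stage currently sustains. `keep_interval` is a hypothetical helper. How the sustained FPS is measured and how the resulting value is applied to the pipeline are left out and would depend on the application.

```python
import math


def keep_interval(source_fps, sustained_fps):
    """Return N such that keeping 1 frame out of every N fits the sustained rate."""
    if sustained_fps <= 0:
        raise ValueError("sustained_fps must be positive")
    # Round up so the kept rate never exceeds what the pipeline can process.
    return max(1, math.ceil(source_fps / sustained_fps))


print(keep_interval(30, 10))  # 3 -> keep 1 frame in 3
print(keep_interval(30, 5))   # 6 -> keep 1 frame in 6
print(keep_interval(30, 40))  # 1 -> keep every frame
```

Rounding up trades a little throughput for safety. In practice the measured FPS is noisy, so some smoothing before recomputing the interval would probably be needed.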

Thanks for the clarification. It helps me a lot.

@Fiona.Chen Can you confirm?

If you add a custom component between the decoding part and the inference part to drop 20 frames per second, it is possible. The key is that the frame-dropping algorithm must be designed and implemented sensibly.
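One possible shape for such a dropping algorithm is an evenly spaced decimator that decides, frame by frame, whether a decoded frame is passed on to inference. The sketch below is an assumption about what a reasonable design could look like, not a DeepStream component. `FrameDecimator` is a hypothetical class, and wiring it into the pipeline (for example from a buffer pad probe placed after the decoder) is not shown.

```python
class FrameDecimator:
    """Keep `target_fps` out of every `source_fps` frames, spaced as evenly as possible."""

    def __init__(self, source_fps, target_fps):
        if not 0 < target_fps <= source_fps:
            raise ValueError("need 0 < target_fps <= source_fps")
        self.source_fps = source_fps
        self.target_fps = target_fps
        self._credit = 0  # integer accumulator, so no rounding drift over time

    def keep(self):
        """Call once per decoded frame; True means forward it, False means drop it."""
        self._credit += self.target_fps
        if self._credit >= self.source_fps:
            self._credit -= self.source_fps
            return True
        return False


decimator = FrameDecimator(source_fps=30, target_fps=10)
kept = [i for i in range(30) if decimator.keep()]
print(kept)  # [2, 5, 8, 11, 14, 17, 20, 23, 26, 29]
```

Because the decision is made on frames that are already decoded, the decoder still sees the full sequence and its reference chain stays intact. The accumulator also handles non-integer ratios such as 25 to 15 FPS without long runs of consecutive drops.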