High Latency and GPU Contention when running DeepStream (Python) + VSS on DGX Platform

Hi everyone,

I am running an NVIDIA DeepStream pipeline using Python bindings on an NVIDIA DGX platform. I am encountering significant performance degradation and resource contention, and I hope to get some insight into the cause.

The Setup:

  • Platform: NVIDIA DGX Spark

  • Software: DeepStream SDK (Python bindings)

  • Input: Single 1280x720 RTSP stream

  • Pipeline: RTSP Source → Decoder → nvinfer → Tracker → OSD → Sink

  • Additional Service: NVIDIA VSS (Video Search and Summarization)

The Problem: We are observing a drastic difference in performance depending on whether the VSS service is active.

  1. Scenario A (CV App Only): When running the Computer Vision application in isolation, everything works perfectly. The pipeline is smooth, and there is no lag.

  2. Scenario B (CV App + VSS): As soon as we enable NVIDIA VSS alongside the CV application, the performance degrades significantly. The application becomes very slow, and we observe massive delays and accumulated latency in stream processing. The GPU compute load spikes disproportionately (often hitting ~90% utilization or causing bottlenecks), even though the scene complexity hasn’t changed.
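One way to put numbers on the difference between the two scenarios is to measure how many frames per second actually leave the pipeline and compare that with the camera's frame rate. The following is a minimal sketch, not part of the original application: `ThroughputMeter` and `is_falling_behind` are hypothetical helpers, and the 30 FPS source rate in the usage note is an assumption, since the stream's frame rate is not stated above.

```python
import time
from collections import deque


class ThroughputMeter:
    """Sliding-window frames-per-second estimate from buffer arrival times."""

    def __init__(self, window_s=5.0):
        self.window_s = window_s
        self._stamps = deque()

    def tick(self, now=None):
        """Record one frame; intended to be called once per processed buffer."""
        now = time.monotonic() if now is None else now
        self._stamps.append(now)
        # Drop timestamps that have fallen out of the window.
        while self._stamps and now - self._stamps[0] > self.window_s:
            self._stamps.popleft()

    def fps(self):
        """Frames per second over the window; 0.0 until two frames are seen."""
        if len(self._stamps) < 2:
            return 0.0
        span = self._stamps[-1] - self._stamps[0]
        return (len(self._stamps) - 1) / span if span > 0 else 0.0


def is_falling_behind(measured_fps, source_fps, tolerance=0.9):
    """True when frames are processed slower than the source delivers them."""
    return measured_fps < source_fps * tolerance
```

Calling `tick()` from a buffer probe near the sink and logging `fps()` once per second in Scenario A and Scenario B would show whether throughput drops below the source rate (for example `is_falling_behind(meter.fps(), 30.0)`), which is the condition under which latency accumulates.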

Comparison with other hardware: Interestingly, we tested this exact same application configuration on a Brev.dev instance equipped with 2x L40S GPUs, and it worked correctly without these performance penalties. This suggests the issue might be specific to the DGX environment configuration or resource management between VSS and DeepStream on this specific hardware.

My Questions:

  1. Root Cause Analysis: What could be causing such high contention between VSS and the DeepStream Python app on the DGX platform specifically? It feels like they are fighting for the same resources despite the hardware being powerful.

  2. Debugging: How can I effectively debug this interference? Are there specific profiling tools (Nsight Systems/Compute) or VSS logs that would pinpoint where the bottleneck is occurring (e.g., memory bandwidth, CUDA cores, NVDEC)?

  3. Optimization:

    • Are there DeepStream config parameters that help reduce this specific contention?

    • Is the Python API adding significant overhead compared to C++ in this high-load scenario?

  4. Hardware Path: We are considering future hardware upgrades. Would migrating to the NVIDIA AGX Thor platform likely resolve these concurrency issues, or is this purely a software configuration problem?

Any guidance, configuration tips, or debugging steps would be greatly appreciated!

Thanks!

NVIDIA VSS is deployed in Event Reviewer mode without the integrated CV pipeline guardrails or RAG.
Once VLM processing initiates, the CV pipeline ceases real-time operation. The absence of workload isolation between VSS and CV components results in resource starvation for the CV pipeline.
VLM inference is highly compute-intensive and monopolizes GPU resources, leading to real-time CV processing failures.
We attempted to mitigate this by using NVIDIA Multi-Process Service (MPS) with time-slicing for GPU scheduling, but this did not resolve the contention: resource conflicts persist whenever VLM workloads are active.
Could you recommend alternative technologies or strategies to ensure workload isolation and maintain real-time CV performance under these conditions?
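For reference, MPS also offers a per-client cap on the share of SMs a process may use, set through the `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE` environment variable of the client process. The sketch below shows how such a cap could be applied when launching a process from Python. It assumes that an MPS control daemon is already running and that MPS is supported on the platform in question, neither of which is confirmed in this thread; `mps_limited_env` and `launch_with_sm_cap` are hypothetical helper names.

```python
import os
import subprocess


def mps_limited_env(active_thread_percentage, base_env=None):
    """Return a copy of the environment with an MPS SM cap for the child."""
    if not 0 < active_thread_percentage <= 100:
        raise ValueError("active_thread_percentage must be in (0, 100]")
    env = dict(os.environ if base_env is None else base_env)
    # Read by the CUDA runtime of an MPS client when it creates its context.
    env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(active_thread_percentage)
    return env


def launch_with_sm_cap(cmd, active_thread_percentage):
    """Start `cmd` (a list of arguments) as an MPS client with an SM cap."""
    return subprocess.Popen(cmd, env=mps_limited_env(active_thread_percentage))
```

This cap limits compute (SM) share only. It does not partition memory bandwidth or video decode engines, so it may not help if the bottleneck is elsewhere. For a containerized service such as VSS, the variable would have to be set in the container's environment instead.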

Hello!

Thank you for your post!

To help me understand your setup, could you please answer the questions below?

  • Which CV model and tracker are used in your DeepStream pipeline?

  • Which VLM model and checkpoint are being used for VSS?

  • How is VSS being deployed? i.e. helm chart, docker compose, any additional configurations?

  • On average, how many requests are made to VSS at a time?

In the meantime, here are a few configurations from the VSS side that you can try tuning to ease the GPU workload:

Please find responses to some of your questions below:

  1. Debugging:

Debugging: How can I effectively debug this interference? Are there specific profiling tools (Nsight Systems/Compute) or VSS logs that would pinpoint where the bottleneck is occurring (e.g., memory bandwidth, CUDA cores, NVDEC)?

The VSS Health Evaluation feature can be used to see GPU and NVDEC usage during the VSS VLM inference calls. See the "VSS Observability" page of the Video Search and Summarization Agent documentation.
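Independently of VSS, overall GPU load can also be sampled from the host with `nvidia-smi` while the CV application and VSS run together. The following is a minimal sketch under stated assumptions: it uses only the `timestamp`, `utilization.gpu`, `utilization.memory`, and `memory.used` query fields, and the helper names `parse_smi_csv`, `sample_gpu`, and `monitor` are hypothetical.

```python
import csv
import io
import subprocess
import time

QUERY_FIELDS = ["timestamp", "utilization.gpu", "utilization.memory", "memory.used"]


def parse_smi_csv(text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue
        row = dict(zip(QUERY_FIELDS, (cell.strip() for cell in record)))
        for key in QUERY_FIELDS[1:]:
            # A field may be reported as "[N/A]" or be missing; store None then.
            try:
                row[key] = float(row.get(key))
            except (TypeError, ValueError):
                row[key] = None
        rows.append(row)
    return rows


def sample_gpu():
    """Take one sample per GPU (requires nvidia-smi on the PATH)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_smi_csv(out)


def monitor(samples=60, interval_s=1.0):
    """Print one line per GPU per interval for a fixed number of samples."""
    for _ in range(samples):
        for row in sample_gpu():
            print(row)
        time.sleep(interval_s)
```

Running `monitor()` once with only the CV application and once with VSS active gives two traces that can be lined up against the times of the VLM requests.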

  2. Optimization:
  • Are there DeepStream config parameters that help reduce this specific contention?

The VSS config parameters detailed above can help reduce GPU workload from the vLLM.

  • Is the Python API adding significant overhead compared to C++ in this high-load scenario?

The Python API does not add significant overhead compared to C++ in DeepStream applications.
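On the DeepStream side, two `nvinfer` settings that are commonly used to lower the detector's GPU cost are `interval` (the number of consecutive batches to skip between inferences, with the tracker covering the skipped frames) and `network-mode` (0 = FP32, 1 = INT8, 2 = FP16). The sketch below only generates the relevant `[property]` fragment of an `nvinfer` config file. The chosen values are illustrative assumptions, not tuned recommendations for this pipeline, and `build_nvinfer_overrides` is a hypothetical helper.

```python
import configparser
import io


def build_nvinfer_overrides(interval=2, use_fp16=True, batch_size=1):
    """Return a [property] fragment for an nvinfer config file as text."""
    cfg = configparser.ConfigParser()
    cfg.optionxform = str  # keep key names exactly as written
    cfg["property"] = {
        "batch-size": str(batch_size),
        # Skip this many consecutive batches between inference calls.
        "interval": str(interval),
        # 0 = FP32, 1 = INT8, 2 = FP16
        "network-mode": "2" if use_fp16 else "0",
    }
    buf = io.StringIO()
    cfg.write(buf, space_around_delimiters=False)
    return buf.getvalue()
```

The fragment is not a complete config on its own: the model paths and other mandatory keys of the existing file still apply. Whether skipping frames is acceptable depends on how well the tracker bridges the gaps for the scene in question.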

  3. Hardware Upgrade:

Hardware Path: We are considering future hardware upgrades. Would migrating to the NVIDIA AGX Thor platform likely resolve these concurrency issues, or is this purely a software configuration problem?

It’s certainly possible that upgrading the hardware would help. As you mentioned, the 2x L40S setup you tried didn’t have this performance issue, but we can’t comment further on performance expectations for AGX Thor without more details about your CV pipeline and VSS configuration.

Please let me know what does and doesn’t work, and if you have any other questions.

Thank you!

Hi @camilleh,

Thank you for your response. To answer your questions:

1. Which CV model and tracker are used? We are using DeepStream with GStreamer. Initially, we were using the DINO-based “Retail Object Detector,” but we have since identified this as the bottleneck (details below) and switched to a YOLO-based model.

2. Which VLM model and checkpoint are being used for VSS? We are using the openai-compat endpoint with Qwen3-VL-8B-Instruct.

3. How is VSS being deployed? VSS is deployed via Docker, as this is currently the only supported method for the “Event Reviewer” mode.

4. On average, how many requests are made to VSS at a time? The load is relatively low; on average, there is 1 request every 15 seconds.

Update on the issue: We will definitely look into the configuration options and advice you suggested for further optimization. However, we have successfully resolved the high latency and GPU contention by replacing the object detection model in our CV application.

It turned out that the DINO-based “Retail Object Detector” was too resource-intensive for our current setup. After switching the detector to YOLO, the GPU load for the CV application dropped significantly to around 4%. The application now runs smoothly with no delays. Additionally, we launched nvClip as a service, and it is running without introducing any contention. We kept VSS in its original configuration, and it is now working correctly alongside the lighter CV pipeline.

Thanks again for your support!