High Latency and GPU Contention when running DeepStream (Python) + VSS on DGX Platform

Hi everyone,

I am running an NVIDIA DeepStream pipeline using the Python bindings on an NVIDIA DGX platform, and I am encountering significant performance degradation and resource contention that I hope to get some insight on.

The Setup:

  • Platform: NVIDIA DGX Spark

  • Software: DeepStream SDK (Python bindings)

  • Input: Single 1280x720 RTSP stream

  • Pipeline: RTSP Source → Decoder → nvinfer → Tracker → OSD → Sink

  • Additional Service: NVIDIA VSS (Video Search and Summarization)

The Problem: We are observing a drastic difference in performance depending on whether the VSS service is active.

  1. Scenario A (CV App Only): When running the Computer Vision application in isolation, everything works perfectly. The pipeline is smooth, and there is no lag.

  2. Scenario B (CV App + VSS): As soon as we enable NVIDIA VSS alongside the CV application, performance degrades significantly. The application becomes very slow, and we observe large, accumulating latency in stream processing. GPU compute load spikes disproportionately (often hitting ~90% utilization), even though the scene complexity hasn't changed.

Comparison with other hardware: Interestingly, the same application configuration runs without these penalties on a Brev.dev instance equipped with 2x L40S GPUs. This suggests the issue is specific to the DGX environment configuration, or to how resources are shared between VSS and DeepStream on this hardware.

My Questions:

  1. Root Cause Analysis: What could be causing such high contention between VSS and the DeepStream Python app on the DGX platform specifically? It feels like they are fighting for the same resources despite the hardware being powerful.

  2. Debugging: How can I effectively debug this interference? Are there specific profiling tools (Nsight Systems/Compute) or VSS logs that would pinpoint where the bottleneck is occurring (e.g., memory bandwidth, CUDA cores, NVDEC)?
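To make the debugging question concrete, here is the first-pass diagnostic workflow I have been using so far (tool names are standard NVIDIA utilities; `deepstream_app.py` is a placeholder for our actual entry point, and exact flags may vary by driver/CUDA version). Happy to share any of this output if it helps:

```shell
# Per-engine utilization over time: the dmon columns show SM (sm),
# memory (mem), encoder (enc), and decoder (dec) load each second,
# which separates a CUDA-core bottleneck from an NVDEC bottleneck.
nvidia-smi dmon -s um -d 1

# Which processes are resident on the GPU and how much memory each holds,
# to confirm both VSS and the DeepStream app land on the same device.
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv

# System-wide timeline capture with Nsight Systems: CUDA kernels, NVTX
# ranges, and OS runtime calls. Gaps or long waits on the DeepStream
# timeline while VSS is active would indicate scheduling contention.
nsys profile --trace=cuda,nvtx,osrt -o deepstream_vss_contention \
    python3 deepstream_app.py
```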

  3. Optimization:

    • Are there DeepStream config parameters that help reduce this specific contention?

    • Is the Python API adding significant overhead compared to C++ in this high-load scenario?

  4. Hardware Path: We are considering future hardware upgrades. Would migrating to the NVIDIA Jetson AGX Thor platform likely resolve these concurrency issues, or is this purely a software configuration problem?

Any guidance, configuration tips, or debugging steps would be greatly appreciated!

Thanks!

Additional context: NVIDIA VSS is deployed in Event Reviewer mode, without the integrated CV pipeline, guardrails, or RAG. Once VLM processing starts, the CV pipeline stops operating in real time: there is no workload isolation between the VSS and CV components, so the CV pipeline is starved of resources. VLM inference is highly compute-intensive and monopolizes the GPU while it runs.
We attempted to mitigate this with NVIDIA Multi-Process Service (MPS) using time-slicing for GPU scheduling, but this did not resolve the contention; resource conflicts persist whenever VLM workloads are active.
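For reference, the MPS setup we tried was along these lines (a sketch: the percentage cap and the service launch line are illustrative, not our exact deployment). Even with the VLM's SM share capped via `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE`, the CV pipeline still fell out of real time:

```shell
# Start the MPS control daemon on the target GPU
# (assumes exclusive access to device 0).
export CUDA_VISIBLE_DEVICES=0
nvidia-cuda-mps-control -d

# Cap the fraction of SMs the VLM/VSS service may occupy.
# The 40% value is illustrative; we experimented with several caps.
CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=40 ./launch_vss_service.sh

# The DeepStream CV pipeline runs uncapped alongside it.
python3 deepstream_app.py
```

Note that `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE` limits SM occupancy per client but does not partition memory bandwidth or NVDEC, which may be why the contention persists.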
Could you recommend alternative technologies or strategies to ensure workload isolation and maintain real-time CV performance under these conditions?