High Latency and GPU Contention when running DeepStream (Python) + VSS on DGX Platform

piotr.kasierski · January 9, 2026, 12:50pm

Hi everyone,

I am running an NVIDIA DeepStream pipeline using Python bindings on an NVIDIA DGX platform. I am encountering significant performance degradation and resource contention, and I hope to get some insight into the cause.

The Setup:

Platform: NVIDIA DGX Spark
Software: DeepStream SDK (Python bindings)
Input: Single 1280x720 RTSP stream
Pipeline: RTSP Source → Decoder → nvinfer → Tracker → OSD → Sink
Additional Service: NVIDIA VSS (Video Search and Summarization)
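
For readers who want to reproduce the setup, the pipeline listed above can be approximated with a gst-launch style description. This is only a sketch, not the application's actual code: the RTSP URI and config paths are placeholders, `uridecodebin` stands in for the source and decoder stages, `fakesink` stands in for the real sink, and the `nvstreammux` properties assume the legacy muxer.

```python
def build_pipeline_description(rtsp_uri, infer_config, tracker_lib, tracker_config):
    """Return a gst-launch style description of the pipeline listed above.

    All arguments are placeholders for the reader's own URI and config paths.
    """
    elements = [
        # Source + decoder: uridecodebin selects the depayloader and decoder.
        f"uridecodebin uri={rtsp_uri}",
        # DeepStream needs a stream muxer in front of nvinfer, even for one stream.
        # width/height/live-source apply to the legacy nvstreammux (assumption).
        "m.sink_0 nvstreammux name=m batch-size=1 width=1280 height=720 live-source=1",
        f"nvinfer config-file-path={infer_config}",
        f"nvtracker ll-lib-file={tracker_lib} ll-config-file={tracker_config}",
        "nvvideoconvert",
        "nvdsosd",
        # Placeholder sink; sync=false avoids clock-based throttling at the sink.
        "fakesink sync=false",
    ]
    return " ! ".join(elements)
```

The resulting string can be passed to `Gst.parse_launch()` from the GStreamer Python bindings on a machine with DeepStream installed.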

The Problem: We are observing a drastic difference in performance depending on whether the VSS service is active.

Scenario A (CV App Only): When running the Computer Vision application in isolation, everything works perfectly. The pipeline is smooth, and there is no lag.
- See attached screenshots for SM load and htop in this scenario (captured 2026-01-09 at 13:21 and 13:23).
Scenario B (CV App + VSS): As soon as we enable NVIDIA VSS alongside the CV application, the performance degrades significantly. The application becomes very slow, and we observe massive delays and accumulated latency in stream processing. The GPU compute load spikes disproportionately (often hitting ~90% utilization or causing bottlenecks), even though the scene complexity hasn’t changed.
- See attached screenshots for SM load and htop in this scenario (captured 2026-01-09 at 13:34 and 13:35).

Comparison with other hardware: Interestingly, we tested this exact same application configuration on a Brev.dev instance equipped with 2x L40S GPUs, and it worked correctly without these performance penalties. This suggests the issue might be specific to the DGX environment configuration or resource management between VSS and DeepStream on this specific hardware.

My Questions:

Root Cause Analysis: What could be causing such high contention between VSS and the DeepStream Python app on the DGX platform specifically? It feels like they are fighting for the same resources despite the hardware being powerful.
Debugging: How can I effectively debug this interference? Are there specific profiling tools (Nsight Systems/Compute) or VSS logs that would pinpoint where the bottleneck is occurring (e.g., memory bandwidth, CUDA cores, NVDEC)?
Optimization:
- Are there DeepStream config parameters that help reduce this specific contention?
- Is the Python API adding significant overhead compared to C++ in this high-load scenario?
Hardware Path: We are considering future hardware upgrades. Would migrating to the NVIDIA AGX Thor platform likely resolve these concurrency issues, or is this purely a software configuration problem?

Any guidance, configuration tips, or debugging steps would be greatly appreciated!

Thanks!

lukasz.pawlik · January 13, 2026, 11:00am

NVIDIA VSS is deployed in Event Reviewer mode without the integrated CV pipeline guardrails or RAG.
Once VLM processing initiates, the CV pipeline ceases real-time operation. The absence of workload isolation between VSS and CV components results in resource starvation for the CV pipeline.
VLM inference is highly compute-intensive and monopolizes GPU resources, leading to real-time CV processing failures.
We attempted to mitigate this by leveraging NVIDIA Multi-Process Service (MPS) with time-slicing for GPU scheduling; however, this approach did not resolve the contention—resource conflicts persist when VLM workloads are active.
Could you recommend alternative technologies or strategies to ensure workload isolation and maintain real-time CV performance under these conditions?

camilleh · January 22, 2026, 7:28pm

Hello!

Thank you for your post!

To help us understand your setup, could you please answer the questions below?

Which CV model and tracker are used in your DeepStream pipeline?
Which VLM model and checkpoint are being used for VSS?
How is VSS being deployed? i.e. helm chart, docker compose, any additional configurations?
On average, how many requests are made to VSS at a time?

In the meantime, here are a few configurations from the VSS side that you can try tuning to ease the GPU workload:

VLLM_GPU_MEMORY_UTILIZATION:

Fraction of GPU memory for vLLM. See the “VSS Deployment-Time Configuration Glossary” page of the Video Search and Summarization Agent documentation.

For single GPU docker compose local deployment, this is set by default to 0.3 in the .env file and can be modified. See video-search-and-summarization/deploy/docker/local_deployment_single_gpu/.env on the main branch of the NVIDIA-AI-Blueprints/video-search-and-summarization GitHub repository.
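
As an illustration, the value can also be changed with a small script instead of by hand. This is a minimal sketch: `set_env_value` is a hypothetical helper (not part of VSS), it only handles plain `KEY=value` lines, and any new value such as 0.2 is just an example to tune for your setup.

```python
def set_env_value(env_text, key, value):
    """Return env_text with KEY=value set, appending the key if it is absent.

    Only plain "KEY=value" lines are matched; commented-out lines are left alone.
    """
    lines = env_text.splitlines()
    found = False
    for i, line in enumerate(lines):
        name, sep, _ = line.partition("=")
        # A commented line such as "# KEY=0.3" has name "# KEY" and does not match.
        if sep and name.strip() == key:
            lines[i] = f"{key}={value}"
            found = True
    if not found:
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"
```

To apply it, read the deployment's `.env` file into a string, call `set_env_value(text, "VLLM_GPU_MEMORY_UTILIZATION", "0.2")`, write the result back, and redeploy so the new value is picked up.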

For helm deployment, please follow the steps in the “Deploy Using Helm” page of the Video Search and Summarization Agent documentation to add an override.yaml file with the configuration. The same guide includes an example overrides file for deploying the VSS helm chart on a single GPU.

Note: theoretically, you can set the value as low as 0, but setting it too low will cause vLLM to fail at runtime because there won’t be enough GPU memory to load the model weights and allocate KV cache blocks. In that case you will see an error such as “Not enough memory to allocate KV cache”.

A practical lower bound depends on your model size and GPU memory.
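
To make that concrete, here is a rough back-of-the-envelope sketch. Both helpers are hypothetical, and the KV cache and overhead figures are assumptions you would need to estimate for your own model and workload; vLLM’s real memory accounting is more detailed than this.

```python
GIB = 2**30


def weights_gib(num_params, bytes_per_param):
    """Approximate size of the model weights in GiB."""
    return num_params * bytes_per_param / GIB


def min_gpu_memory_fraction(weights, kv_cache, overhead, total):
    """Smallest fraction of total GPU memory covering the given needs.

    All arguments are in GiB. kv_cache and overhead are caller-supplied
    estimates, not values reported by vLLM.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return (weights + kv_cache + overhead) / total


# Illustrative numbers only: 8e9 parameters at 2 bytes each, with assumed
# 4 GiB of KV cache and 2 GiB of overhead on a device with 128 GiB.
example_fraction = min_gpu_memory_fraction(weights_gib(8e9, 2), 4, 2, 128)
```

Setting VLLM_GPU_MEMORY_UTILIZATION below a fraction estimated this way is likely to trigger the KV cache error mentioned above.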
Input vision token length (i.e. VLM_DEFAULT_NUM_FRAMES_PER_CHUNK, VLM_INPUT_WIDTH, VLM_INPUT_HEIGHT): see the “VSS Customization” page of the Video Search and Summarization Agent documentation.
If you are using the VSS API directly, the maximum number of output tokens can also be configured through the max_tokens parameter (see the same “VSS Customization” page).
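
To see why the frame count and input resolution matter, the following sketch estimates how the number of vision tokens per chunk scales. It is a model-agnostic approximation: `pixels_per_token_side` is a hypothetical parameter (the real value depends on the VLM’s patch size and any token merging), so use it only to compare settings against each other.

```python
import math


def estimate_vision_tokens(num_frames, width, height, pixels_per_token_side):
    """Rough vision-token count for one chunk.

    Assumes each frame is split into square tiles of pixels_per_token_side
    pixels and that each tile becomes one token (an assumption; real VLMs
    may also merge tokens across frames).
    """
    tokens_per_frame = (
        math.ceil(width / pixels_per_token_side)
        * math.ceil(height / pixels_per_token_side)
    )
    return num_frames * tokens_per_frame
```

Under this approximation, halving both VLM_INPUT_WIDTH and VLM_INPUT_HEIGHT cuts the token count to roughly a quarter, while halving VLM_DEFAULT_NUM_FRAMES_PER_CHUNK cuts it in half.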

Please find responses to some of your questions below:

Debugging:

Debugging: How can I effectively debug this interference? Are there specific profiling tools (Nsight Systems/Compute) or VSS logs that would pinpoint where the bottleneck is occurring (e.g., memory bandwidth, CUDA cores, NVDEC)?

The VSS Health Evaluation feature can be used to see GPU and NVDEC usage during VSS VLM inference calls. See the “VSS Observability” page of the Video Search and Summarization Agent documentation.
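
Independently of VSS, one simple way to compare Scenario A and Scenario B is to log nvidia-smi utilization samples for each run and summarize them. The sketch below only parses such a log; `parse_samples` and `summarize` are hypothetical helpers, and rows where a field is reported as `[N/A]` are skipped.

```python
import statistics

# Command that produces the log format parsed below; append "-l 1" to sample
# once per second and redirect the output to a file for each scenario.
QUERY_CMD = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,utilization.memory",
    "--format=csv,noheader,nounits",
]


def parse_samples(csv_text):
    """Parse 'gpu_util, mem_util' rows into a list of float pairs."""
    samples = []
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 2:
            continue
        try:
            samples.append((float(fields[0]), float(fields[1])))
        except ValueError:
            continue  # non-numeric row, e.g. "[N/A]"
    return samples


def summarize(samples):
    """Return mean and peak GPU utilization (percent) over the samples."""
    gpu_util = [sample[0] for sample in samples]
    return {"mean_gpu": statistics.mean(gpu_util), "max_gpu": max(gpu_util)}
```

Comparing the summaries of a CV-only run and a CV + VSS run shows how much of the utilization jump coincides with VLM requests.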

Optimization:

Are there DeepStream config parameters that help reduce this specific contention?

The VSS config parameters detailed above can help reduce the GPU workload from vLLM.

Is the Python API adding significant overhead compared to C++ in this high-load scenario?

The Python API does not add significant overhead compared to C++ in DeepStream applications.

Hardware Upgrade:

Hardware Path: We are considering future hardware upgrades. Would migrating to the NVIDIA AGX Thor platform likely resolve these concurrency issues, or is this purely a software configuration problem?

It’s certainly possible that upgrading the hardware would help. As you mentioned, the 2x L40S setup you tried didn’t have this performance issue, but we can’t comment further on performance expectations for AGX Thor without more details about your CV pipeline and VSS configuration.

Please let me know what does and doesn’t work, and if you have any other questions.

Thank you!

piotr.kasierski · January 23, 2026, 11:25am

Hi @camilleh,

Thank you for your response. To answer your questions:

1. Which CV model and tracker are used? We are using DeepStream with GStreamer. Initially, we were using the DINO-based “Retail Object Detector,” but we have since identified this as the bottleneck (details below) and switched to a YOLO-based model.

2. Which VLM model and checkpoint are being used for VSS? We are using the openai-compat endpoint with Qwen3-VL-8B-Instruct.

3. How is VSS being deployed? VSS is deployed via Docker, as this is currently the only supported method for the “Event Reviewer” mode.

4. On average, how many requests are made to VSS at a time? The load is relatively low; on average, there is 1 request every 15 seconds.

Update on the issue: We will definitely look into the configuration options and advice you suggested for further optimization. However, we have successfully resolved the high latency and GPU contention by replacing the object detection model in our CV application.

It turned out that the DINO-based “Retail Object Detector” was too resource-intensive for our current setup. After switching the detector to YOLO, the GPU load for the CV application dropped significantly to around 4%. The application now runs smoothly with no delays. Additionally, we launched nvClip as a service, and it is running without introducing any contention. We kept VSS in its original configuration, and it is now working correctly alongside the lighter CV pipeline.

Thanks again for your support!
