Hi all,
I managed to pinpoint an issue with our DeepStream pipeline: in a multi-GPU environment, some of the pipelines stall at startup.
The setup reserves a single GPU per container and runs several containers, one per GPU / DeepStream instance.
I was able to reproduce the problem without any code of ours, with the following docker-compose file:
services:
  cam0:
    image: nvcr.io/nvidia/deepstream:7.1-triton-multiarch
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["GPU-64955b11-02d1-f07f-93a1-660a632e1a56"]
              capabilities: [compute, utility, video, graphics]
    entrypoint:
      - gst-launch-1.0
      - uridecodebin
      - uri=rtsp://alfred:alfred1326@10.22.56.200:554/cam/realmonitor?channel=1&subtype=0
      - name=srcbin
      - '!'
      - fakesink
  cam1:
    image: nvcr.io/nvidia/deepstream:7.1-triton-multiarch
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["GPU-40253b63-70d6-fd7f-e0a0-a060dc869950"]
              capabilities: [compute, utility, video, graphics]
    entrypoint:
      - gst-launch-1.0
      - uridecodebin
      - uri=rtsp://alfred:alfred1326@10.22.56.200:554/cam/realmonitor?channel=1&subtype=0
      - name=srcbin
      - '!'
      - fakesink
Change the device IDs and the RTSP URLs, and the problem should appear.
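For reference, the GPU UUIDs to put into device_ids can be listed on the host with:

nvidia-smi -L

and the two services are brought up with a plain docker compose up.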
Running
nvidia-smi dmon
will confirm that only a single GPU is using its hardware decoder.
Swapping in a different sink will confirm that only one of the two pipelines is actually making progress.
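Concretely, this is roughly the check I use (the -s u flag restricts dmon to the utilization columns, including the dec column for the hardware decoder):

nvidia-smi dmon -s u

With both containers up, only one GPU ever shows non-zero dec utilization.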
Doing the same thing in a single Docker container, though, does not cause any problem:
if I run that same image and exec the following two processes simultaneously inside of it:
CUDA_VISIBLE_DEVICES=0 gst-launch-1.0 uridecodebin "uri=rtsp://alfred:alfred1326@10.22.56.200:554/cam/realmonitor?channel=1&subtype=0" ! fakesink
CUDA_VISIBLE_DEVICES=1 gst-launch-1.0 uridecodebin "uri=rtsp://alfred:alfred1326@10.22.56.200:554/cam/realmonitor?channel=1&subtype=0" ! fakesink
everything runs correctly
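For completeness, here is a rough sketch of how I reproduce the working single-container case (the container name ds-test and the sleep entrypoint are just placeholders for illustration):

docker run -d --name ds-test --gpus all nvcr.io/nvidia/deepstream:7.1-triton-multiarch sleep infinity
docker exec -d ds-test bash -c 'CUDA_VISIBLE_DEVICES=0 gst-launch-1.0 uridecodebin "uri=rtsp://alfred:alfred1326@10.22.56.200:554/cam/realmonitor?channel=1&subtype=0" ! fakesink'
docker exec -d ds-test bash -c 'CUDA_VISIBLE_DEVICES=1 gst-launch-1.0 uridecodebin "uri=rtsp://alfred:alfred1326@10.22.56.200:554/cam/realmonitor?channel=1&subtype=0" ! fakesink'

In this configuration both GPUs show decoder activity in nvidia-smi dmon.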
For now we plan to simply switch to a single container, but we find it slightly unsatisfying to have to do that, when Docker expressly encourages a single process per container…
Can you confirm the buggy behavior? Is there any chance it could be fixed, assuming the problem lies somewhere in your IP?