I have an application with two processes.
One processes and composites video coming from one or more sources on an FPGA.
It uses cuStreamWaitValue32 to get signaled by the FPGA when data (transferred via DMA) is available to be processed.
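Roughly, the wait looks like this (a simplified sketch, not my actual code — context setup and the mapping of the FPGA-writable flag are elided):

```c
#include <cuda.h>

CUdeviceptr d_flag;   /* 32-bit flag in GPU-visible memory, written by the FPGA via DMA */
CUstream    stream;

/* ... context/stream creation and mapping of d_flag elided ... */

/* Block all work subsequently enqueued on `stream` until *d_flag >= 1. */
cuStreamWaitValue32(stream, d_flag, 1, CU_STREAM_WAIT_VALUE_GEQ);

/* Kernels launched on `stream` after this point only run once the
 * FPGA has signaled by writing the flag. */
```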
Each source has its own processing thread, which signals via IPC when a frame is fully processed and available for downstream consumers.
The consumer runs in another thread and consumes those frames, e.g. for encoding or inference.
Each of these runs in a separate process and in a different Docker container.
All works fine and it can run for hours/days without issue.
To utilize the GPU better, I want to run with MPS, so that all kernel calls appear to come from one context (the MPS server) and are therefore allowed to overlap instead of being time-sliced.
After boot I run:

nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
nvidia-cuda-mps-control -d
I start the Docker containers with --ipc=host and -v /tmp/nvidia-mps:/tmp/nvidia-mps, so both containers can talk to the MPS server on the host.
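The run command looks roughly like this (a sketch assuming the default MPS pipe directory; the image name is a placeholder, and setting CUDA_MPS_PIPE_DIRECTORY explicitly is illustrative):

```shell
docker run --rm \
  --gpus all \
  --ipc=host \
  -v /tmp/nvidia-mps:/tmp/nvidia-mps \
  -e CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps \
  my-video-pipeline
```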
I can see the host MPS server spawn, and the processes running.
But after anywhere from ~10 seconds to 5 minutes, the entire pipeline and the GPU freeze, and not even nvidia-smi returns from a call.
Without MPS it all works fine.
There was nothing interesting in the logs under /var/nvidia-mps/ (control.log or server.log).
Has anyone experienced something similar, or does anyone know what the culprit could be?
Does MPS just not work with Docker?
It seems like something might be leaking or deadlocking…