I have an application with two processes.
One processes and composites video coming from one or more sources on an FPGA.
It uses cuStreamWaitValue32 to get signaled by the FPGA when data (transferred via DMA) is available to be processed.
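Roughly, the wait looks like this (a simplified sketch, not my actual code — context setup and the mapping of the FPGA-writable flag are elided):

```c
#include <cuda.h>

CUdeviceptr d_flag;   /* 32-bit flag in GPU-visible memory, written by the FPGA via DMA */
CUstream    stream;

/* ... context/stream creation and mapping of d_flag elided ... */

/* Block all work subsequently enqueued on `stream` until *d_flag >= 1. */
cuStreamWaitValue32(stream, d_flag, 1, CU_STREAM_WAIT_VALUE_GEQ);

/* Kernels launched on `stream` after this point only run once the
 * FPGA has signaled by writing the flag. */
```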
Each source has its own processing thread, which signals via IPC when a frame is fully processed and available for downstream consumers.
The consumer runs in another thread and consumes those frames, e.g. for encoding or inference.
Each of these runs in a separate process and in a different Docker container.
All works fine and it can run for hours/days without issue.
To utilize the GPU better, I want to run with MPS, so that all kernel calls appear to come from one context (the MPS server) and are therefore allowed to overlap instead of being time-sliced.
After boot I run:

nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
nvidia-cuda-mps-control -d
I start the Docker containers with --ipc=host and -v /tmp/nvidia-mps:/tmp/nvidia-mps, so both containers can talk to the MPS server on the host.
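The run command looks roughly like this (a sketch assuming the default MPS pipe directory; the image name is a placeholder, and setting CUDA_MPS_PIPE_DIRECTORY explicitly is illustrative):

```shell
docker run --rm \
  --gpus all \
  --ipc=host \
  -v /tmp/nvidia-mps:/tmp/nvidia-mps \
  -e CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps \
  my-video-pipeline
```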
I can see the host MPS server spawn, and the processes running.
But after anywhere from ~10 seconds to 5 minutes, the entire pipeline and the GPU freeze, and not even nvidia-smi returns from a call.
Without MPS it all works fine.
There was nothing interesting in the logs under /var/nvidia-mps/ (control.log or server.log).
Has anyone experienced something similar, or does anyone know what the culprit could be?
Does MPS just not work with Docker?
It seems like something might be leaking or deadlocking…