We have servers with NVIDIA graphics cards on which the MPS daemons are running. The same servers run Nomad, which launches Docker containers. Inside each container a process starts that connects to the MPS daemon through its socket. This whole scheme works: for example, we can limit GPU memory consumption through the CUDA_MPS_PINNED_DEVICE_MEM_LIMIT variable. In total we have about 20 such containers running across 4 different servers.

Everything works fine until, at some random moment, one of the applications disconnects and can no longer restart, crashing with a "Segmentation Fault" error. In the daemon logs we see this:
Mar 02 07:15:46 recognition-02 nvidia-cuda-mps-control[1875]: [2023-03-02 07:15:46.489 Other 3675] Volta MPS Server: Received new client request
Mar 02 07:15:46 recognition-02 nvidia-cuda-mps-control[1875]: [2023-03-02 07:15:46.489 Other 3675] MPS Server: worker created
Mar 02 07:15:46 recognition-02 nvidia-cuda-mps-control[1875]: [2023-03-02 07:15:46.489 Other 3675] Volta MPS: Creating worker thread
Mar 02 07:15:46 recognition-02 nvidia-cuda-mps-control[1875]: [2023-03-02 07:15:46.576 Other 3675] Receive command failed, assuming client exit
Mar 02 07:15:46 recognition-02 nvidia-cuda-mps-control[1875]: [2023-03-02 07:15:46.576 Other 3675] Volta MPS: Client process disconnected.
At the same time, another container connected to the same MPS daemon continues to work correctly. Restarting the container that hit the segmentation fault does not help: after that it is impossible to start it again at all. The only thing that helps is restarting the MPS daemon that the container is configured to use. Moreover, this happens randomly on all 4 servers and in random Docker containers. The servers run driver version 510.39.01 with CUDA 11.6 (nvidia-smi reports: NVIDIA-SMI 510.39.01, Driver Version: 510.39.01, CUDA Version: 11.6). Can you suggest what we could try?
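For context, here is a minimal sketch of what the client side inside a container does (illustrative only, not our production code; we assume CUDA_MPS_PIPE_DIRECTORY points at the daemon's socket directory mounted into the container and CUDA_MPS_PINNED_DEVICE_MEM_LIMIT is set in the container environment). Creating the CUDA context is the step that produces the "Received new client request" / "worker created" lines in the daemon log above.

// Minimal, illustrative sketch of a container-side MPS client (not our production code).
// Assumption: CUDA_MPS_PIPE_DIRECTORY points at the daemon's socket directory mounted
// into the container, and CUDA_MPS_PINNED_DEVICE_MEM_LIMIT is set in the environment.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    // Context creation is where the process handshakes with the MPS server;
    // this is the step that produces the "Received new client request" /
    // "worker created" lines in the daemon log.
    cudaError_t err = cudaFree(0);  // forces lazy CUDA context initialization
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA init via MPS failed: %s\n", cudaGetErrorString(err));
        return EXIT_FAILURE;
    }

    size_t free_mem = 0, total_mem = 0;
    err = cudaMemGetInfo(&free_mem, &total_mem);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
        return EXIT_FAILURE;
    }
    // Our understanding: with CUDA_MPS_PINNED_DEVICE_MEM_LIMIT set, the totals
    // reported here reflect the per-client limit rather than the full GPU.
    std::printf("MPS client sees %zu MiB free of %zu MiB total\n",
                free_mem >> 20, total_mem >> 20);
    return EXIT_SUCCESS;
}

In our case the segmentation fault seems to occur around this context-creation step, which matches the "worker created" line being followed almost immediately by "Client process disconnected" in the log.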