Sudden segmentation fault when using MPS daemon

We have a server with NVIDIA graphics cards and have installed MPS daemons on it. We also have Nomad installed on the same server, which runs Docker containers. Inside the Docker containers, a process is launched that connects to the MPS daemon through a socket. This entire scheme works; for example, we can limit memory consumption through the CUDA_MPS_PINNED_DEVICE_MEM_LIMIT variable. In total we have about 20 containers running on 4 different servers. They all work, but at some random moment one of the applications disconnects and cannot restart, crashing with a “Segmentation Fault” error. In the daemon logs we see this:

Mar 02 07:15:46 recognition-02 nvidia-cuda-mps-control[1875]: [2023-03-02 07:15:46.489 Other 3675] Volta MPS Server: Received new client request
Mar 02 07:15:46 recognition-02 nvidia-cuda-mps-control[1875]: [2023-03-02 07:15:46.489 Other 3675] MPS Server: worker created
Mar 02 07:15:46 recognition-02 nvidia-cuda-mps-control[1875]: [2023-03-02 07:15:46.489 Other 3675] Volta MPS: Creating worker thread
Mar 02 07:15:46 recognition-02 nvidia-cuda-mps-control[1875]: [2023-03-02 07:15:46.576 Other 3675] Receive command failed, assuming client exit
Mar 02 07:15:46 recognition-02 nvidia-cuda-mps-control[1875]: [2023-03-02 07:15:46.576 Other 3675] Volta MPS: Client process disconnected.

At the same time, another container connected to the same MPS daemon continues to work correctly. Restarting the container that hit the segmentation fault does not help; it can no longer be started at all after that. The only thing that helps is restarting the MPS daemon that the container is configured to use. Moreover, this happens randomly on all 4 servers and in random Docker containers. The servers run this driver: NVIDIA-SMI 510.39.01, Driver Version 510.39.01, CUDA Version 11.6. Can you suggest what we can try?

  1. A seg fault always occurs as a result of a specific line of host code. You could use a tool such as a debugger (gdb or cuda-gdb) or valgrind to identify the specific line of code that is causing the seg fault, and then inspect the data used by that line to see whether it obviously explains the crash (such as an invalid pointer). Checking the return status of every CUDA runtime call can also narrow down which call fails first; see the first sketch after this list.

  2. Bugs in software provided by NVIDIA are always possible. You could try updating the GPU driver, or the driver and CUDA version together, to something newer and see if the problem persists.

  3. Develop a short, self-contained, complete test case that reproduces the issue, which you could post on a public forum to see if anyone can help or spot something; a minimal reproducer sketch is given after this list.

  4. Check system logs for other errors occurring around the same time (for example, NVIDIA Xid errors in dmesg). Inspect other system parameters such as GPU and system temperatures, power draw, etc. to see if anything is out of the normal operating range.

  5. Review the MPS documentation for any specifics that may be important when using it with Docker, for example the use of --ipc=host.
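
Regarding item 1, here is a minimal sketch of the kind of instrumentation that helps localize the failure, assuming the application is C++/CUDA host code; the CUDA_CHECK macro name is illustrative, not part of any library. If the MPS client connection is broken, an early runtime call will usually report an error here instead of the process dying later with a bare segmentation fault.

    // Minimal sketch (assumption: a C++/CUDA host application built with nvcc).
    // Wrap every CUDA runtime call so the first failing call, and its source
    // line, is reported explicitly.
    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    #define CUDA_CHECK(call)                                                   \
        do {                                                                   \
            cudaError_t err__ = (call);                                        \
            if (err__ != cudaSuccess) {                                        \
                std::fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                             cudaGetErrorString(err__), __FILE__, __LINE__);   \
                std::exit(EXIT_FAILURE);                                       \
            }                                                                  \
        } while (0)

    int main() {
        void *buf = nullptr;
        // Each of these calls goes through the MPS client; a failure is
        // reported with its exact location instead of a silent crash.
        CUDA_CHECK(cudaSetDevice(0));
        CUDA_CHECK(cudaMalloc(&buf, 1 << 20));
        CUDA_CHECK(cudaFree(buf));
        return 0;
    }

Running the instrumented application under gdb or cuda-gdb then gives you both the failing CUDA call and the host backtrace at the point of the crash.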
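
Regarding item 3, a reproducer does not need to resemble your real workload; something like the hypothetical sketch below (buffer size and kernel are arbitrary) is enough to exercise one full MPS client connect/run/disconnect cycle. Building it and running it repeatedly inside the affected container tells you whether a fresh client can still attach to the daemon after one of your applications has crashed.

    // Hypothetical minimal reproducer: allocate, run a trivial kernel, exit
    // cleanly. Each run is one MPS client connect/disconnect cycle.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void touch(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = static_cast<float>(i);
    }

    int main() {
        const int n = 1 << 20;
        float *d = nullptr;
        cudaError_t err = cudaMalloc(&d, n * sizeof(float));
        if (err != cudaSuccess) {
            std::fprintf(stderr, "cudaMalloc failed: %s\n",
                         cudaGetErrorString(err));
            return 1;
        }
        touch<<<(n + 255) / 256, 256>>>(d, n);
        err = cudaDeviceSynchronize();
        std::printf("kernel: %s\n", cudaGetErrorString(err));
        cudaFree(d);
        return err == cudaSuccess ? 0 : 1;
    }

If this small program also fails only after the daemon has gotten into the bad state, that isolates the problem to the MPS daemon/driver side rather than to your application code, and it gives you something compact to attach to a bug report.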