Sudden segmentation fault when using MPS daemon

valafon · March 2, 2023, 12:36pm

We have a server with NVIDIA graphics cards. We have installed MPS daemons on it. We also have Nomad installed on the same server, which runs Docker containers. Inside the Docker containers, a process is launched that connects to the MPS daemon through a socket. This entire scheme works, for example, we can limit memory consumption through the CUDA_MPS_PINNED_DEVICE_MEM_LIMIT variable. Thus, we have about 20 containers running on 4 different servers. They all work, but suddenly, at some random moment, one of the applications disconnects and cannot restart, crashing with a “Segmentation Fault” error. In the daemon logs, we see this -

Mar 02 07:15:46 recognition-02 nvidia-cuda-mps-control[1875]: [2023-03-02 07:15:46.489 Other 3675] Volta MPS Server: Received new client request
Mar 02 07:15:46 recognition-02 nvidia-cuda-mps-control[1875]: [2023-03-02 07:15:46.489 Other 3675] MPS Server: worker created
Mar 02 07:15:46 recognition-02 nvidia-cuda-mps-control[1875]: [2023-03-02 07:15:46.489 Other 3675] Volta MPS: Creating worker thread
Mar 02 07:15:46 recognition-02 nvidia-cuda-mps-control[1875]: [2023-03-02 07:15:46.576 Other 3675] Receive command failed, assuming client exit
Mar 02 07:15:46 recognition-02 nvidia-cuda-mps-control[1875]: [2023-03-02 07:15:46.576 Other 3675] Volta MPS: Client process disconnected.

At the same time, another container connected to the same MPS daemon continues to work correctly. Restarting the container that encountered the segmentation fault error does not help, it is already impossible to start it again after that. The only thing that helps is to restart the MPS daemon, which the container is configured to use. Moreover, this happens randomly on all 4 servers and on random Docker containers. The servers have this driver - NVIDIA-SMI 510.39.01 Driver Version: 510.39.01 CUDA Version: 11.6. Can you suggest what we can try to do?

Robert_Crovella · March 6, 2023, 10:11pm

A seg fault always occurs as a result of a specific line of host code. You could use tools such as a debugger or valgrind to identify a specific line of code that is causing the seg fault, and then see if you can inspect any of the data being used by that line of code to see if it might obviously imply a seg fault (such as an invalid pointer).
bugs in software provided by NVIDIA are always possible. You could try updating the GPU driver, or the GPU driver and CUDA version, to something newer, and see if the problem persists.
Develop a self contained, short, complete test case that reproduces the issue, that would allow you to post on a public forum to see if anyone can help or spot something.
Check system logs for other errors that may be occurring around the same time. Inspect other system parameters such as GPU or other system temperatures, power draw, etc. to see if anything seems to be out of normal operating range.
Review the MPS documentation for any specifics that may be important when using it with docker, for example the usage of --ipc=host.

Topic		Replies	Views
Get a Segmentation fault in MPS CUDA Programming and Performance cuda	0	80	January 22, 2025
Mps server meet sticky error when one process have segmentation fault Jetson Thor cuda	7	92	November 20, 2025
SIGSEGV in libcuda.so.1 when MPS enabled CUDA Programming and Performance	0	666	November 9, 2021
Influence of MPS on Applications CUDA Programming and Performance	0	324	April 26, 2022
Segmentation Fault after rebooting CUDA Programming and Performance	2	1392	August 11, 2011
pre-volta MPS test failed with error: mapping of buffer object failed CUDA Programming and Performance	3	1258	June 13, 2019
MPS on Multi-RTX servers CUDA Programming and Performance	0	369	May 12, 2020
Segmentation fault when running CUDA code CUDA on Windows Subsystem for Linux	2	1457	September 10, 2020
Segmentation Fault, off and on CUDA Programming and Performance	1	745	September 18, 2011
Segmentation fault when using cudaMemcpy CUDA Programming and Performance	0	1900	May 17, 2009

Sudden segmentation fault when using MPS daemon

Related topics