MPS server hits a sticky error when one process has a segmentation fault

Hi

We use the MPS feature on Thor, but the MPS server hits a sticky error if one process has a segmentation fault.

For example, process B is a normal program and process A has a segmentation fault. If we run these two demos under MPS, the MPS server hits a sticky error.
Is there any way to keep the processes from affecting each other?
Attached is a demo that reproduces this issue.
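
For context, the benign client (process B) in the demo is just an ordinary CUDA program. The actual source is inside the attached mps.gz, so the sketch below is only an illustration of what such a client might look like; the kernel and names are assumptions, not the real demo code.

#include <cuda_runtime.h>
#include <iostream>

// Illustrative stand-in for the benign client (process B): a long-running
// but perfectly valid kernel.
__global__ void busyKernel(float *data, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < iters; ++k) {
            v = v * 1.000001f + 0.5f;   // just keep the GPU busy for a while
        }
        data[i] = v;
    }
}

int main() {
    const int n = 1 << 20;
    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    busyKernel<<<(n + 255) / 256, 256>>>(d_data, n, 100000);
    cudaError_t err = cudaDeviceSynchronize();

    // Under MPS this can report an error even though this process did nothing
    // wrong, if another client faulted while the kernel was running.
    std::cout << "Done, err=" << cudaGetErrorString(err) << std::endl;

    cudaFree(d_data);
    return 0;
}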

Steps to reproduce:

  1. ./make.sh
  2. nvidia-cuda-mps-control -f
  3. Run test.sh 2 or 3 times; the MPS server will hit a sticky error.

[2025-11-13 17:01:56.335 Control 154480] NEW CLIENT 154579 from user 2002: Server already exists
[2025-11-13 17:01:56.335 Server 154541] Received new client request for {PID: 154577, Context ID: 1}
[2025-11-13 17:01:56.335 Server 154541] Client {PID: 154579, Context ID: 1} connected
[2025-11-13 17:01:56.335 Server 154541] Creating worker thread for client {PID: 154579, Context ID: 1}
[2025-11-13 17:01:56.335 Server 154541] Device NVIDIA Thor (uuid GPU-a7c66ad2-6dbb-0ab8-c1a2-37ba6dba3600) is associated
[2025-11-13 17:01:56.335 Server 154541] Status of client {PID: 154579, Context ID: 1} is ACTIVE
[2025-11-13 17:01:56.430 Server 154541] Client {PID: 154579, Context: 1} encountered a fatal GPU error.
[2025-11-13 17:01:56.430 Server 154541] Server is handling a fatal GPU error.
[2025-11-13 17:01:56.430 Server 154541] Client {PID: 154577, Context ID: 1} on device 0 to be torn down
[2025-11-13 17:01:56.430 Server 154541] Client {PID: 154579, Context ID: 1} on device 0 to be torn down
[2025-11-13 17:01:56.430 Server 154541] All clients belonging to error trigging process 154577 will be torn down
[2025-11-13 17:01:56.430 Server 154541] Client {PID: 154577, Context ID: 1} on the error-trigging process 154577 to be torn down
[2025-11-13 17:01:56.430 Server 154541] All clients belonging to error trigging process 154579 will be torn down
[2025-11-13 17:01:56.430 Server 154541] Client {PID: 154579, Context ID: 1} on the error-trigging process 154579 to be torn down
[2025-11-13 17:01:56.430 Server 154541] The following devices will be reset:
[2025-11-13 17:01:56.430 Server 154541] 0
[2025-11-13 17:01:56.430 Server 154541] The following client process have a sticky error set:
[2025-11-13 17:01:56.430 Server 154541] 154577
[2025-11-13 17:01:56.430 Server 154541] 154579
[2025-11-13 17:01:56.471 Server 154541] Server failed to recevie command with status 806, assuming client {PID: 154579, Context ID: 1} exit
[2025-11-13 17:01:56.471 Server 154541] Client {PID: 154579, Context ID: 1} exiting. Number of active client contexts is 1.
[2025-11-13 17:01:56.474 Server 154541] Server failed to recevie command with status 806, assuming client {PID: 154579, Context ID: 0} exit
[2025-11-13 17:01:56.474 Server 154541] Client process 154579 disconnected
[2025-11-13 17:01:56.499 Server 154541] Server failed to recevie command with status 806, assuming client {PID: 154577, Context ID: 1} exit
[2025-11-13 17:01:56.499 Server 154541] Client {PID: 154577, Context ID: 1} exiting. Number of active client contexts is 0.
[2025-11-13 17:01:56.499 Server 154541] Destroy server context on device 0 (NVIDIA Thor)
[2025-11-13 17:01:56.737 Server 154541] Server failed to recevie command with status 806, assuming client {PID: 154577, Context ID: 0} exit
[2025-11-13 17:01:56.737 Server 154541] Client process 154577 disconnected

mps.gz (10 KB)

Hi,

Thanks for reporting this.
We will try to reproduce this locally and share more information with you.

Hi,

We tested your sample on our local setup but failed to reproduce the issue.
Could you check if anything is missing in our steps?

  1. Enable MPS (console1)
$ export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps
$ export CUDA_MPS_LOG_DIRECTORY=/tmp/mps
$ sudo -E nvidia-cuda-mps-control -f
[2025-11-17 05:59:25.839 Control 481299] Starting control daemon using socket /tmp/mps/control
[2025-11-17 05:59:25.839 Control 481299] To connect CUDA applications to this daemon, set env CUDA_MPS_PIPE_DIRECTORY=/tmp/mps
[2025-11-17 05:59:25.839 Control 481299] CUDA MPS Control binary version: 13000
  2. Run the test (console2)
$ ./test.sh 
Launching small kernel...
Launching long-running kernel...
Done, err=no error

We repeated the console 2 step around 10 times, but no error is shown.
However, we don’t see the connection logs from console 1.

Is anything missing in our setup?

Thanks.

Hi, Aasta

This is a probabilistic issue. Please replace crash_client.cu with the code below; it makes it very easy to drive the MPS server into a sticky error.

#include <cuda_runtime.h>
#include <iostream>

__global__ void crashKernel() {
    int *ptr = (int*)0xFFFFFFFF;   // illegal device memory address
    ptr[0] = 123;                  // force a GPU fault
}

int main() {
    std::cout << "Launching illegal memory kernel..." << std::endl;

    crashKernel<<<1, 1>>>();
    cudaError_t err = cudaDeviceSynchronize();  // always reports an error, but may block
    std::cout << "err=" << cudaGetErrorString(err) << std::endl;

    return 0;
}
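
For what it's worth, the error produced this way is "sticky" within the faulting process: once the kernel has hit the illegal address, later CUDA calls in the same process keep returning an error until the process exits. A small self-contained variant (an illustration only, not part of the attached demo) that makes this visible:

#include <cuda_runtime.h>
#include <cstdio>

__global__ void faultKernel() {
    *(int *)0xFFFFFFFF = 123;  // same illegal write as crashKernel above
}

int main() {
    faultKernel<<<1, 1>>>();
    std::printf("sync: %s\n", cudaGetErrorString(cudaDeviceSynchronize()));

    // The error is sticky: even an unrelated, otherwise valid call keeps
    // failing in this process until it exits.
    void *p = nullptr;
    std::printf("cudaMalloc after the fault: %s\n",
                cudaGetErrorString(cudaMalloc(&p, 16)));
    return 0;
}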
 
 

If you don't see the connection logs, please do the following:

  1. Console 2:
    sudo rm -rf /tmp/mps
    export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps
    export CUDA_MPS_LOG_DIRECTORY=/tmp/mps
    nvidia-cuda-mps-control -f
  2. Console 1:
    Replace crash_client.cu with the code above, then run ./test.sh several times.

Hi,

Thanks for the help.

We can see the error and are now checking with our internal team.
Will provide more information to you later.

Hi,

After confirming with our internal team, this is expected behavior.

MPS doesn’t support any sort of error isolation currently.
So if one process faults, all of the MPS clients are affected.
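
One way to limit the impact (a suggestion only, not a feature of MPS itself) is for each client to treat any fatal CUDA error as unrecoverable, exit, and let an external supervisor restart the process, since the sticky error cannot be cleared from inside the affected process. A minimal sketch of that pattern, with purely illustrative names:

#include <cuda_runtime.h>
#include <cstdio>

__global__ void workKernel(int *out) {
    *out = 42;  // placeholder for the client's real work
}

int main() {
    int *d_out = nullptr;
    if (cudaMalloc(&d_out, sizeof(int)) != cudaSuccess) {
        return 2;  // no usable context; let the supervisor retry later
    }

    for (int iter = 0; iter < 1000000; ++iter) {
        workKernel<<<1, 1>>>(d_out);
        cudaError_t err = cudaDeviceSynchronize();
        if (err != cudaSuccess) {
            // The fault may have come from another MPS client, but the error
            // is sticky for this process too; the only recovery is to exit
            // and let a supervisor (systemd, a wrapper script, ...) start a
            // fresh process.
            std::fprintf(stderr, "Fatal CUDA error: %s, exiting for restart\n",
                         cudaGetErrorString(err));
            return 1;
        }
    }

    cudaFree(d_out);
    return 0;
}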

Thanks.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.