MPS server hits a sticky error when one process has a segmentation fault

Hi

We use the MPS feature on Thor, but the MPS server hits a sticky error if one process has a segmentation fault.

For example, process B is a normal program and process A has a segmentation fault. If we run these two demos under MPS, the MPS server hits a sticky error.
Is there any way to keep the processes from affecting each other?
Attached is a demo that reproduces this issue.
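
For context, the benign client (process B) in the demo is just an ordinary CUDA program. The actual source is inside the attached mps.gz, so the sketch below is only an illustration of what such a client might look like; the kernel and names are assumptions, not the real demo code.

#include <cuda_runtime.h>
#include <iostream>

// Illustrative stand-in for the benign client (process B): a long-running
// but perfectly valid kernel.
__global__ void busyKernel(float *data, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < iters; ++k) {
            v = v * 1.000001f + 0.5f;   // just keep the GPU busy for a while
        }
        data[i] = v;
    }
}

int main() {
    const int n = 1 << 20;
    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    busyKernel<<<(n + 255) / 256, 256>>>(d_data, n, 100000);
    cudaError_t err = cudaDeviceSynchronize();

    // Under MPS this can report an error even though this process did nothing
    // wrong, if another client faulted while the kernel was running.
    std::cout << "Done, err=" << cudaGetErrorString(err) << std::endl;

    cudaFree(d_data);
    return 0;
}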

Steps to reproduce:

  1. ./make.sh
  2. nvidia-cuda-mps-control -f
  3. Run test.sh 2 or 3 times; the MPS server will hit a sticky error.

[2025-11-13 17:01:56.335 Control 154480] NEW CLIENT 154579 from user 2002: Server already exists
[2025-11-13 17:01:56.335 Server 154541] Received new client request for {PID: 154577, Context ID: 1}
[2025-11-13 17:01:56.335 Server 154541] Client {PID: 154579, Context ID: 1} connected
[2025-11-13 17:01:56.335 Server 154541] Creating worker thread for client {PID: 154579, Context ID: 1}
[2025-11-13 17:01:56.335 Server 154541] Device NVIDIA Thor (uuid GPU-a7c66ad2-6dbb-0ab8-c1a2-37ba6dba3600) is associated
[2025-11-13 17:01:56.335 Server 154541] Status of client {PID: 154579, Context ID: 1} is ACTIVE
[2025-11-13 17:01:56.430 Server 154541] Client {PID: 154579, Context: 1} encountered a fatal GPU error.
[2025-11-13 17:01:56.430 Server 154541] Server is handling a fatal GPU error.
[2025-11-13 17:01:56.430 Server 154541] Client {PID: 154577, Context ID: 1} on device 0 to be torn down
[2025-11-13 17:01:56.430 Server 154541] Client {PID: 154579, Context ID: 1} on device 0 to be torn down
[2025-11-13 17:01:56.430 Server 154541] All clients belonging to error trigging process 154577 will be torn down
[2025-11-13 17:01:56.430 Server 154541] Client {PID: 154577, Context ID: 1} on the error-trigging process 154577 to be torn down
[2025-11-13 17:01:56.430 Server 154541] All clients belonging to error trigging process 154579 will be torn down
[2025-11-13 17:01:56.430 Server 154541] Client {PID: 154579, Context ID: 1} on the error-trigging process 154579 to be torn down
[2025-11-13 17:01:56.430 Server 154541] The following devices will be reset:
[2025-11-13 17:01:56.430 Server 154541] 0
[2025-11-13 17:01:56.430 Server 154541] The following client process have a sticky error set:
[2025-11-13 17:01:56.430 Server 154541] 154577
[2025-11-13 17:01:56.430 Server 154541] 154579
[2025-11-13 17:01:56.471 Server 154541] Server failed to recevie command with status 806, assuming client {PID: 154579, Context ID: 1} exit
[2025-11-13 17:01:56.471 Server 154541] Client {PID: 154579, Context ID: 1} exiting. Number of active client contexts is 1.
[2025-11-13 17:01:56.474 Server 154541] Server failed to recevie command with status 806, assuming client {PID: 154579, Context ID: 0} exit
[2025-11-13 17:01:56.474 Server 154541] Client process 154579 disconnected
[2025-11-13 17:01:56.499 Server 154541] Server failed to recevie command with status 806, assuming client {PID: 154577, Context ID: 1} exit
[2025-11-13 17:01:56.499 Server 154541] Client {PID: 154577, Context ID: 1} exiting. Number of active client contexts is 0.
[2025-11-13 17:01:56.499 Server 154541] Destroy server context on device 0 (NVIDIA Thor)
[2025-11-13 17:01:56.737 Server 154541] Server failed to recevie command with status 806, assuming client {PID: 154577, Context ID: 0} exit
[2025-11-13 17:01:56.737 Server 154541] Client process 154577 disconnected

mps.gz (10 KB)

Hi,

Thanks for reporting this.
We will try to reproduce this locally and share more information with you.

Hi,

We tested your sample on our local setup but failed to reproduce the issue.
Could you check if anything is missing in our steps?

  1. Enable MPS (console1)
$ export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps
$ export CUDA_MPS_LOG_DIRECTORY=/tmp/mps
$ sudo -E nvidia-cuda-mps-control -f
[2025-11-17 05:59:25.839 Control 481299] Starting control daemon using socket /tmp/mps/control
[2025-11-17 05:59:25.839 Control 481299] To connect CUDA applications to this daemon, set env CUDA_MPS_PIPE_DIRECTORY=/tmp/mps
[2025-11-17 05:59:25.839 Control 481299] CUDA MPS Control binary version: 13000
  2. Run the test (console2)
$ ./test.sh 
Launching small kernel...
Launching long-running kernel...
Done, err=no error

We repeated the console 2 step around 10 times, but no error is shown.
However, we don’t see the connection logs from console 1.

Is anything missing in our setup?

Thanks.

Hi, Aasta

This is a probabilistic issue. Please replace crash_client.cu with the code below; it makes it very easy to drive the MPS server into a sticky error.

#include <cuda_runtime.h>
#include <iostream>

__global__ void crashKernel() {
    int *ptr = (int*)0xFFFFFFFF;   // illegal device memory address
    ptr[0] = 123;                  // force a GPU fault
}

int main() {
    std::cout << "Launching illegal memory kernel..." << std::endl;

    crashKernel<<<1, 1>>>();
    cudaError_t err = cudaDeviceSynchronize();  // always reports an error, but may block
    std::cout << "err=" << cudaGetErrorString(err) << std::endl;

    return 0;
}
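
For what it's worth, the error produced this way is "sticky" within the faulting process: once the kernel has hit the illegal address, later CUDA calls in the same process keep returning an error until the process exits. A small self-contained variant (an illustration only, not part of the attached demo) that makes this visible:

#include <cuda_runtime.h>
#include <cstdio>

__global__ void faultKernel() {
    *(int *)0xFFFFFFFF = 123;  // same illegal write as crashKernel above
}

int main() {
    faultKernel<<<1, 1>>>();
    std::printf("sync: %s\n", cudaGetErrorString(cudaDeviceSynchronize()));

    // The error is sticky: even an unrelated, otherwise valid call keeps
    // failing in this process until it exits.
    void *p = nullptr;
    std::printf("cudaMalloc after the fault: %s\n",
                cudaGetErrorString(cudaMalloc(&p, 16)));
    return 0;
}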
 
 

If you don't see the connection logs, please do the following:

  1. Console 2:
    sudo rm -rf /tmp/mps
    export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps
    export CUDA_MPS_LOG_DIRECTORY=/tmp/mps
    nvidia-cuda-mps-control -f
  2. Console 1:
    Replace crash_client.cu with the code above, then run ./test.sh several times.

Hi,

Thanks for the help.

We can see the error and are now checking with our internal team.
Will provide more information to you later.

Hi,

After confirming with our internal team, this is expected behavior.

MPS doesn’t support any sort of error isolation currently.
So if one process faults, all of the MPS clients are affected.
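
One way to limit the impact (a suggestion only, not a feature of MPS itself) is for each client to treat any fatal CUDA error as unrecoverable, exit, and let an external supervisor restart the process, since the sticky error cannot be cleared from inside the affected process. A minimal sketch of that pattern, with purely illustrative names:

#include <cuda_runtime.h>
#include <cstdio>

__global__ void workKernel(int *out) {
    *out = 42;  // placeholder for the client's real work
}

int main() {
    int *d_out = nullptr;
    if (cudaMalloc(&d_out, sizeof(int)) != cudaSuccess) {
        return 2;  // no usable context; let the supervisor retry later
    }

    for (int iter = 0; iter < 1000000; ++iter) {
        workKernel<<<1, 1>>>(d_out);
        cudaError_t err = cudaDeviceSynchronize();
        if (err != cudaSuccess) {
            // The fault may have come from another MPS client, but the error
            // is sticky for this process too; the only recovery is to exit
            // and let a supervisor (systemd, a wrapper script, ...) start a
            // fresh process.
            std::fprintf(stderr, "Fatal CUDA error: %s, exiting for restart\n",
                         cudaGetErrorString(err));
            return 1;
        }
    }

    cudaFree(d_out);
    return 0;
}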

Thanks.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.