Hi,
We use MPS (CUDA Multi-Process Service) on Thor, but the MPS server hits a sticky error if one client process has a segmentation fault.
For example, process B is a normal program and process A has a segmentation fault. If we run both demos under MPS, the MPS server hits a sticky error and both clients are torn down.
Is there any way to keep client processes from affecting each other?
The attached demo reproduces the issue.
Steps to reproduce:
- ./make.sh
- nvidia-cuda-mps-control -f
- run test.sh 2 or 3 times; the MPS server then reports a sticky error.
MPS log output from a failing run:
[2025-11-13 17:01:56.335 Control 154480] NEW CLIENT 154579 from user 2002: Server already exists
[2025-11-13 17:01:56.335 Server 154541] Received new client request for {PID: 154577, Context ID: 1}
[2025-11-13 17:01:56.335 Server 154541] Client {PID: 154579, Context ID: 1} connected
[2025-11-13 17:01:56.335 Server 154541] Creating worker thread for client {PID: 154579, Context ID: 1}
[2025-11-13 17:01:56.335 Server 154541] Device NVIDIA Thor (uuid GPU-a7c66ad2-6dbb-0ab8-c1a2-37ba6dba3600) is associated
[2025-11-13 17:01:56.335 Server 154541] Status of client {PID: 154579, Context ID: 1} is ACTIVE
[2025-11-13 17:01:56.430 Server 154541] Client {PID: 154579, Context: 1} encountered a fatal GPU error.
[2025-11-13 17:01:56.430 Server 154541] Server is handling a fatal GPU error.
[2025-11-13 17:01:56.430 Server 154541] Client {PID: 154577, Context ID: 1} on device 0 to be torn down
[2025-11-13 17:01:56.430 Server 154541] Client {PID: 154579, Context ID: 1} on device 0 to be torn down
[2025-11-13 17:01:56.430 Server 154541] All clients belonging to error trigging process 154577 will be torn down
[2025-11-13 17:01:56.430 Server 154541] Client {PID: 154577, Context ID: 1} on the error-trigging process 154577 to be torn down
[2025-11-13 17:01:56.430 Server 154541] All clients belonging to error trigging process 154579 will be torn down
[2025-11-13 17:01:56.430 Server 154541] Client {PID: 154579, Context ID: 1} on the error-trigging process 154579 to be torn down
[2025-11-13 17:01:56.430 Server 154541] The following devices will be reset:
[2025-11-13 17:01:56.430 Server 154541] 0
[2025-11-13 17:01:56.430 Server 154541] The following client process have a sticky error set:
[2025-11-13 17:01:56.430 Server 154541] 154577
[2025-11-13 17:01:56.430 Server 154541] 154579
[2025-11-13 17:01:56.471 Server 154541] Server failed to recevie command with status 806, assuming client {PID: 154579, Context ID: 1} exit
[2025-11-13 17:01:56.471 Server 154541] Client {PID: 154579, Context ID: 1} exiting. Number of active client contexts is 1.
[2025-11-13 17:01:56.474 Server 154541] Server failed to recevie command with status 806, assuming client {PID: 154579, Context ID: 0} exit
[2025-11-13 17:01:56.474 Server 154541] Client process 154579 disconnected
[2025-11-13 17:01:56.499 Server 154541] Server failed to recevie command with status 806, assuming client {PID: 154577, Context ID: 1} exit
[2025-11-13 17:01:56.499 Server 154541] Client {PID: 154577, Context ID: 1} exiting. Number of active client contexts is 0.
[2025-11-13 17:01:56.499 Server 154541] Destroy server context on device 0 (NVIDIA Thor)
[2025-11-13 17:01:56.737 Server 154541] Server failed to recevie command with status 806, assuming client {PID: 154577, Context ID: 0} exit
[2025-11-13 17:01:56.737 Server 154541] Client process 154577 disconnected
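To make it easier to see which clients were affected in a run, here is a rough sketch that pulls the PIDs listed under the "sticky error set" line out of a log like the one above. It only assumes the line layout visible in this excerpt; the function name `sticky_error_pids` is my own and is not part of any NVIDIA tool.

```python
import re

# Assumed layout, taken only from the excerpt above:
# "[<date> <time> <Control|Server> <daemon pid>] <message>"
LINE_RE = re.compile(r"^\[[^\]]*? (?:Control|Server) \d+\] (?P<msg>.*)$")
STICKY_HEADER = "have a sticky error set"


def sticky_error_pids(log_text):
    """Return the client PIDs listed right after the sticky-error header line."""
    pids = []
    collecting = False
    for raw in log_text.splitlines():
        match = LINE_RE.match(raw.strip())
        if not match:
            collecting = False
            continue
        msg = match.group("msg").strip()
        if STICKY_HEADER in msg:
            # The following lines each carry one bare PID.
            collecting = True
            continue
        if collecting and msg.isdigit():
            pids.append(int(msg))
        else:
            # Any other message ends the PID list.
            collecting = False
    return pids


if __name__ == "__main__":
    sample = "\n".join([
        "[2025-11-13 17:01:56.430 Server 154541] The following devices will be reset:",
        "[2025-11-13 17:01:56.430 Server 154541] 0",
        "[2025-11-13 17:01:56.430 Server 154541] The following client process have a sticky error set:",
        "[2025-11-13 17:01:56.430 Server 154541] 154577",
        "[2025-11-13 17:01:56.430 Server 154541] 154579",
        "[2025-11-13 17:01:56.471 Server 154541] Server failed to recevie command with status 806, assuming client {PID: 154579, Context ID: 1} exit",
    ])
    print(sticky_error_pids(sample))  # [154577, 154579]
```

On the log above this returns both 154577 and 154579, i.e. both client processes are marked with a sticky error, not just one of them.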
mps.gz (10 KB)