I have a workstation with 4 GPUs (4x 5880 Ada). I want to use MPS to make better use of GPU memory.
Each of my jobs only needs 2 GB of memory. I read that MPS supports at most 48 clients, so I am trying to deploy one MPS server per GPU.
I used these commands to start MPS:
export CUDA_VISIBLE_DEVICES=0
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps-0
export CUDA_MPS_LOG_DIRECTORY=/var/log/nvidia-mps-0
sudo mkdir -p /tmp/nvidia-mps-0 /var/log/nvidia-mps-0
sudo chmod 777 /tmp/nvidia-mps-0 /var/log/nvidia-mps-0
sudo nvidia-cuda-mps-control -d
export CUDA_VISIBLE_DEVICES=1
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps-1
export CUDA_MPS_LOG_DIRECTORY=/var/log/nvidia-mps-1
sudo mkdir -p /tmp/nvidia-mps-1 /var/log/nvidia-mps-1
sudo chmod 777 /tmp/nvidia-mps-1 /var/log/nvidia-mps-1
sudo nvidia-cuda-mps-control -d
export CUDA_VISIBLE_DEVICES=2
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps-2
export CUDA_MPS_LOG_DIRECTORY=/var/log/nvidia-mps-2
sudo mkdir -p /tmp/nvidia-mps-2 /var/log/nvidia-mps-2
sudo chmod 777 /tmp/nvidia-mps-2 /var/log/nvidia-mps-2
sudo nvidia-cuda-mps-control -d
export CUDA_VISIBLE_DEVICES=3
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps-3
export CUDA_MPS_LOG_DIRECTORY=/var/log/nvidia-mps-3
sudo mkdir -p /tmp/nvidia-mps-3 /var/log/nvidia-mps-3
sudo chmod 777 /tmp/nvidia-mps-3 /var/log/nvidia-mps-3
sudo nvidia-cuda-mps-control -d
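The four blocks above follow one pattern, so they could be condensed into a loop. This is only a sketch under a few assumptions: GPU indices 0 to 3, the same directory layout as above, and the variables are passed through `sudo env` explicitly, because `sudo` with its default configuration may not forward exported variables to the daemon. `DRY_RUN`, `run` and `start_mps_for_gpu` are names made up for this illustration. By default the commands are only printed; set `DRY_RUN=0` to execute them.

```shell
#!/usr/bin/env bash
# Sketch: start one MPS control daemon per GPU (indices 0-3 assumed).
# DRY_RUN, run and start_mps_for_gpu are illustrative names, not NVIDIA tooling.
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, 0 = execute them

run() {
    # Print the command in dry-run mode, otherwise execute it.
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

start_mps_for_gpu() {
    local gpu="$1"
    local pipe_dir="/tmp/nvidia-mps-${gpu}"
    local log_dir="/var/log/nvidia-mps-${gpu}"

    run sudo mkdir -p "$pipe_dir" "$log_dir"
    run sudo chmod 777 "$pipe_dir" "$log_dir"
    # Pass the settings on the sudo command line so they reach the daemon
    # even if sudo resets the environment.
    run sudo env \
        CUDA_VISIBLE_DEVICES="$gpu" \
        CUDA_MPS_PIPE_DIRECTORY="$pipe_dir" \
        CUDA_MPS_LOG_DIRECTORY="$log_dir" \
        nvidia-cuda-mps-control -d
}

for gpu in 0 1 2 3; do
    start_mps_for_gpu "$gpu"
done
```

Running it once in dry-run mode makes it easy to compare the generated commands with the manual ones before starting any daemon.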
That looked fine: I can see 4 separate MPS control processes.
root 2842080 1 0 09:41 ? 00:00:01 nvidia-cuda-mps-control -d
root 2842090 1 0 09:42 ? 00:00:01 nvidia-cuda-mps-control -d
root 2842098 1 0 09:43 ? 00:00:01 nvidia-cuda-mps-control -d
root 2842104 1 0 09:44 ? 00:00:01 nvidia-cuda-mps-control -d
Then I tried to run my job the same way as before:
docker run -d --gpus '"device=1"' -e CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps-1 -e CUDA_MPS_LOG_DIRECTORY=/var/log/nvidia-mps-1 -v /tmp/nvidia-mps-1/:/tmp/nvidia-mps-1 -v /var/log/nvidia-mps-1/:/var/log/nvidia-mps-1/ xxxxx
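For MPS clients running in containers, a commonly suggested requirement is that the container shares the host IPC namespace, in addition to mounting the pipe directory. The sketch below builds such a command for a given GPU index. The `--ipc=host` flag is an assumption about what this setup may be missing, not a confirmed fix; `mps_docker_cmd` is a hypothetical helper name, and `xxxxx` stands for the image name elided above. The command is printed rather than executed so it can be inspected first.

```shell
#!/usr/bin/env bash
# Sketch: print a docker run command that attaches a container to the
# MPS daemon of one GPU. mps_docker_cmd is an illustrative helper name.
mps_docker_cmd() {
    local gpu="$1"
    local image="$2"
    local pipe_dir="/tmp/nvidia-mps-${gpu}"
    local log_dir="/var/log/nvidia-mps-${gpu}"

    # --ipc=host is an assumption: it lets the client share the host IPC
    # namespace with the MPS server. It is not a confirmed fix.
    printf '%s ' \
        docker run -d \
        --gpus "'\"device=${gpu}\"'" \
        --ipc=host \
        -e "CUDA_MPS_PIPE_DIRECTORY=${pipe_dir}" \
        -e "CUDA_MPS_LOG_DIRECTORY=${log_dir}" \
        -v "${pipe_dir}:${pipe_dir}" \
        -v "${log_dir}:${log_dir}" \
        "$image"
    printf '\n'
}

# Example: command for GPU 1 (image name elided as in the post).
mps_docker_cmd 1 xxxxx
```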
Then I got a segmentation fault:
[bcb510b93cba:355 :0:355] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x18)
==== backtrace (tid: 355) ====
0 0x0000000000042520 __sigaction() ???:0
1 0x00000000002b91ea cuEGLApiInit() ???:0
2 0x000000000024e906 cuEGLApiInit() ???:0
3 0x00000000002c7208 cuDeviceGetDefaultMemPool() ???:0
4 0x0000000000034b5a __cudaRegisterUnifiedTable() ???:0
5 0x0000000000037f20 __cudaRegisterUnifiedTable() ???:0
6 0x0000000000099ee8 pthread_mutexattr_setkind_np() ???:0
7 0x000000000007e139 cudaGraphicsVDPAURegisterOutputSurface() ???:0
8 0x0000000000028e4f __cudaRegisterUnifiedTable() ???:0
9 0x000000000004b7fa cudaGetDeviceCount() ???:0
10 0x0000000000048fc2 c10::cuda::device_count() ???:0
11 0x0000000000af96f3 THCPModule_getDeviceCount_wrap() ???:0
12 0x000000000015d64e PyObject_GetAttr() ???:0
13 0x000000000014e8a2 _PyEval_EvalFrameDefault() ???:0
14 0x000000000016070c _PyFunction_Vectorcall() ???:0
15 0x000000000014e8a2 _PyEval_EvalFrameDefault() ???:0
16 0x0000000000239e56 PyEval_EvalCode() ???:0
17 0x0000000000239cf6 PyEval_EvalCode() ???:0
18 0x00000000002647d8 PyUnicode_Tailmatch() ???:0
19 0x000000000025e0bb PyInit__collections() ???:0
20 0x00000000000b74d0 _PyRun_InteractiveLoopObject() ???:0
21 0x00000000000b7012 _PyRun_InteractiveLoopObject() ???:0
22 0x0000000000263678 _PyRun_AnyFileObject() ???:0
23 0x00000000000a15c8 PyRun_AnyFileExFlags() ???:0
24 0x00000000000966e8 _Py_str_to_int() ???:0
25 0x000000000022ccad Py_BytesMain() ???:0
26 0x0000000000029d90 __libc_init_first() ???:0
27 0x0000000000029e40 __libc_start_main() ???:0
28 0x000000000022cba5 _start() ???:0
Segmentation fault (core dumped)
Inside the container, nvidia-smi shows the GPU info. I can also see that the MPS control daemon started an MPS server:
1 N/A N/A 2857384 C nvidia-cuda-mps-server 28MiB
The control daemon log is:
[2025-01-22 15:30:31.904 Control 2842090] NEW CLIENT 2857886 from user 0: Server already exists
[2025-01-22 15:30:32.916 Control 2842090] Accepting connection...
[2025-01-22 15:30:32.917 Control 2842090] User did not send valid credentials
[2025-01-22 15:30:32.917 Control 2842090] Accepting connection...
[2025-01-22 15:30:32.917 Control 2842090] NEW CLIENT 2857889 from user 0: Server already exists
[2025-01-22 15:30:33.930 Control 2842090] Accepting connection...
[2025-01-22 15:30:33.931 Control 2842090] User did not send valid credentials
[2025-01-22 15:30:33.931 Control 2842090] Accepting connection...
[2025-01-22 15:30:33.931 Control 2842090] NEW CLIENT 2857893 from user 0: Server already exists
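To see how often the credential rejection occurs relative to new client requests, the control log can be summarized with two `grep` calls. This is a sketch, assuming the log is the `control.log` file inside the directory set by `CUDA_MPS_LOG_DIRECTORY`; `summarize_mps_control_log` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: count new-client and rejected-credential lines in an MPS control log.
# summarize_mps_control_log is an illustrative helper name.
summarize_mps_control_log() {
    local log_file="$1"
    local rejected accepted
    # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
    rejected="$(grep -c 'User did not send valid credentials' "$log_file" || true)"
    accepted="$(grep -c 'NEW CLIENT' "$log_file" || true)"
    echo "new clients: ${accepted}, rejected credentials: ${rejected}"
}

# Usage (path is an assumption based on the log directory set above):
# summarize_mps_control_log /var/log/nvidia-mps-1/control.log
```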
The server log is:
[2025-01-22 15:30:30.893 Server 2857384] Received new client request
[2025-01-22 15:30:30.893 Server 2857384] Worker created
[2025-01-22 15:30:30.893 Server 2857384] Creating worker thread
[2025-01-22 15:30:31.124 Server 2857384] Receive command failed, assuming client exit
[2025-01-22 15:30:31.124 Server 2857384] Client process disconnected
[2025-01-22 15:30:31.904 Server 2857384] Received new client request
[2025-01-22 15:30:31.904 Server 2857384] Worker created
[2025-01-22 15:30:31.904 Server 2857384] Creating worker thread
[2025-01-22 15:30:32.128 Server 2857384] Receive command failed, assuming client exit
[2025-01-22 15:30:32.128 Server 2857384] Client process disconnected
[2025-01-22 15:30:32.917 Server 2857384] Received new client request
[2025-01-22 15:30:32.918 Server 2857384] Worker created
[2025-01-22 15:30:32.918 Server 2857384] Creating worker thread
[2025-01-22 15:30:33.144 Server 2857384] Receive command failed, assuming client exit
[2025-01-22 15:30:33.144 Server 2857384] Client process disconnected
[2025-01-22 15:30:33.931 Server 2857384] Received new client request
I don’t know what is going wrong. Can anybody help me?