Machine Specs:
GPU: Nvidia Tesla T4 on Amazon AWS ECS
Driver Version: 535.161.08
CUDA Version: 12.2/12.3.1/12.8.1 (I tried three different versions; none of them works)
Issue Description:
Normally, when the MPS control daemon receives a connection from a client launched under a different user ID than that of the existing server, it requests the existing server to shut down once all of its clients have disconnected. Once the existing server has shut down, the control daemon launches a new server with the same user ID as the new client process.
According to the latest MPS documentation (section 3.3.1), on Volta MPS the above restriction of one Linux user per MPS server can be relaxed. This allows multiple users to connect to the same MPS server without the need for reprovisioning. In this mode, clients from different Linux users appear as clients of the root user and connect to the root MPS server. To allow multiple Linux users to share one MPS server, start the control daemon as superuser with the -multiuser-server option.
I can successfully start the MPS control daemon without the -multiuser-server option. However, when I try to start it with the -multiuser-server option, I get the following error:
/home/ubuntu$ sudo nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
Set compute mode to EXCLUSIVE_PROCESS for GPU 00000000:00:1E.0.
All done.
/home/ubuntu$ docker run -ti --rm -e NVIDIA_VISIBLE_DEVICES=0 --runtime=nvidia -v /tmp:/tmp --ipc=host nvidia/cuda:12.3.1-base-ubuntu20.04
root@d47e396066af:/# export CUDA_VISIBLE_DEVICES=0
root@d47e396066af:/# export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_0
root@d47e396066af:/# export CUDA_MPS_LOG_DIRECTORY=/tmp/mps_log_0
root@d47e396066af:/# nvidia-cuda-mps-control -d -multiuser-server
Cannot find MPS control daemon process
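For reference, a minimal sketch of how the daemon state could be inspected right after the failed start. It assumes the same environment variables as in the transcript above, falls back to the default MPS pipe and log directories when they are unset, and assumes the control daemon writes to control.log in the log directory. The function name mps_diagnose is hypothetical and not part of the MPS tooling.

```shell
#!/bin/sh
# Hypothetical diagnostic helper; mps_diagnose is not part of the MPS tooling.

mps_diagnose() {
    # Fall back to the default MPS locations when the variables are unset.
    pipe_dir="${CUDA_MPS_PIPE_DIRECTORY:-/tmp/nvidia-mps}"
    log_dir="${CUDA_MPS_LOG_DIRECTORY:-/var/log/nvidia-mps}"

    # -multiuser-server requires the control daemon to run as superuser (uid 0).
    echo "effective uid: $(id -u)"

    if [ -d "$pipe_dir" ]; then
        echo "pipe directory: $pipe_dir (present)"
    else
        echo "pipe directory: $pipe_dir (missing)"
    fi

    # Match on the full command line (-f): process names are truncated to
    # 15 characters, so an exact-name match would never succeed here.
    if command -v pgrep >/dev/null 2>&1; then
        if pgrep -f nvidia-cuda-mps-control >/dev/null 2>&1; then
            echo "control daemon: running"
        else
            echo "control daemon: not running"
        fi
    else
        echo "control daemon: unknown (pgrep not installed)"
    fi

    # Show the tail of the control log, if the daemon got far enough to write one.
    if [ -f "$log_dir/control.log" ]; then
        echo "last lines of $log_dir/control.log:"
        tail -n 20 "$log_dir/control.log"
    else
        echo "control log: $log_dir/control.log (missing)"
    fi
}

mps_diagnose
```

Running this inside the container after the failing nvidia-cuda-mps-control command would show whether the daemon exited immediately and whether it left anything in the control log.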
I also found that others have run into the same issue: Turning on multiuser-server on Volta GPUs
Questions:
Is this the correct way to start the MPS control daemon with the -multiuser-server option?
Any insights would be greatly appreciated. Thanks in advance!