CUDA MPS server blocks applications from starting on other GPUs, even though the MPS server is configured to run only on one particular GPU.
Here’s the scenario:
The server has 7 P100 GPUs (0-6), and the CUDA MPS server is started on GPU 0. Another application (based on Theano and Keras) is launched on GPU 3, but it never starts. When the CUDA MPS server is stopped, applications run properly on the other GPUs. Each application is restricted to a particular GPU using the CUDA_VISIBLE_DEVICES environment variable.
Here’s what I find in the MPS control log:
[2017-10-27 08:08:51.723 Control 24078] Starting new server 24232 for user 1002
[2017-10-27 08:08:53.168 Control 24078] Accepting connection…
[2017-10-27 08:08:53.169 Control 24078] NEW SERVER 24232: Ignoring connection from user
[2017-10-27 08:08:53.919 Control 24078] Server 24232 exited with status 0
Has anyone encountered an issue like this?
Any help on this is greatly appreciated.
The CUDA version in our environment is 8.0 (v8.0.61).
Regards
Bharath
Have you specified CUDA_VISIBLE_DEVICES when starting the MPS server?
Have you placed the necessary GPUs in exclusive process mode?
Are the non-MPS-managed GPUs in default compute mode?
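The last two questions can be checked from the command line. The following is a minimal sketch, assuming nvidia-smi is on the PATH; the helper name list_non_default is made up for illustration. It lists every GPU whose compute mode is something other than Default.

```shell
#!/usr/bin/env bash
# Sketch: report GPUs that are not in Default compute mode.
# list_non_default is a hypothetical helper; it reads "index, compute_mode"
# CSV rows on stdin and prints the rows whose mode is not "Default".
list_non_default() {
  awk -F', *' '$2 != "Default" { print "GPU " $1 ": " $2 }'
}

if command -v nvidia-smi >/dev/null 2>&1; then
  # One CSV row per GPU, without the header line.
  nvidia-smi --query-gpu=index,compute_mode --format=csv,noheader | list_non_default
else
  echo "nvidia-smi not found on PATH" >&2
fi
```

With the setup described in this thread, only the MPS-managed GPU would be expected to show up in the output.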
Yes, I specify CUDA_VISIBLE_DEVICES and place the specific GPU in exclusive_process mode.
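For reference, a setup along these lines can be sketched as follows. This is only an illustration of the usual sequence, not the exact commands used here: GPU 0 is taken from the scenario above, the two directories are the ones used in Scenario 2, and the DRY_RUN switch and the function names are assumptions added so that the script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Sketch: restrict the MPS control daemon to a single GPU.
# Assumptions (not from the thread): the DRY_RUN switch and the helper
# names run, start_mps and stop_mps are hypothetical.
MPS_GPU=0
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log

# Print a command, and execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

start_mps() {
  # Changing the compute mode normally requires root.
  run nvidia-smi -i "$MPS_GPU" -c EXCLUSIVE_PROCESS
  # The daemon only sees the GPUs listed in CUDA_VISIBLE_DEVICES.
  run env CUDA_VISIBLE_DEVICES="$MPS_GPU" nvidia-cuda-mps-control -d
}

stop_mps() {
  # Sending "quit" on the control daemon's stdin shuts it down.
  echo "+ echo quit | nvidia-cuda-mps-control"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    echo quit | nvidia-cuda-mps-control
  fi
}

start_mps
```

Running it with DRY_RUN=0 would actually change the compute mode and start the daemon; stop_mps is defined but not called, so the daemon stays up until it is invoked.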
Scenario 1:
I run the MPS server as root and place the specific GPU in exclusive process mode.
An update on this issue: when standalone processes are run with CUDA_VISIBLE_DEVICES=0, they run as clients of the MPS server.
When CUDA_VISIBLE_DEVICES is set to a GPU index greater than 0, the processes error out, reporting that no GPU is available.
Scenario 2:
I run the MPS server as a non-root user and leave the GPU in default compute mode rather than exclusive mode.
After setting the following environment variables:
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
the processes that are run eventually connect to the MPS server.
However, processes that are run on other GPUs (by setting the appropriate CUDA_VISIBLE_DEVICES) are blocked and hang.
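One experiment that might help narrow this down, offered as an untested idea rather than a confirmed fix: since clients locate the control daemon through CUDA_MPS_PIPE_DIRECTORY, processes meant for the other GPUs could be launched with that variable pointing at a directory where no daemon is listening. The sketch below assumes this is what causes the hang; launch_on_gpu, NO_MPS_PIPE_DIR and the path /tmp/no-mps-daemon are hypothetical names chosen for illustration.

```shell
#!/usr/bin/env bash
# Sketch: give MPS clients and non-MPS processes different pipe directories.
# Assumptions: MPS_GPU, NO_MPS_PIPE_DIR and launch_on_gpu are hypothetical
# names; /tmp/no-mps-daemon stands for a directory with no control daemon.
MPS_GPU=0
MPS_PIPE_DIR=/tmp/nvidia-mps
NO_MPS_PIPE_DIR=/tmp/no-mps-daemon

# Usage: launch_on_gpu GPU_INDEX COMMAND [ARGS...]
launch_on_gpu() {
  local gpu="$1"
  shift
  local pipe_dir="$NO_MPS_PIPE_DIR"
  if [ "$gpu" = "$MPS_GPU" ]; then
    pipe_dir="$MPS_PIPE_DIR"
  fi
  # The variables are set for this one command only.
  env CUDA_VISIBLE_DEVICES="$gpu" CUDA_MPS_PIPE_DIRECTORY="$pipe_dir" "$@"
}

# Show the environment each kind of process would see.
launch_on_gpu 0 sh -c 'echo "GPU $CUDA_VISIBLE_DEVICES -> $CUDA_MPS_PIPE_DIRECTORY"'
launch_on_gpu 3 sh -c 'echo "GPU $CUDA_VISIBLE_DEVICES -> $CUDA_MPS_PIPE_DIRECTORY"'
```

In real use the `sh -c` command would be replaced by the actual application. If the process on GPU 3 starts normally when launched this way, that would suggest the blocked processes were trying to reach the MPS daemon.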
Does anyone have any suggestions or updates?
We upgraded the driver to the latest version (384.66), but the issue persists.
Any help on this would be great.
Regards
Bharath