CUDA MPS not allowing new jobs to start

Hello,

I was getting great mileage out of the MPS feature in recent CUDA versions on a machine featuring a V100, then another featuring a couple of RTX cards. However, when I try to replicate the success on other boxes, I find that it is impossible to start any CUDA jobs at all with MPS running.

The script I use to start MPS is simple:

#!/bin/bash

# Point the control daemon at explicit pipe and log directories,
# then launch it in daemon (-d) mode.
set -e
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
nvidia-cuda-mps-control -d

And the script I use to stop it is:

#!/bin/bash

# Ask the running MPS control daemon to shut down cleanly.
echo quit | nvidia-cuda-mps-control
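
For reference, a quick way to confirm the control daemon is at least reachable after starting it (assuming the get_server_list command is available in these CUDA versions) is something like:

#!/bin/bash

# Query the running control daemon over its pipe; an empty list is normal
# before any CUDA client has connected, an error means the daemon itself
# is not reachable.
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
echo get_server_list | nvidia-cuda-mps-control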

When I start MPS and then try to run a job, I see an nvidia-cuda-mps process busy enough to saturate one of the CPUs, and the error message “cudaGetDeviceCount failed unknown error” is printed to the screen each time I try to run a CUDA program. This is not the first such box to give me this problem, but I am not certain where it is coming from or why I’ve had such good results elsewhere. Can anyone point out something I am not doing right?
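
If it helps with diagnosis, here is how I can grab the MPS logs and check the daemon right after a failed run (control.log and server.log are, as far as I know, the default file names under the log directory, so adjust if yours differ):

#!/bin/bash

# Show the tail of the MPS logs right after a failed run; the control
# and server logs normally record why a client connection was refused.
tail -n 50 /tmp/nvidia-log/control.log
tail -n 50 /tmp/nvidia-log/server.log

# Check whether the control daemon (and any spawned server) is still alive.
ps -ef | grep nvidia-cuda-mps | grep -v grep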

Thanks,

Dave

What kind of boxes? What OS, what GPU? What CUDA version?

I assume you have already checked your setup against the list of known limitations documented in section 2.3.1.1 at the following link, to make sure none of them apply: https://docs.nvidia.com/deploy/mps/index.html
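
A couple of the environment-related items on that list can be checked straight from the shell, for example the compute mode of the GPUs and whether anyone else already holds contexts on them (EXCLUSIVE_PROCESS is the usual recommendation when running MPS):

#!/bin/bash

# Report the compute mode of every GPU, then the full status including the
# process list; other users' contexts on the same GPUs can interfere with MPS.
nvidia-smi -q -d COMPUTE | grep -i "compute mode"
nvidia-smi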

I had not seen that list of limitations before, but none of them seem to apply. These are all 64-bit Linux boxes, and only two of them don’t cooperate with MPS. In each case where MPS does not cooperate, the GPUs I’m trying to use are GTX 1080 Tis. With the RTX 2080 Ti, Titan V, and V100, I’ve been getting really nice results.

On one of the boxes that does not cooperate it’s CUDA 10.0, and on the other it’s CUDA 9.2.