Hello,
I was getting great mileage out of the MPS feature in recent CUDA versions on a machine featuring a V100, and then on another featuring a couple of RTX cards. However, when I try to replicate that success on other boxes, I find that it is impossible to start any CUDA jobs at all while MPS is running.
The script I use to start MPS is simple:
#!/bin/bash
set -e
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
nvidia-cuda-mps-control -d
And the script I use to stop it is:
#!/bin/bash
echo quit | nvidia-cuda-mps-control
When I engage MPS and then try to run a job, I see nvidia-cuda-mps working very hard, taking up one of the CPUs, and then I get the error message “cudaGetDeviceCount failed unknown error” printed to the screen every time I try to run a CUDA program. This is not the first such box to give me this problem, but I am not certain where it is coming from or why I’ve had such good results elsewhere. Can anyone point out something I am not doing right?
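In case it helps with diagnosis, here is a minimal sketch of the checks that could be run on a failing box while MPS is up. It assumes the same two directories as the start script, and that the control daemon and server write control.log and server.log into the log directory. The show_log helper is a hypothetical name introduced only for this sketch. Any client shell also needs the same CUDA_MPS_PIPE_DIRECTORY value in order to find the daemon, so the exports are repeated here.

```shell
#!/bin/bash
# Diagnostic sketch: gather MPS state on a box where CUDA clients fail.
# Assumes the same directories as the start script.
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log

# Hypothetical helper: print the last lines of one MPS log, if it is readable.
show_log() {
    local log="$CUDA_MPS_LOG_DIRECTORY/$1"
    if [ -r "$log" ]; then
        echo "== $log =="
        tail -n 20 "$log"
    else
        echo "== $log not found =="
    fi
}

# Report the compute mode of each GPU, if nvidia-smi is available.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -q -d COMPUTE || echo "nvidia-smi failed"
else
    echo "nvidia-smi not found"
fi

# The control daemon and the server keep separate logs.
show_log control.log
show_log server.log
```

Comparing this output between a working box and a failing one may narrow down whether the difference lies in the GPU compute mode or in something the server logs at client connect time.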
Thanks,
Dave