Error selecting compatible GPU all CUDA-capable devices are busy or unavailable

Hi All,

I am new to NVIDIA and GPU computing. We have a small cluster with some NVIDIA K20X GPUs that we run Amber jobs on. From time to time, a batch of Slurm jobs that land on a rogue GPU fail with the error “Error selecting compatible GPU all CUDA-capable devices are busy or unavailable”. We then have to drain the GPU and reboot the node, or reset the GPU, to try to revive it. Our GPUs are set to Exclusive Process mode. Is there a way to find out why these GPUs become unavailable? I don’t see other processes using the GPU when this happens. I would like to figure out what is causing this problem and how to fix it, rather than blindly rebooting the node and hoping it gets resolved.

Thanks for your help with this.
-J

Try running nvidia-smi on the affected node to see if processes are attached to those GPUs.
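For example, something along these lines (a rough sketch; the GPU index is a placeholder, and the exact display options can be checked with nvidia-smi --help on your driver version):

    # Summary table: per-GPU state, utilization, and the compute processes attached to each GPU
    nvidia-smi

    # Detailed query for GPU 0 only, limited to the process (PID) section
    nvidia-smi -q -i 0 -d PIDS

If the summary shows the GPU in an error state, or shows memory in use with no process listed, that is usually the GPU that needs attention.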

You can also use nvidia-smi to try to reset the GPU so that you don’t have to reboot the node. Use nvidia-smi --help to learn about the available command line options.
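As a sketch (the index 0 is a placeholder; the reset requires root, and it only succeeds when no processes are attached to the GPU):

    # Attempt to reset GPU 0 in place instead of rebooting the whole node
    nvidia-smi --gpu-reset -i 0

    # If the reset is refused, the full query output sometimes explains why
    nvidia-smi -q -i 0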

It’s a little puzzling that you say you have to “drain the GPU”. If your GPUs are set in exclusive process mode, then as soon as a process begins to use a GPU, no other processes can use it. Therefore it’s not surprising that you would have to drain the GPU in order to make it usable by other processes.
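To confirm how the GPUs are actually configured, something like the following should show the compute mode per GPU and, as root, change it; the exact field names can be verified with nvidia-smi --help-query-gpu:

    # Show the current compute mode for each GPU
    nvidia-smi --query-gpu=index,name,compute_mode --format=csv

    # Temporarily switch GPU 0 back to Default mode (run as root) to rule out
    # Exclusive Process mode as the cause
    nvidia-smi -i 0 -c DEFAULT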

You may also want to use ordinary Linux tools such as some variant of ps -ef to look for rogue, zombie, or other errant processes.
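For example (pmemd.cuda is just a guess at the Amber executable name; substitute whatever your jobs actually run):

    # Look for leftover Amber processes that may still be holding a GPU
    ps -ef | grep -i pmemd

    # Look for zombie/defunct processes (STAT column contains 'Z' in ps aux output)
    ps aux | awk '$8 ~ /Z/'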

Hi,

Have you solved this problem? It has happened to me several times, and I have had to reinstall the OS to fix it. Is there a better way? Thanks a lot!

By the way, I am using Win10.

Thanks,
Hang