I am new to NVIDIA and GPU computing. We have a small cluster with some NVIDIA K20X GPUs that we run Amber jobs on. From time to time, jobs scheduled through Slurm start failing in batches when they land on a rogue GPU, with the following error: “Error selecting compatible GPU: all CUDA-capable devices are busy or unavailable”. We then have to drain the node and reboot it, or reset the GPU, to try to revive it. Our GPUs are set to Exclusive Process compute mode.

Is there a way for me to find out why these GPUs become unavailable? I don’t see any other processes using the GPU when this happens. I would like to figure out what is causing the problem and how to fix it, rather than blindly rebooting the node and hoping it resolves itself.
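In case it helps, here is a minimal CUDA probe I was planning to run on a drained node to reproduce the failure outside of Amber (just a sketch; the probe name, the device-index argument, and the -arch=sm_35 flag for the K20X are my own assumptions). On a healthy GPU it should print OK; on the rogue GPU in Exclusive Process mode I would expect it to fail with the same “all CUDA-capable devices are busy or unavailable” error:

// probe.cu - check whether a CUDA context can be created on a device.
// Build: nvcc -arch=sm_35 probe.cu -o probe   (sm_35 = Kepler, e.g. K20X)
// Run:   ./probe [device_index]
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    int dev = (argc > 1) ? atoi(argv[1]) : 0;   // default to device 0
    cudaError_t err = cudaSetDevice(dev);
    if (err == cudaSuccess) {
        // cudaFree(0) forces context creation, which is where a wedged
        // Exclusive Process device should actually report the error.
        err = cudaFree(0);
    }
    if (err != cudaSuccess) {
        fprintf(stderr, "device %d: %s\n", dev, cudaGetErrorString(err));
        return 1;
    }
    printf("device %d: OK\n", dev);
    return 0;
}

Running this once per GPU index would at least tell me whether the device rejects new contexts even when nothing else appears to be using it.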
Thanks for your help with this.