vGPU guests fail after restart with "all CUDA-capable devices are busy or unavailable"

Curious for any ideas, as we've hit a roadblock. Is there any reason rebooting VMware RHEL VMs would start breaking nvidia-docker apps? Or how do we fix "all CUDA-capable devices are busy or unavailable" + "gridd: failed to initialize RM client" when nvidia-smi works and other VMs are able to use the GPU?

Background:

We had our hypervisor (VMware) + 2 RHEL guests (Q- or C-series vGPU profiles splitting GPU memory in half) + an nvidia-docker CUDA app running fine on both. Namely, nvidia-smi ran fine in the hypervisor and in the guests' Docker containers, and NVIDIA RAPIDS CUDA context creation / compute / etc. worked in Docker.
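For context, the "running fine" check on each guest was roughly this kind of RAPIDS sanity test (a sketch only; the column name and sizes are illustrative, not our actual workload):

```python
# Rough sketch of the per-container sanity check (illustrative, not our real app).
# cudf allocates on the GPU, so this exercises the driver + CUDA context + a kernel launch.
import cudf

df = cudf.DataFrame({"x": list(range(1000))})   # GPU allocation -> needs a working CUDA context
print(df["x"].sum())                            # simple reduction to confirm kernels actually run
```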

At some point, one of the GPU VMs was rebooted and the Docker containers failed to start. nvidia-smi no longer worked in that guest, yet it was fine in the hypervisor and in the other VM, and licensing seemed fine in the other VM. We reinstalled the guest's original GPU driver, restarted Docker, and re-created the Docker images, which immediately fixed nvidia-smi in the guest + Docker… but then we saw "all CUDA-capable devices are busy or unavailable" errors whenever the Docker app (RAPIDS) tried to actually use the GPU. In addition, the gridd logs showed the error "failed to initialize RM client". The GPU reported itself to be in Default compute mode (shared), and reinstalling the nvidia-docker runtime did not help.
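To take RAPIDS itself out of the picture, a minimal context probe like the sketch below (assuming the numba package that ships in the RAPIDS image) reproduces the same failure for us: nvidia-smi is happy, but the moment an actual CUDA context is created we get "all CUDA-capable devices are busy or unavailable".

```python
# Minimal CUDA context probe (sketch; assumes numba from the RAPIDS image).
# nvidia-smi succeeding only shows the driver answers queries; this forces an
# actual context creation + allocation, which is where the error surfaces.
import numpy as np
from numba import cuda

try:
    cuda.select_device(0)                    # create/bind a CUDA context on GPU 0
    dev = cuda.to_device(np.arange(4))       # tiny allocation to exercise the context
    print("CUDA context OK:", dev.copy_to_host())
except Exception as exc:                     # numba wraps driver errors in its own exception types
    print("CUDA context creation failed:", exc)
```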

More fun: we then took the second VM, which had been working continuously all along, and restarted it… and nvidia-smi broke at the guest level there too.

Hi Leo8,

Thanks for posting. I recommend that you open a support ticket so our support team can dig into the issue.

thanks
D

Happy to; any idea where? (This is on behalf of an F500 we're working with; we're an NVIDIA partner, and I was steered to this site before.)

We're trying to find someone with the proper access to the enterprise support portal. Meanwhile, any pointers would be appreciated; these systems were designed to avoid surprises like this, so we're curious.