Curious for any ideas, as we've hit a roadblock. Is there any reason rebooting VMware RHEL VMs would start breaking nvidia-docker apps? Or any idea how to fix ‘CUDA-capable devices are busy or unavailable’ + ‘gridd: failed to initialize RM client’ when nvidia-smi works and other VMs are able to use the GPU?
Background:
We had our hypervisor (VMware) + 2 RHEL guests (Q or C profiles, splitting GPU memory in half) + an nvidia-docker CUDA app running fine on both guests. Namely, nvidia-smi ran fine in the hypervisor, the guests, and docker, and Nvidia RAPIDS CUDA context creation / compute / etc. worked in docker.
At some point, one of the GPU VMs was rebooted and its docker images failed to start. nvidia-smi did not work in that guest, yet it was fine in the hypervisor and the other VM, and licensing seemed fine in the other VM. We reinstalled the guest’s original GPU driver, restarted docker, and created the docker images, which immediately fixed nvidia-smi in the guest + docker… but then we saw “all CUDA-capable devices are busy or unavailable” errors whenever the docker app (RAPIDS) actually tried to use the GPU. In addition, the gridd logs showed the error “failed to initialize RM client”. The GPU reported itself to be in Default (shared) compute mode, and reinstalling the nvidia-docker runtime did not help.
More fun: we then took the second VM, which had been working continuously all along, and restarted it… and nvidia-smi broke at the guest level there too.
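
In case it helps anyone compare notes, here is a rough sketch of how the guest-side checks above could be collected in one pass. It is only an illustration, not what we actually ran: the helper names (`find_signatures`, `run`) are made up for this post, and it assumes the gridd messages land in the systemd journal under a unit called `nvidia-gridd`, which may not match your guest. The two error strings are the ones quoted above.

```python
import shutil
import subprocess

# Error strings quoted from the symptoms described in this post.
SIGNATURES = {
    "cuda_busy": "all CUDA-capable devices are busy or unavailable",
    "gridd_rm_client": "failed to initialize RM client",
}


def find_signatures(text):
    """Return the sorted names of the known error signatures found in text."""
    lowered = text.lower()
    return sorted(
        name for name, needle in SIGNATURES.items() if needle.lower() in lowered
    )


def run(cmd):
    """Run a command and return stdout + stderr, or a note if the tool is missing."""
    if shutil.which(cmd[0]) is None:
        return f"{cmd[0]}: not found"
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout + result.stderr


if __name__ == "__main__":
    outputs = {
        # Full driver/GPU report from inside the guest.
        "nvidia-smi": run(["nvidia-smi", "-q"]),
        # Assumption: gridd logs to the journal under the nvidia-gridd unit.
        "gridd journal": run(
            ["journalctl", "-u", "nvidia-gridd", "--no-pager", "-n", "200"]
        ),
    }
    for label, text in outputs.items():
        print(f"== {label} ==")
        print("matched signatures:", find_signatures(text))
```

Running it on both guests (and again inside the container for the nvidia-smi part) right after a reboot would at least show whether the two VMs fail in the same way.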