Hi, we’re running CUDA in containers with libnvidia-container on Ubuntu 18.04. Things generally work very well, but occasionally we hit a situation where the driver appears to just disappear. When we run nvidia-smi on a machine that previously had a job running, it reports:
└──> $ nvidia-smi
NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
The driver was installed with dpkg -i on the .deb file weeks before this happened. There’s nothing in the apt history showing that someone uninstalled it, and there shouldn’t be a way for users to do that anyway. The only way I found to reproduce the error easily was to shut the machine down and install new cards; when it came back up, nvidia-smi reported the same failure until the driver was reinstalled. In the case above, however, this happened during normal runs and the system was never shut down.
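For anyone hitting the same thing, this is a sketch of the checks I plan to run the next time it happens, before reinstalling: whether the nvidia kernel module is still loaded, whether the /dev/nvidia* device nodes still exist, and whether the kernel log shows NVRM/Xid errors. It assumes the standard Ubuntu .deb driver install; adjust as needed.

```shell
#!/bin/sh
# 1. Is the nvidia kernel module still loaded?
if lsmod | grep -q '^nvidia'; then
    module_state="loaded"
else
    module_state="not loaded"
fi
echo "nvidia kernel module: $module_state"

# 2. Do the device nodes still exist?
ls /dev/nvidia* >/dev/null 2>&1 \
    && echo "/dev/nvidia* device nodes: present" \
    || echo "/dev/nvidia* device nodes: missing"

# 3. Any driver (NVRM) or Xid errors in the kernel log?
#    (dmesg may require root depending on kernel.dmesg_restrict)
dmesg 2>/dev/null | grep -iE 'nvrm|xid' | tail -n 20 || true
```

If the module is gone from lsmod while nothing was uninstalled, something unloaded it at runtime; Xid messages in dmesg would instead point at a GPU fault with the driver still in place.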