Driver Seems to Disappear (Containers)

Hi, we’re running CUDA in containers via libnvidia-container on Ubuntu 18.04. Things are working very well, but occasionally the driver appears to just disappear. When we run nvidia-smi on a machine that previously had a job running, it reports:

└──> $ nvidia-smi
NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

The driver was installed with dpkg -i from the .deb file weeks before this happened. Nothing in the apt history shows anyone uninstalling it, and users shouldn’t be able to do that anyway. The only way I found to reproduce it easily was to shut down the machine and install new cards; when it came back up, this error was reported until the same driver was reinstalled. In the case above, however, it happened during normal runs and the system was not shut down.

It’s not clear if these observations are referring to what is seen inside a container, or what is seen on the base machine.

I assume this is referring to the base machine. If that is the case, one typical cause is updates being applied to the kernel or to other software stacks that the driver depends on. These will often break the current driver install unless DKMS is properly set up.

Since you say the system was not shut down, that wouldn’t seem to apply (unless there’s any uncertainty there).
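
If it helps to rule the kernel-update case in or out, something along these lines could be run on the base machine. It is only a rough sketch: it assumes the kernel module is named nvidia, that dpkg logs to the Ubuntu default /var/log/dpkg.log, and the helper names running_kernel and module_built_for are made up for illustration.

```shell
#!/bin/sh
# Sketch: was the machine rebooted, and is there an nvidia module for the running kernel?
# Assumptions: module name "nvidia", dpkg log at /var/log/dpkg.log (Ubuntu default).

# Print the release of the kernel that is currently running.
running_kernel() {
    uname -r
}

# Return 0 if modinfo can find a module named "nvidia" for the given kernel release.
module_built_for() {
    modinfo -k "$1" nvidia >/dev/null 2>&1
}

# Boot time, to confirm whether the machine really stayed up.
echo "booted at: $(uptime -s 2>/dev/null || echo unknown)"

kernel="$(running_kernel)"
if module_built_for "$kernel"; then
    echo "nvidia module found for running kernel $kernel"
else
    echo "no nvidia module found for running kernel $kernel"
fi

# If DKMS is installed, list the modules it manages and the kernels they were built for.
if command -v dkms >/dev/null 2>&1; then
    dkms status
else
    echo "dkms not installed"
fi

# Kernel packages installed or upgraded according to the current dpkg log.
if [ -r /var/log/dpkg.log ]; then
    grep -E ' (install|upgrade) linux-image' /var/log/dpkg.log \
        || echo "no kernel package changes in current dpkg.log"
fi
```

A boot time later than the last known-good run, or a missing module for the running kernel, would point toward the kernel-update explanation rather than something happening during normal runs.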

When the system gets in the failed state, I usually find a run of

sudo dmesg | grep NVRM

to be helpful.
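
Since the failure is intermittent, it may be worth capturing a bit more state in one go the next time it happens. The following is a minimal sketch, not an official tool: the output file name is hypothetical, and dmesg may need sudo depending on how the system is configured.

```shell
#!/bin/sh
# Sketch: snapshot driver-related state to a file when nvidia-smi fails.

# Hypothetical output location; change as needed.
out="${TMPDIR:-/tmp}/nvidia-state-$(date +%Y%m%d-%H%M%S).txt"

{
    echo "== date =="
    date
    echo "== running kernel =="
    uname -r
    echo "== loaded nvidia modules (/proc/modules) =="
    grep -i '^nvidia' /proc/modules 2>/dev/null || echo "(none loaded)"
    echo "== /proc/driver/nvidia/version =="
    cat /proc/driver/nvidia/version 2>/dev/null || echo "(not present)"
    echo "== device nodes =="
    ls -l /dev/nvidia* 2>/dev/null || echo "(no /dev/nvidia* nodes)"
    echo "== kernel log, NVRM lines =="
    dmesg 2>/dev/null | grep NVRM \
        || echo "(no NVRM lines, or dmesg not readable; try sudo)"
} > "$out"

echo "wrote $out"
```

Whether the module is still loaded and whether /proc/driver/nvidia/version still exists helps separate a driver that was unloaded or replaced from one that is loaded but failing.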

If this is referring to a loss of the driver on the base machine, then I doubt it has anything to do with containers or libnvidia-container, but you could also ask your question on the container runtimes forum:

https://devtalk.nvidia.com/default/board/316/nvidia-container-runtimes/

Hi Robert, I forgot to mention it, but this is on the base machine. I don’t believe the server was shut down when this happened, but I will monitor it next time to make sure. You are right; I didn’t think this had anything to do with containers, but I mentioned them in case there was some dependency I didn’t know about. Next time it happens I will run that command and get back to you. Thanks.