Vanishing CUDA/nVidia drivers on Ubuntu 18.04 Server


I could not find any information on this topic that is relevant to my graphics card (nVidia Quadro RTX 6000). I have issues with the nVidia/CUDA drivers on my computing platform (HPE ML350 Gen10): they tend to “vanish”. For example, yesterday I was running computations on all 4 graphics cards when the platform rebooted on its own. After the restart, nvidia-smi gives me an error message saying that it could not communicate with the nVidia card because the drivers are missing. It also happened once before, when I changed the platform’s computing profile in the BIOS.
I downloaded and installed the newest nVidia drivers and CUDA-tools from the nVidia website according to instructions from nVidia support. I’d be grateful for any help.

I suggest that the spontaneous reboot is something that you should address with HPE.

Regarding the drivers being gone after a reboot, one problem could be that you have Linux kernel updates (or just Linux updates) turned on. The driver's kernel module is built against a specific kernel version, so if you allow the kernel to be updated and haven’t taken steps to address this, the next boot into the new kernel will break your driver install, necessitating a reinstall of the driver - even the same driver.
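To check whether this is what happened, you can look for an NVIDIA module built for the kernel that is currently running. Here is a minimal sketch in Python; the function name `find_nvidia_modules` is my own, and it assumes modules live under /lib/modules/<kernel release> and have file names starting with "nvidia" (e.g. nvidia.ko, possibly compressed), which may not hold on every install.

```python
import os
import platform


def find_nvidia_modules(release=None, modules_root="/lib/modules"):
    """Return paths of NVIDIA kernel module files found for a kernel release.

    Hypothetical helper: the directory layout and the "nvidia*" file name
    pattern are assumptions about a typical Linux driver install.
    """
    release = release or platform.release()
    base = os.path.join(modules_root, release)
    found = []
    # os.walk yields nothing if the directory does not exist.
    for dirpath, _dirnames, filenames in os.walk(base):
        for name in filenames:
            # Matches nvidia.ko, nvidia-uvm.ko, and compressed variants
            # such as nvidia.ko.xz.
            if name.startswith("nvidia") and ".ko" in name:
                found.append(os.path.join(dirpath, name))
    return sorted(found)


if __name__ == "__main__":
    release = platform.release()
    modules = find_nvidia_modules(release)
    if modules:
        print(f"NVIDIA modules found for kernel {release}:")
        for path in modules:
            print("  " + path)
    else:
        print(f"No NVIDIA module found for kernel {release}.")
```

If nothing is found for the running kernel but modules exist under an older kernel's directory, a kernel update is the likely culprit.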

One method to work around this is to stop updates or stop kernel updates.

Another method is to put in place a system such as DKMS (Dynamic Kernel Module Support), which rebuilds the driver module automatically when a new kernel is installed. I won’t be able to give you a tutorial on DKMS here, but Google should be able to help.

Hi Robert, thanks for your fast reply. So in your opinion, the reboot during calculations was not caused by nVidia drivers and was not related to the lack of them later…?

I don’t know what is going on exactly. I don’t know of a way that NVIDIA drivers can get a Linux OS to reboot. I suppose anything is possible (I don’t know what I don’t know). But even if the issue ultimately gets connected to NVIDIA software, I believe the right way to handle it is via HPE. You’re welcome to do as you wish, of course. I’ve edited my previous statement to say “I suggest…”

OK, thank you for the help. I will contact HPE support as you suggested. Cheers!