[Solved] Issue with nvidia driver in Linux Ubuntu 18.04 and CUDA 11.4

Hello,

I have an Ubuntu machine in GCP with a Tesla K80. It was working fine, but yesterday I noticed it had stopped working for some reason. I've already tried everything: re-installing other CUDA versions, reading countless posts, but nothing helped…

Some outputs:

$ sudo prime-select query
nvidia

$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

$ dkms status
nvidia, 465.19.01: added

$ grep nvidia /etc/modprobe.d/* /lib/modprobe.d/*
/etc/modprobe.d/blacklist-framebuffer.conf:blacklist nvidiafb
/etc/modprobe.d/nvidia-installer-disable-nouveau.conf:# generated by nvidia-installer

$ sudo modprobe nvidia
modprobe: FATAL: Module nvidia not found in directory /lib/modules/5.4.0-1055-gcp

$ systemctl status nvidia-persistenced
● nvidia-persistenced.service - NVIDIA Persistence Daemon
   Loaded: loaded (/lib/systemd/system/nvidia-persistenced.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Thu 2021-10-28 18:11:35 UTC; 40min ago
  Process: 1893 ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced/* (code=exited, status=0/SUCCESS)
  Process: 1879 ExecStart=/usr/bin/nvidia-persistenced --verbose (code=exited, status=1/FAILURE)

Oct 28 18:11:35 train-ia systemd[1]: nvidia-persistenced.service: Failed with result 'exit-code'.
Oct 28 18:11:35 train-ia systemd[1]: Failed to start NVIDIA Persistence Daemon.
Oct 28 18:11:35 train-ia systemd[1]: nvidia-persistenced.service: Service hold-off time over, scheduling restart.
Oct 28 18:11:35 train-ia systemd[1]: nvidia-persistenced.service: Scheduled restart job, restart counter is at 5.
Oct 28 18:11:35 train-ia systemd[1]: Stopped NVIDIA Persistence Daemon.
Oct 28 18:11:35 train-ia systemd[1]: nvidia-persistenced.service: Start request repeated too quickly.
Oct 28 18:11:35 train-ia systemd[1]: nvidia-persistenced.service: Failed with result 'exit-code'.
Oct 28 18:11:35 train-ia systemd[1]: Failed to start NVIDIA Persistence Daemon.

$ sudo mokutil --sb-state
SecureBoot disabled

I’m also uploading the log file; hope you guys can help.
nvidia-bug-report.log.gz (714.4 KB)

You switched to gcc 5.5, but the kernel was compiled with gcc 7.5, so on the kernel update the driver failed to compile. Please switch back to gcc 7.5 and reinstall the driver from the repo.
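
Something along these lines should do it (assuming gcc is managed via update-alternatives and the driver comes from the Ubuntu/graphics-drivers repo; nvidia-driver-470 is just an example name, use whichever driver package your repo provides):

$ sudo apt-get install gcc-7 g++-7
$ sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-7 70 --slave /usr/bin/g++ g++ /usr/bin/g++-7
$ sudo update-alternatives --config gcc   # pick gcc-7
$ gcc --version                           # should now report 7.5.x
# reinstall the driver from the repo so DKMS rebuilds the module against the running kernel
$ sudo apt-get install --reinstall nvidia-driver-470   # example package name
$ dkms status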

Thank you so much for your quick reply!

I switched the gcc version to 7.5 and re-installed the driver from the repo; it compiled and installed successfully.

$ gcc --version
gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ dkms status
nvidia, 495.29.05, 5.4.0-1055-gcp, x86_64: installed

But it still wasn’t working, so I tried restarting nvidia-persistenced.service and, from the journal output below, realized I should install the 470-branch driver instead:

$ journalctl -xe
-- 
-- Unit nvidia-persistenced.service has begun starting up.
Oct 29 12:16:49 train nvidia-persistenced[3001]: Verbose syslog connection opened
Oct 29 12:16:49 train nvidia-persistenced[3001]: Started (3001)
Oct 29 12:16:50 train kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 239
Oct 29 12:16:50 train kernel: NVRM: The NVIDIA Tesla K80 GPU installed in this system is
                                 NVRM:  supported through the NVIDIA 470.xx Legacy drivers. Please
                                 NVRM:  visit http://www.nvidia.com/object/unix.html for more
                                 NVRM:  information.  The 495.29.05 NVIDIA driver will ignore
                                 NVRM:  this GPU.  Continuing probe...
Oct 29 12:16:50 train kernel: NVRM: No NVIDIA GPU found.
Oct 29 12:16:50 train kernel: nvidia-nvlink: Unregistered the Nvlink Core, major device number 239
Oct 29 12:16:50 train nvidia-persistenced[3001]: Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user 0 has read and write permissions for those files.
Oct 29 12:16:50 train nvidia-persistenced[3001]: PID file unlocked.
Oct 29 12:16:50 train nvidia-persistenced[2997]: nvidia-persistenced failed to initialize. Check syslog for more details.
Oct 29 12:16:50 train nvidia-persistenced[3001]: PID file closed.
Oct 29 12:16:50 train systemd[1]: nvidia-persistenced.service: Control process exited, code=exited status=1
Oct 29 12:16:50 train nvidia-persistenced[3001]: Shutdown (3001)
Oct 29 12:16:50 train systemd[1]: nvidia-persistenced.service: Failed with result 'exit-code'.
Oct 29 12:16:50 train systemd[1]: Failed to start NVIDIA Persistence Daemon.
-- Subject: Unit nvidia-persistenced.service has failed
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
-- 
-- Unit nvidia-persistenced.service has failed.
-- 
-- The result is RESULT.
Oct 29 12:16:50 train systemd[1]: nvidia-persistenced.service: Service hold-off time over, scheduling restart.

So I ran sudo apt-get install cuda-drivers-470 and everything works fine now.
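
For anyone else who lands here with a K80 on a newer driver branch, roughly the sequence is (a reboot may not be strictly needed if no old module is loaded):

$ sudo apt-get install cuda-drivers-470
$ sudo reboot
# after reboot, verify the module is built for the running kernel and the GPU is visible
$ dkms status
$ nvidia-smi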

Thanks again, saved my day!