Ubuntu 18.04 NVIDIA driver not loaded after GCC update

Hi, I am opening this new thread per GeneriX's suggestion. The workstation that is having trouble loading the NVIDIA driver runs 18.04.6 LTS (Bionic Beaver). The problem surfaced a few days ago when a user upgraded GCC; since then the NVIDIA driver does not load.

This is a Lambda Labs workstation, and I have tried uninstalling and reinstalling the Lambda Stack for deep learning, which is supposed to cover the NVIDIA drivers along with the deep learning components such as CUDA, TensorFlow, and PyTorch.

I have been researching this issue for a couple of days, and any help would be greatly appreciated. I have checked that Secure Boot is disabled and that nvidia is not blacklisted in modprobe.d, and I have tried other suggestions I found in other posts, but nothing has worked so far. As far as I can tell, the compiled driver is 515.65.01 and GCC is 9.4.0:

NVRM version: NVIDIA UNIX x86_64 Kernel Module 515.65.01 Wed Jul 20 14:00:58 UTC 2022
GCC version: gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~18.04)
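For reference, the checks mentioned above can be run roughly like this (a sketch; mokutil may not be installed on every box, and the two directories are the standard Ubuntu modprobe locations):

```shell
# Verify Secure Boot state; fall back to a message if mokutil is absent.
mokutil --sb-state 2>/dev/null || echo "mokutil not available"
# -s silences missing-directory errors; no match means no blacklist entry.
grep -rhs '^blacklist.*nvidia' /etc/modprobe.d /lib/modprobe.d \
  || echo "no nvidia blacklist entries found"
```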

Here is the output from /var/log/gpu-manager.log:

log_file: /var/log/gpu-manager.log
last_boot_file: /var/lib/ubuntu-drivers-common/last_gfx_boot
new_boot_file: /var/lib/ubuntu-drivers-common/last_gfx_boot
can't access /opt/amdgpu-pro/bin/amdgpu-pro-px
Looking for nvidia modules in /lib/modules/5.0.0-37-generic/updates/dkms
Found nvidia module: nvidia.ko
Looking for amdgpu modules in /lib/modules/5.0.0-37-generic/updates/dkms
Is nvidia loaded? yes
Was nvidia unloaded? no
Is nvidia blacklisted? no
Is intel loaded? no
Is radeon loaded? no
Is radeon blacklisted? no
Is amdgpu loaded? no
Is amdgpu blacklisted? no
Is amdgpu versioned? no
Is amdgpu pro stack? no
Is nouveau loaded? no
Is nouveau blacklisted? yes
Is nvidia kernel module available? yes
Is amdgpu kernel module available? no
Vendor/Device Id: 10de:1e04
BusID "PCI:104@0:0:0"
Is boot vga? yes
Vendor/Device Id: 10de:1e04
BusID "PCI:26@0:0:0"
Is boot vga? no
can't access /etc/u-d-c-nvidia-runtimepm-override file
Found json file: /usr/share/doc/nvidia-driver-495-server/supported-gpus.json
File /usr/share/doc/nvidia-driver-495-server/supported-gpus.json not found
Is nvidia runtime pm supported for "0x1e04"? yes
Trying to create new file: /run/nvidia_runtimepm_supported
Checking power status in /proc/driver/nvidia/gpus/0000:1a:00.0/power
Runtime D3 status: ?
Is nvidia runtime pm enabled for "0x1e04"? no
Vendor/Device Id: 10de:1e04
BusID "PCI:25@0:0:0"
Is boot vga? no
can't access /etc/u-d-c-nvidia-runtimepm-override file
Found json file: /usr/share/doc/nvidia-driver-495-server/supported-gpus.json
File /usr/share/doc/nvidia-driver-495-server/supported-gpus.json not found
Is nvidia runtime pm supported for "0x1e04"? yes
Trying to create new file: /run/nvidia_runtimepm_supported
Checking power status in /proc/driver/nvidia/gpus/0000:19:00.0/power
Runtime D3 status: Disabled by default
Is nvidia runtime pm enabled for "0x1e04"? no
Vendor/Device Id: 10de:1e04
BusID "PCI:103@0:0:0"
Is boot vga? no
can't access /etc/u-d-c-nvidia-runtimepm-override file
Found json file: /usr/share/doc/nvidia-driver-495-server/supported-gpus.json
File /usr/share/doc/nvidia-driver-495-server/supported-gpus.json not found
Is nvidia runtime pm supported for "0x1e04"? yes
Trying to create new file: /run/nvidia_runtimepm_supported
Checking power status in /proc/driver/nvidia/gpus/0000:67:00.0/power
Runtime D3 status: ?
Is nvidia runtime pm enabled for "0x1e04"? no
Skipping "/dev/dri/card3", driven by "nvidia-drm"
Skipping "/dev/dri/card2", driven by "nvidia-drm"
Skipping "/dev/dri/card1", driven by "nvidia-drm"
Skipping "/dev/dri/card0", driven by "nvidia-drm"
Skipping "/dev/dri/card3", driven by "nvidia-drm"
Skipping "/dev/dri/card2", driven by "nvidia-drm"
Skipping "/dev/dri/card1", driven by "nvidia-drm"
Skipping "/dev/dri/card0", driven by "nvidia-drm"
Skipping "/dev/dri/card3", driven by "nvidia-drm"
Skipping "/dev/dri/card2", driven by "nvidia-drm"
Skipping "/dev/dri/card1", driven by "nvidia-drm"
Skipping "/dev/dri/card0", driven by "nvidia-drm"
Skipping "/dev/dri/card3", driven by "nvidia-drm"
Skipping "/dev/dri/card2", driven by "nvidia-drm"
Skipping "/dev/dri/card1", driven by "nvidia-drm"
Skipping "/dev/dri/card0", driven by "nvidia-drm"
Does it require offloading? no
last cards number = 4
Has amd? no
Has intel? no
Has nvidia? yes
How many cards? 4
Has the system changed? No
Unsupported discrete card vendor: 10de
Nothing to do

Here is the output from ubuntu-drivers devices:
WARNING:root:_pkg_get_support nvidia-driver-515-server: package has invalid Support PBheader, cannot determine support level
WARNING:root:_pkg_get_support nvidia-driver-510-server: package has invalid Support PBheader, cannot determine support level
WARNING:root:_pkg_get_support nvidia-driver-515: package has invalid Support PBheader, cannot determine support level
== /sys/devices/pci0000:16/0000:16:00.0/0000:17:00.0/0000:18:10.0/0000:1a:00.0 ==
modalias : pci:v000010DEd00001E04sv00001462sd00003712bc03sc00i00
vendor : NVIDIA Corporation
driver : nvidia-driver-418-server - distro non-free
driver : nvidia-driver-515-server - distro non-free
driver : nvidia-driver-470-server - distro non-free
driver : nvidia-driver-450-server - distro non-free
driver : nvidia-driver-520 - distro non-free recommended
driver : nvidia-driver-510-server - distro non-free
driver : nvidia-driver-515 - third-party non-free
driver : xserver-xorg-video-nouveau - distro free builtin

Here is the output from modinfo nvidia

filename: /lib/modules/5.0.0-37-generic/updates/dkms/nvidia.ko
firmware: nvidia/515.65.01/gsp.bin
alias: char-major-195-*
version: 515.65.01
supported: external
license: NVIDIA
srcversion: 8049D44E2C1B08F41E1B8A6
alias: pci:v000010DEdsvsdbc06sc80i00
alias: pci:v000010DEdsvsdbc03sc02i00
alias: pci:v000010DEdsvsdbc03sc00i00
depends: drm
retpoline: Y
name: nvidia
vermagic: 5.0.0-37-generic SMP mod_unload
parm: NvSwitchRegDwords:NvSwitch regkey (charp)
parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
parm: NVreg_ResmanDebugLevel:int
parm: NVreg_RmLogonRC:int
parm: NVreg_ModifyDeviceFiles:int
parm: NVreg_DeviceFileUID:int
parm: NVreg_DeviceFileGID:int
parm: NVreg_DeviceFileMode:int
parm: NVreg_InitializeSystemMemoryAllocations:int
parm: NVreg_UsePageAttributeTable:int
parm: NVreg_EnablePCIeGen3:int
parm: NVreg_EnableMSI:int
parm: NVreg_TCEBypassMode:int
parm: NVreg_EnableStreamMemOPs:int
parm: NVreg_RestrictProfilingToAdminUsers:int
parm: NVreg_PreserveVideoMemoryAllocations:int
parm: NVreg_EnableS0ixPowerManagement:int
parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int
parm: NVreg_DynamicPowerManagement:int
parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int
parm: NVreg_EnableGpuFirmware:int
parm: NVreg_EnableGpuFirmwareLogs:int
parm: NVreg_OpenRmEnableUnsupportedGpus:int
parm: NVreg_EnableUserNUMAManagement:int
parm: NVreg_MemoryPoolSize:int
parm: NVreg_KMallocHeapMaxSize:int
parm: NVreg_VMallocHeapMaxSize:int
parm: NVreg_IgnoreMMIOCheck:int
parm: NVreg_NvLinkDisable:int
parm: NVreg_EnablePCIERelaxedOrderingMode:int
parm: NVreg_RegisterPCIDriver:int
parm: NVreg_EnableDbgBreakpoint:int
parm: NVreg_RegistryDwords:charp
parm: NVreg_RegistryDwordsPerDevice:charp
parm: NVreg_RmMsg:charp
parm: NVreg_GpuBlacklist:charp
parm: NVreg_TemporaryFilePath:charp
parm: NVreg_ExcludedGpus:charp
parm: NVreg_DmaRemapPeerMmio:int
parm: rm_firmware_active:charp

Here is the output from systemctl status nvidia-persistenced

● nvidia-persistenced.service - NVIDIA Persistence Daemon
Loaded: loaded (/lib/systemd/system/nvidia-persistenced.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Thu 2022-10-27 15:15:29 PDT; 18h ago
Process: 1699 ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced (code=exited, status=0/SUCCESS)
Process: 1697 ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenced --persistence-mode --verbose (code=exited, status=1/FAILURE)

Oct 27 15:15:29 arvand.usc.edu nvidia-persistenced[1698]: Started (1698)
Oct 27 15:15:29 arvand.usc.edu nvidia-persistenced[1697]: nvidia-persistenced failed to initialize. Check syslog for more details.
Oct 27 15:15:29 arvand.usc.edu nvidia-persistenced[1698]: Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user 122 has read and write permissions for those files.
Oct 27 15:15:29 arvand.usc.edu systemd[1]: nvidia-persistenced.service: Control process exited, code=exited status=1
Oct 27 15:15:29 arvand.usc.edu nvidia-persistenced[1698]: PID file unlocked.
Oct 27 15:15:29 arvand.usc.edu nvidia-persistenced[1698]: PID file closed.
Oct 27 15:15:29 arvand.usc.edu nvidia-persistenced[1698]: The daemon no longer has permission to remove its runtime data directory /var/run/nvidia-persistenced
Oct 27 15:15:29 arvand.usc.edu nvidia-persistenced[1698]: Shutdown (1698)
Oct 27 15:15:29 arvand.usc.edu systemd[1]: nvidia-persistenced.service: Failed with result 'exit-code'.
Oct 27 15:15:29 arvand.usc.edu systemd[1]: Failed to start NVIDIA Persistence Daemon.

GeneriX pointed out that it is a permission issue; any further assistance would be greatly appreciated.

Thanks,

Here is the bug report
nvidia-bug-report.log (931.6 KB)

As said, the nvidia driver is loading fine, but neither nvidia-smi nor the DDX can make use of it.
Please post the output of
ls -l /dev/nvidia*

Yes, you did; if you could point out how you identified the problem from the logs/outputs I uploaded, that would be greatly appreciated.

Here is the output you requested:

crw-rw-rw- 1 root root 510, 0 Oct 27 15:15 /dev/nvidia-uvm
crw-rw-rw- 1 root root 510, 1 Oct 27 15:15 /dev/nvidia-uvm-tools

I can see from a similar system (below) that drivers and additional modules are loaded.

crw-rw-rw- 1 root nvidia0 195, 0 Oct 4 12:13 /dev/nvidia0
crw-rw-rw- 1 root nvidia1 195, 1 Oct 4 12:13 /dev/nvidia1
crw-rw-rw- 1 root nvidia2 195, 2 Oct 4 12:13 /dev/nvidia2
crw-rw-rw- 1 root nvidia3 195, 3 Oct 4 12:13 /dev/nvidia3
crw-rw-rw- 1 root root 195, 255 Oct 4 12:13 /dev/nvidiactl
crw-rw-rw- 1 root root 195, 254 Oct 4 12:13 /dev/nvidia-modeset
crw-rw-rw- 1 root root 236, 0 Oct 4 12:13 /dev/nvidia-uvm
crw-rw-rw- 1 root root 236, 1 Oct 4 12:13 /dev/nvidia-uvm-tools

/dev/nvidia-caps:
total 0

What can I do to fix this?

Thanks,

Please check if you have any unusual NVreg module parameters set, especially NVreg_ModifyDeviceFiles
grep nvidia /etc/modprobe.d/* /lib/modprobe.d/*
and try to set it as a kernel parameter
nvidia.NVreg_ModifyDeviceFiles=1
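For anyone following along, one way to make such a kernel parameter persistent is via GRUB. The sketch below demonstrates the edit on a temporary copy of a sample GRUB_CMDLINE_LINUX_DEFAULT line; on the real system you would edit /etc/default/grub as root and run sudo update-grub afterwards:

```shell
# Append nvidia.NVreg_ModifyDeviceFiles=1 to GRUB_CMDLINE_LINUX_DEFAULT.
# Demonstrated on a temporary file with a sample line; on the real box
# edit /etc/default/grub and then run: sudo update-grub
tmp=$(mktemp)
printf 'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n' > "$tmp"
sed -i 's/^\(GRUB_CMDLINE_LINUX_DEFAULT="[^"]*\)"/\1 nvidia.NVreg_ModifyDeviceFiles=1"/' "$tmp"
cat "$tmp"
rm -f "$tmp"
```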

Thanks for helping on a Sunday.

Here is the result when I run grep nvidia /etc/modprobe.d/* /lib/modprobe.d/*

/etc/modprobe.d/50-nvidia.conf:options nvidia NVreg_DeviceFileUID=0 NVreg_DeviceFileGID=0 NVreg_DeviceFileMode=0777 NVreg_ModifyDeviceFiles=0 NVreg_RestrictProfilingToAdminUsers=0
/etc/modprobe.d/blacklist-framebuffer.conf:blacklist nvidiafb
/etc/modprobe.d/nvidia-installer-disable-nouveau.conf:# generated by nvidia-installer
/lib/modprobe.d/nvidia-kms.conf:# This file was generated by nvidia-prime
/lib/modprobe.d/nvidia-kms.conf:options nvidia-drm modeset=1
/lib/modprobe.d/nvidia-runtimepm.conf:options nvidia "NVreg_DynamicPowerManagement=0x02"

Based on your previous comment, I need to set NVreg_ModifyDeviceFiles=1 in /etc/modprobe.d/50-nvidia.conf, correct?

If that is the case, do I need to run sudo update-initramfs -u to update the local image?

Also, here is the output from a similar box when I ran grep nvidia /etc/modprobe.d/* /lib/modprobe.d/* for your reference.

/etc/modprobe.d/50-nvidia.conf:options nvidia NVreg_DeviceFileUID=0 NVreg_DeviceFileGID=0 NVreg_DeviceFileMode=0777 NVreg_ModifyDeviceFiles=0 NVreg_RestrictProfilingToAdminUsers=0
/etc/modprobe.d/50-nvidia.conf:#install nvidia PATH=$PATH:/bin:/usr/bin; /sbin/modprobe --ignore-install nvidia; /sbin/modprobe nvidia_uvm; test -c /dev/nvidia-uvm || mknod -m 777 /dev/nvidia-uvm c $(cat /proc/devices | while read major device; do if [ "$device" == "nvidia-uvm" ]; then echo $major; break; fi ; done) 0 && chown :root /dev/nvidia-uvm; test -c /dev/nvidiactl || mknod -m 777 /dev/nvidiactl c 195 255 && chown :root /dev/nvidiactl; devid=-1; for dev in $(ls -d /sys/bus/pci/devices/*); do vendorid=$(cat $dev/vendor); if [ "$vendorid" == "0x10de" ]; then class=$(cat $dev/class); classid=${class%%00}; if [ "$classid" == "0x0300" -o "$classid" == "0x0302" ]; then devid=$((devid+1)); test -c /dev/nvidia${devid} || mknod -m 660 /dev/nvidia${devid} c 195 ${devid} && chown :nvidia${devid} /dev/nvidia${devid}; fi; fi; done
/etc/modprobe.d/blacklist-framebuffer.conf:blacklist nvidiafb
/etc/modprobe.d/nvidia-installer-disable-nouveau.conf:# generated by nvidia-installer

As you can see, the same parameter also appears in /etc/modprobe.d/50-nvidia.conf on that box and is likewise set to 0.

Thanks,

Please try setting it to 1 and run sudo update-initramfs -u afterwards.
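For anyone finding this thread later, the change can be made with sed; the sketch below demonstrates it on a temporary copy of the options line, since editing the real file needs root. On the actual system, edit /etc/modprobe.d/50-nvidia.conf and then run sudo update-initramfs -u:

```shell
# Flip NVreg_ModifyDeviceFiles from 0 to 1, shown on a temporary copy.
# On the real box:
#   sudo sed -i 's/NVreg_ModifyDeviceFiles=0/NVreg_ModifyDeviceFiles=1/' /etc/modprobe.d/50-nvidia.conf
#   sudo update-initramfs -u
conf=$(mktemp)
printf 'options nvidia NVreg_DeviceFileUID=0 NVreg_DeviceFileGID=0 NVreg_DeviceFileMode=0777 NVreg_ModifyDeviceFiles=0 NVreg_RestrictProfilingToAdminUsers=0\n' > "$conf"
sed -i 's/NVreg_ModifyDeviceFiles=0/NVreg_ModifyDeviceFiles=1/' "$conf"
cat "$conf"
rm -f "$conf"
```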

It is working now, thank you so much for your assistance!

Can you point out where you saw the problem, or is this just based on your experience?

Again, thank you!

dmesg showed the nvidia driver loading fine but nvidia-smi couldn’t find it, meaning there’s something wrong with the device files.
How the device files are created is often distro/version specific. So the second system might have an additional udev rule/script running.
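As an illustration of such creation logic (a sketch based on the commented-out install line in 50-nvidia.conf quoted earlier; it only prints the mknod commands it would run, using a sample /proc/devices entry, since actually creating the nodes needs root and a loaded driver):

```shell
# Derive the dynamic major number for nvidia-uvm and print the mknod
# commands that would recreate the device files. A sample /proc/devices
# entry is hard-coded so this runs without the driver loaded; on a live
# box use: grep nvidia-uvm /proc/devices
proc_devices='510 nvidia-uvm'
uvm_major=$(echo "$proc_devices" | awk '$2 == "nvidia-uvm" { print $1 }')
echo "mknod -m 666 /dev/nvidia-uvm c $uvm_major 0"
echo "mknod -m 666 /dev/nvidiactl c 195 255"   # nvidia devices use fixed major 195
for i in 0 1 2 3; do
  echo "mknod -m 666 /dev/nvidia$i c 195 $i"   # one node per GPU
done
```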

Thank you!