Hi, I am opening this new thread at GeneriX's suggestion. The workstation that is having trouble loading the NVIDIA driver is running Ubuntu 18.04.6 LTS (Bionic Beaver). The problem surfaced a few days ago when a user upgraded GCC; since then the NVIDIA driver does not load.
This is a Lambda Labs workstation, and I have tried uninstalling and reinstalling Lambda Stack, the deep learning bundle that is supposed to cover the NVIDIA drivers along with the deep learning modules such as CUDA, TensorFlow, and PyTorch.
I have been researching this issue for a couple of days, and any help would be greatly appreciated. I have checked that Secure Boot is disabled and that nvidia is not blacklisted in modprobe.d. I have also tried other suggestions I found in other posts, but nothing has worked so far. As far as I can tell, the compiled driver is 515.65.01 and GCC is 9.4.0:
NVRM version: NVIDIA UNIX x86_64 Kernel Module 515.65.01 Wed Jul 20 14:00:58 UTC 2022
GCC version: gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~18.04)
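In case it helps anyone compare versions across machines, here is a minimal sketch that pulls the driver and GCC versions out of the two lines above. This is my own illustration, not part of any NVIDIA tool; the function name `parse_nvidia_version` and the `SAMPLE` string are hypothetical, and the regular expressions only assume the line format shown in this post.

```python
import re

# The two lines quoted above, copied verbatim from this post.
SAMPLE = """\
NVRM version: NVIDIA UNIX x86_64 Kernel Module 515.65.01 Wed Jul 20 14:00:58 UTC 2022
GCC version: gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~18.04)
"""


def parse_nvidia_version(text):
    """Return (driver_version, gcc_version); either is None if not found."""
    # Dotted number that follows "Kernel Module" on the NVRM line.
    driver = re.search(r"Kernel Module\s+(\d+(?:\.\d+)+)", text)
    # Dotted number that follows the lowercase "gcc version" phrase.
    gcc = re.search(r"gcc version\s+(\d+(?:\.\d+)+)", text)
    return (
        driver.group(1) if driver else None,
        gcc.group(1) if gcc else None,
    )


print(parse_nvidia_version(SAMPLE))
```

The same function could be pointed at the text of the file these lines came from, instead of the pasted sample, to re-check after a rebuild.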
Here is the output from /var/log/gpu-manager.log:
log_file: /var/log/gpu-manager.log
last_boot_file: /var/lib/ubuntu-drivers-common/last_gfx_boot
new_boot_file: /var/lib/ubuntu-drivers-common/last_gfx_boot
can’t access /opt/amdgpu-pro/bin/amdgpu-pro-px
Looking for nvidia modules in /lib/modules/5.0.0-37-generic/updates/dkms
Found nvidia module: nvidia.ko
Looking for amdgpu modules in /lib/modules/5.0.0-37-generic/updates/dkms
Is nvidia loaded? yes
Was nvidia unloaded? no
Is nvidia blacklisted? no
Is intel loaded? no
Is radeon loaded? no
Is radeon blacklisted? no
Is amdgpu loaded? no
Is amdgpu blacklisted? no
Is amdgpu versioned? no
Is amdgpu pro stack? no
Is nouveau loaded? no
Is nouveau blacklisted? yes
Is nvidia kernel module available? yes
Is amdgpu kernel module available? no
Vendor/Device Id: 10de:1e04
BusID “PCI:104@0:0:0”
Is boot vga? yes
Vendor/Device Id: 10de:1e04
BusID “PCI:26@0:0:0”
Is boot vga? no
can’t access /etc/u-d-c-nvidia-runtimepm-override file
Found json file: /usr/share/doc/nvidia-driver-495-server/supported-gpus.json
File /usr/share/doc/nvidia-driver-495-server/supported-gpus.json not found
Is nvidia runtime pm supported for “0x1e04”? yes
Trying to create new file: /run/nvidia_runtimepm_supported
Checking power status in /proc/driver/nvidia/gpus/0000:1a:00.0/power
Runtime D3 status: ?
Is nvidia runtime pm enabled for “0x1e04”? no
Vendor/Device Id: 10de:1e04
BusID “PCI:25@0:0:0”
Is boot vga? no
can’t access /etc/u-d-c-nvidia-runtimepm-override file
Found json file: /usr/share/doc/nvidia-driver-495-server/supported-gpus.json
File /usr/share/doc/nvidia-driver-495-server/supported-gpus.json not found
Is nvidia runtime pm supported for “0x1e04”? yes
Trying to create new file: /run/nvidia_runtimepm_supported
Checking power status in /proc/driver/nvidia/gpus/0000:19:00.0/power
Runtime D3 status: Disabled by default
Is nvidia runtime pm enabled for “0x1e04”? no
Vendor/Device Id: 10de:1e04
BusID “PCI:103@0:0:0”
Is boot vga? no
can’t access /etc/u-d-c-nvidia-runtimepm-override file
Found json file: /usr/share/doc/nvidia-driver-495-server/supported-gpus.json
File /usr/share/doc/nvidia-driver-495-server/supported-gpus.json not found
Is nvidia runtime pm supported for “0x1e04”? yes
Trying to create new file: /run/nvidia_runtimepm_supported
Checking power status in /proc/driver/nvidia/gpus/0000:67:00.0/power
Runtime D3 status: ?
Is nvidia runtime pm enabled for “0x1e04”? no
Skipping “/dev/dri/card3”, driven by “nvidia-drm”
Skipping “/dev/dri/card2”, driven by “nvidia-drm”
Skipping “/dev/dri/card1”, driven by “nvidia-drm”
Skipping “/dev/dri/card0”, driven by “nvidia-drm”
Skipping “/dev/dri/card3”, driven by “nvidia-drm”
Skipping “/dev/dri/card2”, driven by “nvidia-drm”
Skipping “/dev/dri/card1”, driven by “nvidia-drm”
Skipping “/dev/dri/card0”, driven by “nvidia-drm”
Skipping “/dev/dri/card3”, driven by “nvidia-drm”
Skipping “/dev/dri/card2”, driven by “nvidia-drm”
Skipping “/dev/dri/card1”, driven by “nvidia-drm”
Skipping “/dev/dri/card0”, driven by “nvidia-drm”
Skipping “/dev/dri/card3”, driven by “nvidia-drm”
Skipping “/dev/dri/card2”, driven by “nvidia-drm”
Skipping “/dev/dri/card1”, driven by “nvidia-drm”
Skipping “/dev/dri/card0”, driven by “nvidia-drm”
Does it require offloading? no
last cards number = 4
Has amd? no
Has intel? no
Has nvidia? yes
How many cards? 4
Has the system changed? No
Unsupported discrete card vendor: 10de
Nothing to do
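Since the log above is long, here is a rough sketch of how the question/answer lines could be pulled out to read at a glance. It is only an illustration under the assumption that each such line has the form "question? answer" as in the output above; `parse_answers` and `QA_LINE` are hypothetical names, and the sample is a handful of lines copied from the log.

```python
import re

# A line counts as a question/answer pair if it contains "?" followed by
# whitespace and a non-empty answer (so "Runtime D3 status: ?" is skipped).
QA_LINE = re.compile(r"^(?P<question>.+\?)\s+(?P<answer>\S.*)$")


def parse_answers(log_text):
    """Return a list of (question, answer) tuples in log order."""
    pairs = []
    for line in log_text.splitlines():
        match = QA_LINE.match(line.strip())
        if match:
            pairs.append((match.group("question"), match.group("answer")))
    return pairs


sample = """\
Is nvidia loaded? yes
Is nvidia blacklisted? no
Is nouveau blacklisted? yes
Runtime D3 status: ?
How many cards? 4
"""

for question, answer in parse_answers(sample):
    print(f"{question:<28} {answer}")
```

A list of tuples is used rather than a dict because some questions, such as "Is boot vga?", repeat once per card.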
Here is the output from ubuntu-drivers devices:
WARNING:root:_pkg_get_support nvidia-driver-515-server: package has invalid Support PBheader, cannot determine support level
WARNING:root:_pkg_get_support nvidia-driver-510-server: package has invalid Support PBheader, cannot determine support level
WARNING:root:_pkg_get_support nvidia-driver-515: package has invalid Support PBheader, cannot determine support level
== /sys/devices/pci0000:16/0000:16:00.0/0000:17:00.0/0000:18:10.0/0000:1a:00.0 ==
modalias : pci:v000010DEd00001E04sv00001462sd00003712bc03sc00i00
vendor : NVIDIA Corporation
driver : nvidia-driver-418-server - distro non-free
driver : nvidia-driver-515-server - distro non-free
driver : nvidia-driver-470-server - distro non-free
driver : nvidia-driver-450-server - distro non-free
driver : nvidia-driver-520 - distro non-free recommended
driver : nvidia-driver-510-server - distro non-free
driver : nvidia-driver-515 - third-party non-free
driver : xserver-xorg-video-nouveau - distro free builtin
Here is the output from modinfo nvidia:
filename: /lib/modules/5.0.0-37-generic/updates/dkms/nvidia.ko
firmware: nvidia/515.65.01/gsp.bin
alias: char-major-195-*
version: 515.65.01
supported: external
license: NVIDIA
srcversion: 8049D44E2C1B08F41E1B8A6
alias: pci:v000010DEd*sv*sd*bc06sc80i00
alias: pci:v000010DEd*sv*sd*bc03sc02i00
alias: pci:v000010DEd*sv*sd*bc03sc00i00
depends: drm
retpoline: Y
name: nvidia
vermagic: 5.0.0-37-generic SMP mod_unload
parm: NvSwitchRegDwords:NvSwitch regkey (charp)
parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid…] (charp)
parm: NVreg_ResmanDebugLevel:int
parm: NVreg_RmLogonRC:int
parm: NVreg_ModifyDeviceFiles:int
parm: NVreg_DeviceFileUID:int
parm: NVreg_DeviceFileGID:int
parm: NVreg_DeviceFileMode:int
parm: NVreg_InitializeSystemMemoryAllocations:int
parm: NVreg_UsePageAttributeTable:int
parm: NVreg_EnablePCIeGen3:int
parm: NVreg_EnableMSI:int
parm: NVreg_TCEBypassMode:int
parm: NVreg_EnableStreamMemOPs:int
parm: NVreg_RestrictProfilingToAdminUsers:int
parm: NVreg_PreserveVideoMemoryAllocations:int
parm: NVreg_EnableS0ixPowerManagement:int
parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int
parm: NVreg_DynamicPowerManagement:int
parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int
parm: NVreg_EnableGpuFirmware:int
parm: NVreg_EnableGpuFirmwareLogs:int
parm: NVreg_OpenRmEnableUnsupportedGpus:int
parm: NVreg_EnableUserNUMAManagement:int
parm: NVreg_MemoryPoolSize:int
parm: NVreg_KMallocHeapMaxSize:int
parm: NVreg_VMallocHeapMaxSize:int
parm: NVreg_IgnoreMMIOCheck:int
parm: NVreg_NvLinkDisable:int
parm: NVreg_EnablePCIERelaxedOrderingMode:int
parm: NVreg_RegisterPCIDriver:int
parm: NVreg_EnableDbgBreakpoint:int
parm: NVreg_RegistryDwords:charp
parm: NVreg_RegistryDwordsPerDevice:charp
parm: NVreg_RmMsg:charp
parm: NVreg_GpuBlacklist:charp
parm: NVreg_TemporaryFilePath:charp
parm: NVreg_ExcludedGpus:charp
parm: NVreg_DmaRemapPeerMmio:int
parm: rm_firmware_active:charp
Here is the output from systemctl status nvidia-persistenced:
● nvidia-persistenced.service - NVIDIA Persistence Daemon
   Loaded: loaded (/lib/systemd/system/nvidia-persistenced.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Thu 2022-10-27 15:15:29 PDT; 18h ago
  Process: 1699 ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced (code=exited, status=0/SUCCESS)
  Process: 1697 ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenced --persistence-mode --verbose (code=exited, status=1/FAILURE)
Oct 27 15:15:29 arvand.usc.edu nvidia-persistenced[1698]: Started (1698)
Oct 27 15:15:29 arvand.usc.edu nvidia-persistenced[1697]: nvidia-persistenced failed to initialize. Check syslog for more details.
Oct 27 15:15:29 arvand.usc.edu nvidia-persistenced[1698]: Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user 122 has read and write permissions for those files.
Oct 27 15:15:29 arvand.usc.edu systemd[1]: nvidia-persistenced.service: Control process exited, code=exited status=1
Oct 27 15:15:29 arvand.usc.edu nvidia-persistenced[1698]: PID file unlocked.
Oct 27 15:15:29 arvand.usc.edu nvidia-persistenced[1698]: PID file closed.
Oct 27 15:15:29 arvand.usc.edu nvidia-persistenced[1698]: The daemon no longer has permission to remove its runtime data directory /var/run/nvidia-persistenced
Oct 27 15:15:29 arvand.usc.edu nvidia-persistenced[1698]: Shutdown (1698)
Oct 27 15:15:29 arvand.usc.edu systemd[1]: nvidia-persistenced.service: Failed with result ‘exit-code’.
Oct 27 15:15:29 arvand.usc.edu systemd[1]: Failed to start NVIDIA Persistence Daemon.
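The "Failed to query NVIDIA devices" message above asks whether the /dev/nvidia* files exist and are readable and writable. Below is a minimal sketch of how that could be checked from a script. It is an illustration only: `describe_device_nodes` is a hypothetical helper, and treating world read/write (mode 0666) as the expected state is my assumption, not something confirmed by the logs.

```python
import glob
import os
import stat


def describe_device_nodes(pattern="/dev/nvidia*"):
    """Return (path, mode, uid, gid, world_rw) for every path matching pattern."""
    rows = []
    for path in sorted(glob.glob(pattern)):
        st = os.stat(path)
        mode = stat.S_IMODE(st.st_mode)  # permission bits only, e.g. 0o666
        # Assumption: the nodes should be readable and writable by "other".
        world_rw = (mode & 0o006) == 0o006
        rows.append((path, mode, st.st_uid, st.st_gid, world_rw))
    return rows


if __name__ == "__main__":
    nodes = describe_device_nodes()
    if not nodes:
        print("no files match /dev/nvidia*")
    for path, mode, uid, gid, world_rw in nodes:
        print(f"{path} mode={mode:04o} uid={uid} gid={gid} world_rw={world_rw}")
```

An empty result would point at missing device files rather than permissions, which is the other possibility the daemon message mentions.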
GeneriX pointed out that this is a permission issue. Any further assistance would be greatly appreciated.
Thanks,