New 455.32 driver install crashes Ubuntu 20.04: cannot boot, recovery mode no longer accessible (RTX 2080 Ti, kernel 5.4.0-56, AMD Ryzen 9)

I have spent several days trying to get a normal boot from a fresh install of Ubuntu 20.04 with either CUDA 11 (which packages nvidia-driver-455.32) or nvidia-driver-455.32 alone. I have followed dozens of related posts on this forum; no existing solution works, and none of the errors or logs described there match my behavior exactly. Nouveau is blacklisted in /etc/. Secure Boot is disabled. I have installed the driver from the GUI, from the PPA, with ubuntu-drivers autoinstall, and together with CUDA 11. I need to use CUDA 10 or 11, so I can’t revert to older drivers that don’t meet the CUDA version requirements. There are no nvidia blacklist files in /etc/modprobe.d or /lib/modprobe.d.
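For anyone wanting to repeat the blacklist check described above, a minimal sketch follows. The function name `list_gpu_blacklists` is made up for illustration, and the default directories are simply the two named in the post; other locations are not searched.

```shell
# List modprobe "blacklist" entries that mention nouveau or nvidia.
# With no arguments, searches the two directories named in the post.
# Commented-out entries (lines starting with #) are ignored.
list_gpu_blacklists() {
    [ "$#" -gt 0 ] || set -- /etc/modprobe.d /lib/modprobe.d
    gpu_bl_matches=$(grep -rE '^[[:space:]]*blacklist[[:space:]]+(nouveau|nvidia)' "$@" 2>/dev/null || true)
    if [ -n "$gpu_bl_matches" ]; then
        # Each line is "<file>:<matching entry>"
        printf '%s\n' "$gpu_bl_matches"
    else
        echo "no nouveau/nvidia blacklist entries found"
    fi
}

# Typical use:
#   list_gpu_blacklists
```

On a system set up as described in the post, one would expect this to show a nouveau entry and no nvidia entry.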

@generix you may be familiar with this issue, could you please help?

After a fresh Ubuntu 20.04 install and any method of installing nvidia-driver-455.32, my desktop (Asus X570 motherboard build) will not keep working graphics for more than 2-5 minutes after boot. The computer boots and functions normally when the nvidia driver is purged and the nouveau driver is used. Booting with the nvidia 455 driver active gets as far as the purple login screen and stays up for 2-3 minutes, then the system crashes. Sometimes the GPU disconnects completely and the monitor reports no connection or activity; other times I get a black screen with a string of “NVRM Xid: GPU 0:00:08:00: 45” errors and have to switch to a tty (Ctrl+Alt+F2 through F12).

Before the crash, once Ubuntu has booted and I have logged in, I can open a terminal and run nvidia-smi, which returns a working table with 455 loaded and the 2080 Ti recognized. So the driver install worked and the GPU is recognized, but the GPU or driver crashes shortly after boot. After the crash, if I can reach a tty, nvidia-smi gives “Unable to determine the device handle for GPU 0000:08:00.0: Unknown Error”. This happens with the GPU reseated in either PCIe slot on the motherboard. I haven’t tried the GPU in a different computer yet, but that is a next step.
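To see which Xid codes are logged around the crash and how often, a small helper along these lines can tally them from the kernel log. This is only a sketch: the name `xid_summary` is hypothetical, and it assumes the usual kernel log line shape `NVRM: Xid (PCI:0000:08:00): 45, ...` rather than the abbreviated form quoted above.

```shell
# Summarise NVIDIA Xid events from kernel log text on stdin.
# Assumes lines shaped like "NVRM: Xid (PCI:0000:08:00): 45, ...".
# Prints "<count> <xid code>" for each distinct code, lowest code first.
xid_summary() {
    grep -oE 'Xid \([^)]*\): [0-9]+' \
        | sed -E 's/.*: ([0-9]+)$/\1/' \
        | sort -n | uniq -c | awk '{print $1, $2}'
}

# Typical use from a tty after the crash:
#   sudo dmesg | xid_summary
#   journalctl -k -b -1 | xid_summary   # kernel log of the previous boot
```

Posting the resulting codes alongside the bug report makes it easier to look them up in NVIDIA's Xid documentation.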

Some output info…

> sudo lshw -C display
*-display
description: VGA compatible controller
product: TU102 [GeForce RTX 2080 Ti Rev. A]
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:08:00.0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
configuration: driver=nvidia latency=0
resources: irq:93 memory:f6000000-f6ffffff memory:e0000000-efffffff memory:f0000000-f1ffffff ioport:e000(size=128) memory:c0000-dffff

After the crash, from a tty:

> nvidia-debugdump
Found 1 NVIDIA devices
Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x0): Unknown Error

contents of log_file: /var/log/gpu-manager.log

last_boot_file: /var/lib/ubuntu-drivers-common/last_gfx_boot
new_boot_file: /var/lib/ubuntu-drivers-common/last_gfx_boot
can't access /opt/amdgpu-pro/bin/amdgpu-pro-px
Looking for nvidia modules in /lib/modules/5.4.0-56-generic/updates/dkms
Found nvidia module: nvidia-uvm.ko
Looking for amdgpu modules in /lib/modules/5.4.0-56-generic/updates/dkms
Is nvidia loaded? yes
Was nvidia unloaded? no
Is nvidia blacklisted? no
Is intel loaded? no
Is radeon loaded? no
Is radeon blacklisted? no
Is amdgpu loaded? no
Is amdgpu blacklisted? no
Is amdgpu versioned? no
Is amdgpu pro stack? no
Is nouveau loaded? no
Is nouveau blacklisted? yes
Is nvidia kernel module available? yes
Is amdgpu kernel module available? no
Vendor/Device Id: 10de:1e07
BusID "PCI:8@0:0:0"
Is boot vga? yes
Skipping "/dev/dri/card0", driven by "nvidia-drm"
Skipping "/dev/dri/card0", driven by "nvidia-drm"
Skipping "/dev/dri/card0", driven by "nvidia-drm"
Skipping "/dev/dri/card0", driven by "nvidia-drm"
Does it require offloading? no
last cards number = 1
Has amd? no
Has intel? no
Has nvidia? yes
How many cards? 1
Has the system changed? No
Single card detected
Nothing to do

> lsmod | grep nvidia
nvidia_uvm 1003520 0
nvidia_drm 53248 0
nvidia_modeset 1212416 2 nvidia_drm
nvidia 27676672 19 nvidia_uvm,nvidia_modeset
drm_kms_helper 184320 1 nvidia_drm
drm 491520 3 drm_kms_helper,nvidia_drm
i2c_nvidia_gpu 16384 0

> lsmod | grep nouveau
(no output)

> nvidia-settings
Unable to init server: Could not connect: Connection refused
ERROR: The control display is undefined; please run nvidia-settings --help for usage information.

I cannot boot to recovery mode: it simply goes to a black screen from which no tty can be accessed. The monitor connected by HDMI to the RTX 2080 Ti stops detecting that a GPU is connected, so there are no graphics at all.

From the driver information log:
NVRM version: NVIDIA UNIX x86_64 Kernel Module 455.32.00 Wed Oct 14 22:46:18 UTC 2020
GCC version: gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)

Ryzen 9, RTX 2080ti, ubuntu 20.04 new install, 5.4.0-56-generic

no /etc/X11/xorg.conf or nvidia-install.log

bug-report.log.gz - nvidia-bug-report.log (594.3 KB)

From the bug report log, it looks like the GPU is experiencing some pretty significant stability problems. It’s not clear what’s causing that, though. I would recommend checking to see if there is a system BIOS update available for your motherboard and if that doesn’t help, trying the GPU in a different system would be a good next step for troubleshooting.
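Before looking for a BIOS update, it helps to know which version is currently installed. A minimal sketch, assuming the kernel exposes DMI data under /sys/class/dmi/id (it does on most x86 Linux systems); the function name `bios_info` and its optional directory argument are made up for illustration:

```shell
# Print board and BIOS identification from the kernel's DMI sysfs entries.
# The optional directory argument exists only so the function can be
# pointed at test data; it defaults to the real sysfs location.
bios_info() {
    bios_info_dir="${1:-/sys/class/dmi/id}"
    for bios_info_field in board_vendor board_name bios_version bios_date; do
        if [ -r "$bios_info_dir/$bios_info_field" ]; then
            printf '%s: %s\n' "$bios_info_field" "$(cat "$bios_info_dir/$bios_info_field")"
        else
            printf '%s: (not available)\n' "$bios_info_field"
        fi
    done
}

# Typical use:
#   bios_info
```

The reported version and date can then be compared against the latest release listed on the motherboard vendor's support page.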

Hi, I am having a lot of issues after an update that I did not allow!
The nvidia driver on my Ubuntu 20.04 was updated to 470. Now none of my simulations with LG SVL and Apollo/Autoware can be run! I have just 3 weeks left before I present my master’s thesis, and the Nvidia driver does not work and shuts down my system when I run the simulation; the whole computer turns off!
I checked, and a lot of people have had the same issue!
Doesn’t Nvidia test the drivers before they are forced onto Ubuntu? What must I do now to fix this mess?
I have already fixed some issues with Docker and needed to reinstall Docker, then pull and build all the images from scratch… and now this issue with the simulator… If I try to select one of the older drivers (before 470) in the software update program, the driver is not changed…