Ubuntu 20.04 GTX1080Ti suddenly "Unclaimed" and modprobe nvidia fails

Hi all,

I need some suggestions: I’m stuck on what to check next on a system that has dropped its card. I would rather not reinstall the OS, which is about the only option I have left.

The Ubuntu 20.04 LTS system ran successfully for 2 weeks but now seems to have lost the card.

Dmesg states (again and again):
[ 2242.792893] nvidia-nvlink: Nvlink Core is being initialized, major device number 509

[ 2242.793850] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=none
[ 2242.793867] NVRM: The NVIDIA GPU 0000:01:00.0
NVRM: (PCI ID: 10de:1b06) installed in this system has
NVRM: fallen off the bus and is not responding to commands.
[ 2242.793907] nvidia: probe of 0000:01:00.0 failed with error -1
[ 2242.793917] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 2242.793917] NVRM: None of the NVIDIA devices were initialized.
[ 2242.794110] nvidia-nvlink: Unregistered the Nvlink Core, major device number 509
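For anyone skimming a long kernel log for the same symptom, here is a minimal sketch that pulls out the PCI address of each GPU that NVRM reports as having fallen off the bus. The function name fallen_off_bus is made up for this example, and it assumes the message layout shown above (address on the "The NVIDIA GPU" line, followed later by the "fallen off the bus" line).

```shell
# Hypothetical helper (not part of any NVIDIA tool): reads kernel log text on
# stdin and prints each PCI address that NVRM reports as fallen off the bus.
fallen_off_bus() {
  awk '
    # Remember the most recent PCI address named on an NVRM GPU line.
    /NVRM: The NVIDIA GPU/ {
      for (i = 1; i <= NF; i++)
        if ($i ~ /^[0-9a-f]+:[0-9a-f]+:[0-9a-f]+\.[0-7]$/) addr = $i
    }
    # Report it only once the "fallen off the bus" line follows.
    /fallen off the bus/ && addr != "" { print addr; addr = "" }
  ' | sort -u
}
```

It could be fed from the live log with something like: sudo dmesg | fallen_off_bus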

lspci lists sensible outputs but does NOT list a driver in use:
01:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1) (prog-if 00 [VGA controller])
Subsystem: ASUSTeK Computer Inc. GP102 [GeForce GTX 1080 Ti]
Flags: fast devsel, IRQ 16
Memory at a3000000 (32-bit, non-prefetchable) [virtual] [size=16M]
Memory at 90000000 (64-bit, prefetchable) [virtual] [size=256M]
Memory at a0000000 (64-bit, prefetchable) [virtual] [size=32M]
I/O ports at 4000 [virtual] [size=128]
Expansion ROM at a4000000 [virtual] [disabled] [size=512K]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Legacy Endpoint, MSI 00
Capabilities: [100] Virtual Channel
Capabilities: [250] Latency Tolerance Reporting
Capabilities: [128] Power Budgeting
Capabilities: [420] Advanced Error Reporting
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024
Capabilities: [900] Secondary PCI Express
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
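The missing "Kernel driver in use" line can also be cross-checked in sysfs, where a bound device has a "driver" symlink in its directory. A rough sketch follows; the function name bound_driver and its optional second argument are inventions for this example, and the default path assumes the usual sysfs layout.

```shell
# Hypothetical helper: print the kernel driver bound to a PCI device,
# or "unclaimed" if no driver symlink exists.
# $1: PCI address, e.g. 0000:01:00.0
# $2: devices directory (assumed default: /sys/bus/pci/devices)
bound_driver() {
  dev_dir="${2:-/sys/bus/pci/devices}/$1"
  if [ -L "$dev_dir/driver" ]; then
    # The symlink target ends in the driver name, e.g. .../drivers/nvidia
    basename "$(readlink "$dev_dir/driver")"
  else
    echo "unclaimed"
  fi
}
```

On this machine the call would be: bound_driver 0000:01:00.0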

nvidia-detector claims a 510 is appropriate:
nvidia-driver-510

I’ve tried to set the driver through the Additional Drivers menu, which claims to be using 510.
I’ve checked for blacklists and found only the framebuffer listed:
grep -ri nvidia /etc/modprobe.d/
/etc/modprobe.d/blacklist-framebuffer.conf:blacklist nvidiafb
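Since a plain grep for "nvidia" also matches nvidiafb, and modprobe reads configuration from more places than /etc/modprobe.d, a slightly stricter check might look like the sketch below. The function name blacklist_hits is made up for this example. It only looks for exact "blacklist <module>" lines and does not account for modprobe treating "-" and "_" as equivalent.

```shell
# Hypothetical helper: list modprobe config lines that blacklist exactly
# the given module (so "nvidia" does not match "nvidiafb").
# $1: module name; remaining arguments: directories to search
blacklist_hits() {
  mod="$1"
  shift
  # -r: recurse into the directories, -H: always show the file name.
  grep -rHE "^[[:space:]]*blacklist[[:space:]]+${mod}[[:space:]]*(#.*)?$" "$@" 2>/dev/null
}
```

A typical call, assuming the directories documented in modprobe.d(5), would be: blacklist_hits nvidia /etc/modprobe.d /lib/modprobe.d /run/modprobe.d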

yet modprobe says:
modprobe nvidia
modprobe: ERROR: could not insert 'nvidia': No such device

It’s been rebooted, booted and punched, but it still doesn’t want to claim the card.
All suggestions gratefully received,

M

Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.

Thank you! Here it is

nvidia-bug-report.log.gz (4.2 MB)

Please post the output of
grep 10de /lib/udev/rules.d/*

Thanks again, I really appreciate that you know what to look for. Here is the output:

grep 10de /lib/udev/rules.d/*
/lib/udev/rules.d/71-nvidia.rules:SUBSYSTEM=="pci", ATTRS{vendor}=="0x10de", DRIVERS=="nvidia", TAG+="seat", TAG+="master-of-seat"
/lib/udev/rules.d/71-nvidia.rules:ACTION=="bind", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x03[0-9]*", TEST=="power/control", ATTR{power/control}="auto"
/lib/udev/rules.d/71-nvidia.rules:ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x040300", TEST=="power/control", ATTR{power/control}="auto"
/lib/udev/rules.d/71-nvidia.rules:ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x0c0330", TEST=="power/control", ATTR{power/control}="auto"
/lib/udev/rules.d/71-nvidia.rules:ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x0c8000", TEST=="power/control", ATTR{power/control}="auto"

Some things to try (one after another):

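  1. Check for a BIOS update.
  2. Delete /lib/udev/rules.d/71-nvidia.rules, then run
    sudo update-initramfs -u
    and reboot.
  3. Downgrade the kernel to 5.4:
    sudo apt install --install-recommends linux-generic
    https://wiki.ubuntu.com/Kernel/LTSEnablementStack
    If a newer kernel is still installed, GRUB will likely keep booting it by default, so pick the 5.4 entry under "Advanced options for Ubuntu" at boot.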
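To confirm which kernel series is actually running before and after the downgrade in step 3, a small sketch like this may help. The function name kernel_series is made up for this example; with no argument it falls back to uname -r.

```shell
# Hypothetical helper: reduce a kernel release string to "major.minor".
# $1: release string (optional; defaults to the running kernel)
kernel_series() {
  printf '%s\n' "${1:-$(uname -r)}" | cut -d. -f1,2
}
```

Running kernel_series after the reboot should print 5.4 if the downgraded kernel was the one that booted.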