[Ubuntu 20.04] Messed up driver installation - nvidia-smi No devices were found

Hi,

I set up a new Ubuntu Desktop 20.04 LTS system. I installed the 470 driver via Ubuntu's Additional Drivers tool. That worked, and from then on nvidia-smi showed correct output.

However, I needed to run CUDA, and I need 11.2. I decided to use the standalone .run file for that.

The installer told me that a system driver was detected and that removing it first is strongly recommended. So I did.
The installer then started and failed with a driver installation error. Some build step went wrong; I don't know which.

I then read that CUDA 11.2 only has a minimum driver requirement, so I wanted to use it with the previously installed driver. I removed the installer (nothing had been installed yet) and reinstalled the 470 driver via Ubuntu's tool. That worked without any error, but since then:

# nvidia-smi
No devices were found

And this is the current state. I purged EVERYTHING NVIDIA-related several times. All kernel modules and any NVIDIA traces were removed. I reinstalled them a hundred times - nothing. Switched to 495 - nothing.

I am not able to make the driver work correctly anymore. The CUDA test installation detected the driver but was not able to open the card.
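
On the minimum-driver point: a quick way to check whether an installed driver satisfies a toolkit's minimum is a plain version comparison. This is only a sketch. The helper name `version_ge` is made up for illustration, it relies on GNU `sort -V`, and the minimum used in the usage comment (460.27.03) is an assumption that should be verified against the CUDA release notes for the exact toolkit version.

```shell
# Succeeds (exit status 0) if dotted version $1 is greater than or equal to $2.
# Hypothetical helper; relies on the version ordering of GNU `sort -V`.
version_ge() {
    [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage sketch (not executed here):
#   MIN_DRIVER=460.27.03                      # assumed minimum, check the release notes
#   installed=$(modinfo -F version nvidia)    # version of the installed kernel module
#   version_ge "$installed" "$MIN_DRIVER" && echo "driver is new enough"
```

Note that this only compares version strings. It says nothing about whether the driver can actually initialize the card, which is the problem in this thread.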

Is there anything that I missed? Debug attached.

Thanks in advance!

# lspci -nnk | grep -iA2 vga
04:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2216] (rev a1)
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
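
For anyone who wants to check the same thing on their own machine, here is a minimal sketch that pulls the bound kernel driver for each NVIDIA PCI device out of `lspci -nnk` output. The function name `bound_driver` is hypothetical, and it assumes the output format shown above (a device line containing the vendor ID `10de`, followed by indented detail lines).

```shell
# Read `lspci -nnk` style text on stdin and print "<pci-address> <driver>"
# for every NVIDIA device (PCI vendor ID 10de) that has a kernel driver bound.
# Hypothetical helper, assuming the lspci layout shown in this post.
bound_driver() {
    awk '
        # Device lines start at column 0 with the PCI address.
        /^[0-9a-f]/ { dev = $1; is_nvidia = ($0 ~ /\[10de:/) }
        # Indented detail lines belong to the most recent device line.
        is_nvidia && /Kernel driver in use:/ { print dev, $NF }
    '
}

# Usage sketch (not executed here):
#   lspci -nnk | bound_driver
```

As this thread shows, a bound `nvidia` driver does not by itself mean the card initialized: here the driver was in use and nvidia-smi still reported no devices.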

nvidia-bug-report.log.gz (99.5 KB)

[   31.030373] NVRM: GPU 0000:04:00.0: Failed to copy vbios to system memory.

Not looking good; it seems the vbios got corrupted. Please power down the system, detach it from power, let it sit unpowered for 30 minutes, then try again. Is the 3080 still under warranty?

Oops - okay. How can this happen?

Yes, the card is still under warranty.

I tried your tip, disconnected the power source for 30 minutes, and booted the system up:

[    7.838931] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  495.46  Wed Oct 27 16:31:33 UTC 2021
[    7.940992] NVRM: GPU 0000:04:00.0: Failed to copy vbios to system memory.
[    7.941082] NVRM: GPU 0000:04:00.0: RmInitAdapter failed! (0x30:0xffff:970)
[    7.941104] NVRM: GPU 0000:04:00.0: rm_init_adapter failed, device minor number 0
[  225.252668] NVRM: GPU 0000:04:00.0: Failed to copy vbios to system memory.
[  225.252798] NVRM: GPU 0000:04:00.0: RmInitAdapter failed! (0x30:0xffff:970)
[  225.252830] NVRM: GPU 0000:04:00.0: rm_init_adapter failed, device minor number 0
[  225.371425] NVRM: GPU 0000:04:00.0: Failed to copy vbios to system memory.
[  225.371534] NVRM: GPU 0000:04:00.0: RmInitAdapter failed! (0x30:0xffff:970)
[  225.371575] NVRM: GPU 0000:04:00.0: rm_init_adapter failed, device minor number 0
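
The log above repeats the same three messages on every initialization attempt. A small filter makes that easier to see. This is a sketch only: the function name `nvrm_failures` is made up, and it assumes the default dmesg format with a bracketed timestamp at the start of each line.

```shell
# Read kernel log text on stdin and print each distinct NVRM failure
# message once, with the leading "[ timestamp]" removed.
# Hypothetical helper, assuming default dmesg timestamps.
nvrm_failures() {
    grep 'NVRM:' | grep -i 'fail' | sed 's/^\[[ 0-9.]*] *//' | sort -u
}

# Usage sketch (not executed here):
#   sudo dmesg | nvrm_failures
```

On the excerpt above this reduces nine failure lines to three distinct messages, with the vbios copy failure being the first one to occur in each attempt.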

Same issue. I will swap the card for another one and test whether the system works again.

Cosmic rays? Failing flash-rom? It shouldn’t happen but sometimes does, mostly on older cards. It can often be fixed by re-flashing the same vbios image, but a matching image is usually hard to find, and re-flashing is not recommended while the card is still under warranty. Which vendor/brand is the GPU?

It's an MSI GeForce RTX 3080 Suprim X 10G.

Since you have the LHR version, TechPowerUp doesn’t have a matching image available. It seems you’ll have to contact MSI support about that.

Small update:

A 2060 card is working properly in the system.

We took the 3080 out and put it into another machine running Windows. The card worked well in that system: the driver installed without issues and no problem was noticeable. Does Windows behave differently here? Does Windows not need to copy the vbios?

So, a Linux-specific issue then?

On rare occasions, this also happens due to kernel/driver bugs, but I’m not currently aware of any. Did you already try to re-insert it into the Linux system?

Next update: We tested it again today and now the 3080 has stopped working completely. No picture at all in the Windows test system. So we will try to RMA it now.

So it looks like picking it up and shaking it a bit re-connected something for a while.

Just to update this: The manufacturer accepted the card and acknowledged the issue. We will get the money back, but no replacement card.

nvidia-bug-report.log.gz (131.8 KB)
Could you help with this?
@generix