RTX 2080 Ti boots to black screen with "GPU:0: Lost display notification", "Error while waiting for GPU progress", "NVRM: RmInitAdapter failed!"

TLDR: Is it possible that my board is fried even though it works with nouveau?

I recently purchased an ex-display RTX 2080 Ti from a reputable vendor. Using the nouveau driver all works as well as expected (4k output, no acceleration), but when I use the nvidia driver the monitor switches off immediately after showing the GRUB menu.

I’ve been able to SSH into the box and collect a bug report: nvidia-bug-report.log.gz (882.6 KB)

Looking through the logs, I see this:

Dec 25 10:43:15 hjk-desktop kernel: nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
Dec 25 10:49:36 hjk-desktop kernel: nvidia-modeset: ERROR: GPU:0: Error while waiting for GPU progress: 0x0000c57e:0 2:0:392:380
Dec 25 10:49:41 hjk-desktop kernel: nvidia-modeset: ERROR: GPU:0: Error while waiting for GPU progress: 0x0000c57e:0 2:0:392:380
     <last line repeated many times>

and, elsewhere

Dec 24 15:08:20 hjk-MS-7C80 kernel: [   11.546230] nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
Dec 24 15:12:11 hjk-MS-7C80 kernel: [  242.666293] INFO: task nvidia-modeset/:639 blocked for more than 120 seconds.
Dec 24 15:12:11 hjk-MS-7C80 kernel: [  242.666296] nvidia-modeset/ D    0   639      2 0x80004000
Dec 24 15:14:12 hjk-MS-7C80 kernel: [  363.482418] INFO: task nvidia-modeset/:639 blocked for more than 241 seconds.
Dec 24 15:14:12 hjk-MS-7C80 kernel: [  363.482421] nvidia-modeset/ D    0   639      2 0x80004000

and the XServer is using 100% CPU, even though there doesn’t seem to be anything suspicious in the X.org logs.

Over the last few days it has also sometimes got as far as showing a blinking cursor, allowing me to switch to a VT, where I saw NVRM: RmInitAdapter failed! in the log, but this doesn’t seem to happen consistently and may depend on the driver version.

I’m pretty sure that my driver is installed and configured correctly, since if I remove my RTX card and install my old GTX 660 Ti everything works perfectly.

Hardware:

  • TU102 [GeForce RTX 2080 Ti Rev. A]
  • AOC U2879G6 4K monitor
  • MAG Z490 Tomahawk motherboard with latest firmware, CSM boot
  • Intel® Core™ i9-10900F CPU @ 2.80GHz
  • EVGA Supernova 750W (with two independent 8-pin PCIE cables powering the card)

Software:

  • Ubuntu 20.04 (clean install)
  • Nvidia driver version 450.80.02 (also tried 390, 418, 430, 435, 440, 455)
  • XServer 2:1.20.8-2ubuntu2.6

I’ve tried a whole bunch of things from lurking through these forums which have had no effect:

  • Installing the nvidia driver via the latest *.run installer
  • connecting to the monitor via either HDMI or Display Port
  • connecting to a lower resolution monitor (my old LG W2242S [1680x1050]; had to use USB-C and a converter since it only supports VGA) since I saw people having issues with some 4K devices
  • every nvidia driver version available with Ubuntu 20.04 (390/418/430/435/440/450/455)
  • various kernel arguments (nomodeset, nvidia-drm.modeset=1, mem_encrypt=off)
  • updating motherboard firmware to the latest
  • completely clean OS install

My best guess at this point is that the RTX board is faulty in some non-obvious way. This conclusion appears to be supported by the fact that my old GTX card works perfectly with an otherwise identical hardware and software config, but doesn’t explain how the nouveau driver is able to get the RTX card to function – though I guess it’s using far less of the board’s circuitry.

Any help or suggestions greatly appreciated - at present I plan on returning it once things begin to re-open in January.

You need UEFI boot for the nvidia driver. Guess you need to reinstall the OS in UEFI mode.

Ah, interesting. Is there a reference anywhere for which cards are supported using UEFI or legacy? In general this point seems to cause quite a bit of confusion on forums, but NVIDIA’s own list of supported products neglects to mention UEFI / legacy boot.

Even the product listing for the card claims it has “Dual Bios”, which I naively took to mean that it supports both legacy and UEFI.

I previously tried installing using UEFI, but could not get it to boot for the life of me – I had basically the same issue, except this time the screen had a few lines of pixel artefacting… thanks for the advice, I’ll have another go at it

Well that’s something I got from reading stuff at this forum.
Could not find a certain reference in nvidia docs either.
The RmInitAdapter error usually points to hardware failure.
Might still be worth a try before returning the card.
Secure boot must be disabled… in case you do…
Make sure nouveau is properly disabled…
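
For anyone who wants to verify those two points from an SSH session before retrying, here is a minimal sketch in Python. It relies on the fact that Linux exposes /sys/firmware/efi only when the system was booted via UEFI, and that /proc/modules lists the loaded kernel modules. The function names are my own, and it does not check Secure Boot state.

```python
import os


def booted_in_uefi(efi_dir="/sys/firmware/efi"):
    """Return True if the kernel was booted via UEFI (the directory exists)."""
    return os.path.isdir(efi_dir)


def loaded_modules(proc_modules="/proc/modules"):
    """Return the set of loaded kernel module names (empty if unreadable)."""
    try:
        with open(proc_modules) as f:
            # The first whitespace-separated field on each line is the module name.
            return {line.split()[0] for line in f if line.strip()}
    except OSError:
        return set()


def report(efi_dir="/sys/firmware/efi", proc_modules="/proc/modules"):
    """Summarise boot mode and which of the two GPU drivers are loaded."""
    mods = loaded_modules(proc_modules)
    return {
        "uefi_boot": booted_in_uefi(efi_dir),
        "nouveau_loaded": "nouveau" in mods,
        "nvidia_loaded": "nvidia" in mods,
    }


if __name__ == "__main__":
    for key, value in report().items():
        print(f"{key}: {value}")
```

If nouveau_loaded comes back True while you are trying to use the proprietary driver, the blacklist has not taken effect.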

EFI boot is only mandatory when using Teslas and even that’s only due to buggy system firmware, not due to the Tesla by itself. R/GTX type cards work fine with CSM boot, at least I haven’t seen any problems with that lately.
In the logs, you were always getting an XID 61. Combined with the RmInit failed messages you were seeing, I’m leaning towards a defective GPU. I guess you already reseated the board in its slot and checked the power cabling. There’s a slim chance that you’re running into some board/CPU incompatibility, so check whether it works in another system first, then try to get an RMA from your vendor, if possible.
Other than that, you already tried everything possible.
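
If it helps when comparing driver versions or a second system, here is a rough Python sketch that tallies the messages discussed in this thread from kernel log lines. The nvidia-modeset and RmInitAdapter patterns are taken from the excerpts quoted above; the Xid pattern assumes lines of the form "NVRM: Xid (<pci address>): <number>, ...", which is how they usually appear but is not copied from this particular log. The names PATTERNS, XID and summarize are made up for illustration.

```python
import re
from collections import Counter

# Patterns based on the log excerpts quoted in the original post.
PATTERNS = {
    "lost_display": re.compile(
        r"nvidia-modeset: WARNING: GPU:\d+: Lost display notification"),
    "gpu_progress_timeout": re.compile(
        r"nvidia-modeset: ERROR: GPU:\d+: Error while waiting for GPU progress"),
    "hung_task": re.compile(
        r"task nvidia-modeset\S* blocked for more than \d+ seconds"),
    "rm_init_failed": re.compile(r"NVRM: RmInitAdapter failed"),
}

# Assumed shape of an Xid report line; adjust if your driver formats it differently.
XID = re.compile(r"NVRM: Xid \([^)]*\): (\d+)")


def summarize(lines):
    """Count known failure messages and Xid codes in an iterable of log lines."""
    counts = Counter()
    xids = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
        match = XID.search(line)
        if match:
            xids[int(match.group(1))] += 1
    return counts, xids
```

To use it, pass any iterable of lines, for example an open file handle on a saved kernel log, and compare the two counters between runs. A card that produces the same Xid and RmInitAdapter failures in a different machine would support the defective-GPU theory.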