RTX 4090 seems to crash shortly after login, independent of OS/driver version

General Problem Description

I am trying to get a newly built PC running with an RTX 4090; the specs are posted below. Since I plan to use it as a dual-boot system with Windows/Ubuntu, I tried many different configurations, all resulting in similar behavior: once a graphics driver is installed, the graphical user interface freezes shortly after logging in (most of the time within a few seconds). That is the general problem; the next section describes what I have tried so far. Based on this, I do not think it is a software or driver problem.

My goal here is to find out whether there is anything obvious that I have missed and, if possible, to diagnose which part of the system is faulty (that is, whether I have to RMA the GPU, mainboard, or power supply). Any help would be appreciated.

Attempted Solutions

  • Use of different Operating Systems: Windows 10, Windows 11, Ubuntu 22.04 LTS, Ubuntu 22.10
  • On Windows, tried different Drivers:
    ** NVIDIA Studio Driver - WHQL (528.02)
    ** GeForce Game Ready Driver - WHQL (528.02)
    ** Gigabyte NVIDIA Driver (522.25)
    ** Possibly others as well
  • Update of Mainboard BIOS
  • Tried both the OC/Silent Position of the switch on the RTX 4090
  • Tried to update GPU BIOS with the official Tools from Gigabyte (Link removed due to forum limitation for new users)
    ** In every combination of OC/Silent switch position and Gigabyte GPU BIOS tool, the tool reported “This BIOS version does not Match”
  • Tried to update with the NVIDIA GPU UEFI Firmware Update Tool: (Link removed due to forum limitation for new users)
    ** This indicated that the version was already current; unfortunately, I did not save a screenshot.
  • Tried different mainboard BIOS settings, such as switching the PCIe generation from Auto to Gen 4 or Gen 3, and further combinations. Most testing, however, was done with the standard settings.
    ** Both SSDs are installed on a separate PCI Express bus from the GPU
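
To check whether a PCIe Gen setting from the BIOS actually took effect under Linux, the negotiated link can be read from sysfs. The following is a minimal sketch, assuming the GPU sits at PCI address 0000:01:00.0 (as in the log further down) and that the kernel exposes the usual `current_link_speed` / `current_link_width` attributes; the function name `read_pcie_link` is just illustrative.

```python
from pathlib import Path


def read_pcie_link(device="0000:01:00.0", sysfs_root="/sys/bus/pci/devices"):
    """Return current/max PCIe link speed and width for one device.

    The device address is an assumption taken from the kernel log;
    attributes that are missing are reported as None.
    """
    dev_dir = Path(sysfs_root) / device
    info = {}
    for attr in ("current_link_speed", "current_link_width",
                 "max_link_speed", "max_link_width"):
        path = dev_dir / attr
        info[attr] = path.read_text().strip() if path.exists() else None
    return info


if __name__ == "__main__":
    for key, value in read_pcie_link().items():
        print(f"{key}: {value}")
```

Note that a GPU may lower its link speed while idle to save power, so a low `current_link_speed` on an idle desktop is not necessarily a fault; the values are most meaningful when read under load and before the card drops off the bus.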

In one case, under a freshly installed Ubuntu 22.10 with the driver installed by Ubuntu (525.78.01-0ubuntu0.22.10.1), I was able to run the Unigine Superposition benchmark: (Link removed due to forum limitation for new users). Before starting, I set the power settings to prefer maximum performance, following some hints I had seen while researching the issue. Interestingly, the benchmark ran to completion (only this one time) and reported the correct card, 100% CPU usage, and plausible frame rates/scores. However, when I went to save the results after the benchmark, the system froze again.

This leads me to believe that the power supply should be OK, since the benchmark was able to run to completion. Otherwise, I would have expected an earlier failure.

nvidia-bug-report/GPU falls off the bus

Under Ubuntu 22.10, when I log into a tty, I get the following messages in the syslog:

Jan 22 00:10:23 johannes-Z790-AERO-G kernel: [   17.195769] pcieport 0000:00:01.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:01.0
Jan 22 00:10:23 johannes-Z790-AERO-G kernel: [   17.195774] pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
Jan 22 00:10:23 johannes-Z790-AERO-G kernel: [   17.195783] pcieport 0000:00:01.0:   device [8086:a70d] error status/mask=00100000/00010000
Jan 22 00:10:23 johannes-Z790-AERO-G kernel: [   17.195789] pcieport 0000:00:01.0:    [20] UnsupReq               (First)
Jan 22 00:10:23 johannes-Z790-AERO-G kernel: [   17.195794] pcieport 0000:00:01.0: AER:   TLP Header: 34000000 01000010 00000000 00000000
Jan 22 00:10:23 johannes-Z790-AERO-G kernel: [   17.195801] nvidia 0000:01:00.0: AER: can't recover (no error_detected callback)
Jan 22 00:10:23 johannes-Z790-AERO-G kernel: [   17.195801] snd_hda_intel 0000:01:00.1: AER: can't recover (no error_detected callback)
Jan 22 00:10:23 johannes-Z790-AERO-G kernel: [   17.195812] pcieport 0000:00:01.0: AER: device recovery failed
Jan 22 00:10:23 johannes-Z790-AERO-G kernel: [   17.565295] NVRM: GPU at PCI:0000:01:00: GPU-695bdbb4-8c56-b809-4f9c-c9e864a3ad2e
Jan 22 00:10:23 johannes-Z790-AERO-G kernel: [   17.565318] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Jan 22 00:10:23 johannes-Z790-AERO-G kernel: [   17.565329] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
Jan 22 00:10:23 johannes-Z790-AERO-G kernel: [   17.565351] NVRM: A GPU crash dump has been created. If possible, please run
Jan 22 00:10:23 johannes-Z790-AERO-G kernel: [   17.565351] NVRM: nvidia-bug-report.sh as root to collect this data before
Jan 22 00:10:23 johannes-Z790-AERO-G kernel: [   17.565351] NVRM: the NVIDIA kernel module is unloaded.

This is reproducible. The card's fans stop and it appears dead. Attached is the bug report:
nvidia-bug-report.log.gz (116.9 KB)
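
To compare several boots, the relevant events can be pulled out of the syslog automatically. Below is a rough sketch that scans log lines for root-port AER messages and NVIDIA Xid messages and returns them with their kernel timestamps. The regular expressions are written against the exact message format shown above and are an assumption for other kernel or driver versions; the function name `scan_kernel_log` and the default path `/var/log/syslog` are illustrative.

```python
import re
import sys

# Patterns match the message format seen in the log above (assumption:
# other kernel/driver versions may word these messages differently).
AER_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+pcieport (?P<port>[0-9a-f:.]+): AER: "
    r"(?P<kind>Uncorrected|Corrected)(?: \((?P<sev>[^)]+)\))? error received"
)
XID_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM: Xid \((?P<dev>PCI:[0-9a-f:.]+)\): "
    r"(?P<xid>\d+),"
)


def scan_kernel_log(lines):
    """Collect AER and Xid events, in log order, as a list of dicts."""
    events = []
    for line in lines:
        m = AER_RE.search(line)
        if m:
            events.append({"time": float(m.group("ts")), "type": "AER",
                           "port": m.group("port"),
                           "kind": m.group("kind"),
                           "severity": m.group("sev")})
            continue
        m = XID_RE.search(line)
        if m:
            events.append({"time": float(m.group("ts")), "type": "Xid",
                           "device": m.group("dev"),
                           "xid": int(m.group("xid"))})
    return events


if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/syslog"
    with open(path, errors="replace") as handle:
        for event in scan_kernel_log(handle):
            print(event)
```

Applied to the excerpt above, this picks up the AER error on the root port at 17.195769 and the Xid 79 at 17.565318, so the PCIe error is logged roughly 0.37 s before the driver reports that the GPU has fallen off the bus.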

I do not know how to generate something like this under Windows to see if it is similar. Is it possible to tell from this information whether the card is bad or whether there is a problem with the mainboard?
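
One piece of the log that can be decoded mechanically is the `error status/mask=00100000/00010000` pair. The sketch below maps set bits to error names, assuming the bit layout of the PCIe AER Uncorrectable Error Status register as I understand it; the table should be checked against the PCIe specification, and the helper name `decode_aer_uncorrectable` is made up for this example. It only names the error type and does not by itself say whether the GPU or the mainboard is at fault.

```python
# Bit positions of the AER Uncorrectable Error Status register
# (assumption: taken from my reading of the PCIe spec, not exhaustive).
UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
}


def _names(value):
    """List the names of all bits set in a 32-bit register value."""
    return [UNCORRECTABLE_BITS.get(bit, f"bit {bit} (unknown)")
            for bit in range(32) if value & (1 << bit)]


def decode_aer_uncorrectable(status_hex, mask_hex):
    """Decode the status and mask words printed by the kernel as hex."""
    return {"reported": _names(int(status_hex, 16)),
            "masked": _names(int(mask_hex, 16))}


if __name__ == "__main__":
    print(decode_aer_uncorrectable("00100000", "00010000"))
```

For the values in the log, the status word has only bit 20 set, which agrees with the `[20] UnsupReq` line printed by the kernel.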

PC Specs

CPU: Intel Core i9-13900K 3 GHz 24-Core Processor
CPU Cooler: Noctua NH-D15S chromax.black 82.51 CFM CPU Cooler
Motherboard: Gigabyte Z790 AERO G ATX LGA1700 Motherboard
Memory: G.Skill Trident Z5 RGB 64 GB (2 x 32 GB) DDR5-6400 CL32 Memory
Storage: Samsung 980 Pro 1 TB M.2-2280 PCIe 4.0 X4 NVME Solid State Drive
Storage: Samsung 980 Pro 2 TB M.2-2280 PCIe 4.0 X4 NVME Solid State Drive
Video Card: Gigabyte AORUS MASTER GeForce RTX 4090 24 GB Video Card
Case: Fractal Design Pop Air ATX Mid Tower Case
Power Supply: be quiet! Straight Power 11 1200 W 80+ Platinum Certified Fully Modular ATX Power Supply

I also have problems installing the driver for the 4090 on Ubuntu. I suppose that something is not compliant with Ubuntu.