I received two RTX 4090 cards, and both make my system freeze completely between 1 and 3 minutes after each boot, no matter what I’m doing (idling on the desktop or trying a game).
Quick summary:
Here is my everyday setup, which runs without any freezes:
System: Kernel: 5.15.0-52-generic x86_64 bits: 64 compiler: N/A Desktop: Cinnamon 5.2.7
wm: muffin dm: LightDM Distro: Linux Mint 20.3 Una base: Ubuntu 20.04 focal
Machine: Type: Desktop System: ASUS product: N/A v: N/A serial: <filter>
Mobo: ASUSTeK model: ROG MAXIMUS Z690 HERO v: Rev 1.xx serial: <filter>
UEFI: American Megatrends v: 1304 date: 03/07/2022
Battery: Device-1: hid-80:4a:14:6d:aa:d5-battery model: Magic Keyboard with Numeric Keypad
serial: N/A charge: N/A status: Discharging
CPU: Topology: 12-Core model: 12th Gen Intel Core i9-12900K bits: 64 type: MT MCP arch: N/A
L2 cache: 30.0 MiB
flags: avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx bogomips: 152985
Speed: 700 MHz min/max: 800/5200 MHz Core speeds (MHz): 1: 700 2: 701 3: 700 4: 700
5: 786 6: 800 7: 700 8: 701 9: 800 10: 800 11: 800 12: 800 13: 700 14: 800 15: 800
16: 800 17: 800 18: 700 19: 700 20: 700 21: 700 22: 700 23: 798 24: 800
Graphics: Device-1: NVIDIA vendor: ASUSTeK driver: nvidia v: 520.56.06 bus ID: 01:00.0
chip ID: 10de:2206
Display: x11 server: X.Org 1.20.13 driver: nvidia tty: N/A
OpenGL: renderer: NVIDIA GeForce RTX 3080/PCIe/SSE2 v: 4.6.0 NVIDIA 520.56.06
direct render: Yes
sudo lspci:
00:00.0 Host bridge: Intel Corporation Device 4660 (rev 02)
00:01.0 PCI bridge: Intel Corporation Device 460d (rev 02)
00:06.0 PCI bridge: Intel Corporation Device 464d (rev 02)
00:0a.0 Signal processing controller: Intel Corporation Device 467d (rev 01)
00:0e.0 RAID bus controller: Intel Corporation Volume Management Device NVMe RAID Controller
00:14.0 USB controller: Intel Corporation Device 7ae0 (rev 11)
00:14.2 RAM memory: Intel Corporation Device 7aa7 (rev 11)
00:14.3 Network controller: Intel Corporation Device 7af0 (rev 11)
00:15.0 Serial bus controller [0c80]: Intel Corporation Device 7acc (rev 11)
00:15.1 Serial bus controller [0c80]: Intel Corporation Device 7acd (rev 11)
00:15.2 Serial bus controller [0c80]: Intel Corporation Device 7ace (rev 11)
00:16.0 Communication controller: Intel Corporation Device 7ae8 (rev 11)
00:17.0 SATA controller: Intel Corporation Device 7ae2 (rev 11)
00:1b.0 PCI bridge: Intel Corporation Device 7ac0 (rev 11)
00:1c.0 PCI bridge: Intel Corporation Device 7ab8 (rev 11)
00:1c.1 PCI bridge: Intel Corporation Device 7ab9 (rev 11)
00:1c.3 PCI bridge: Intel Corporation Device 7abb (rev 11)
00:1c.4 PCI bridge: Intel Corporation Device 7abc (rev 11)
00:1d.0 PCI bridge: Intel Corporation Device 7ab0 (rev 11)
00:1f.0 ISA bridge: Intel Corporation Device 7a84 (rev 11)
00:1f.3 Audio device: Intel Corporation Device 7ad0 (rev 11)
00:1f.4 SMBus: Intel Corporation Device 7aa3 (rev 11)
00:1f.5 Serial bus controller [0c80]: Intel Corporation Device 7aa4 (rev 11)
01:00.0 VGA compatible controller: NVIDIA Corporation Device 2206 (rev a1)
01:00.1 Audio device: NVIDIA Corporation Device 1aef (rev a1)
02:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd Device a80a
05:00.0 SATA controller: ASMedia Technology Inc. ASM1062 Serial ATA Controller (rev 02)
06:00.0 Ethernet controller: Intel Corporation Device 15f3 (rev 03)
07:00.0 PCI bridge: Intel Corporation Device 1136 (rev 02)
08:00.0 PCI bridge: Intel Corporation Device 1136 (rev 02)
08:01.0 PCI bridge: Intel Corporation Device 1136 (rev 02)
08:02.0 PCI bridge: Intel Corporation Device 1136 (rev 02)
08:03.0 PCI bridge: Intel Corporation Device 1136 (rev 02)
09:00.0 USB controller: Intel Corporation Device 1137
3d:00.0 USB controller: Intel Corporation Device 1138
sudo lspci -tv
-[0000:00]-+-00.0 Intel Corporation Device 4660
+-01.0-[01]--+-00.0 NVIDIA Corporation Device 2206
| \-00.1 NVIDIA Corporation Device 1aef
+-06.0-[02]----00.0 Samsung Electronics Co Ltd Device a80a
+-0a.0 Intel Corporation Device 467d
+-0e.0 Intel Corporation Volume Management Device NVMe RAID Controller
+-14.0 Intel Corporation Device 7ae0
+-14.2 Intel Corporation Device 7aa7
+-14.3 Intel Corporation Device 7af0
+-15.0 Intel Corporation Device 7acc
+-15.1 Intel Corporation Device 7acd
+-15.2 Intel Corporation Device 7ace
+-16.0 Intel Corporation Device 7ae8
+-17.0 Intel Corporation Device 7ae2
+-1b.0-[03]--
+-1c.0-[04]--
+-1c.1-[05]----00.0 ASMedia Technology Inc. ASM1062 Serial ATA Controller
+-1c.3-[06]----00.0 Intel Corporation Device 15f3
+-1c.4-[07-70]----00.0-[08-70]--+-00.0-[09]----00.0 Intel Corporation Device 1137
| +-01.0-[0a-3c]--
| +-02.0-[3d]----00.0 Intel Corporation Device 1138
| \-03.0-[3e-70]--
+-1d.0-[71]--
+-1f.0 Intel Corporation Device 7a84
+-1f.3 Intel Corporation Device 7ad0
+-1f.4 Intel Corporation Device 7aa3
\-1f.5 Intel Corporation Device 7aa4
As soon as I swap my 3080 for one of the RTX 4090s, the freeze occurs straight away. Here is the system log:
Oct 18 20:26:07 linuxmint kernel: NVRM: GPU at PCI:0000:01:00: GPU-bebd8b3c-a7ea-a03e-ee01-4c6d74304ee3
Oct 18 20:26:07 linuxmint kernel: NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Oct 18 20:26:07 linuxmint kernel: NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
Oct 18 20:26:07 linuxmint kernel: pcieport 0000:00:01.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:01.0
Oct 18 20:26:07 linuxmint kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
Oct 18 20:26:07 linuxmint kernel: pcieport 0000:00:01.0: device [8086:460d] error status/mask=00100000/00010000
Oct 18 20:26:07 linuxmint kernel: pcieport 0000:00:01.0: [20] UnsupReq (First)
Oct 18 20:26:07 linuxmint kernel: pcieport 0000:00:01.0: AER: TLP Header: 34000000 01000010 00000000 00000000
Oct 18 20:26:07 linuxmint kernel: nvidia 0000:01:00.0: AER: can't recover (no error_detected callback)
Oct 18 20:26:07 linuxmint kernel: snd_hda_intel 0000:01:00.1: AER: can't recover (no error_detected callback)
Oct 18 20:26:07 linuxmint kernel: pcieport 0000:00:01.0: AER: device recovery failed
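For reference, the error status/mask values in the AER line above can be decoded by hand. The following is a minimal Python sketch, assuming the standard bit layout of the PCIe AER Uncorrectable Error Status register (only a subset of bits is listed). The function name decode_aer_bits is made up for illustration.

```python
# Bit positions of the PCIe AER Uncorrectable Error Status register.
# Assumption: standard layout from the PCI Express Base Specification (subset only).
UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
}


def decode_aer_bits(hex_value):
    """Return the names of all known bits set in a hex register value."""
    value = int(hex_value, 16)
    return [
        name
        for bit, name in sorted(UNCORRECTABLE_BITS.items())
        if value & (1 << bit)
    ]


# Values copied from the "error status/mask=00100000/00010000" log line.
status, mask = "00100000", "00010000"
print("status:", decode_aer_bits(status))
print("mask:  ", decode_aer_bits(mask))
```

The status value has only bit 20 set, which matches the "[20] UnsupReq" line the kernel prints right after it.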
I tried pci=check_enable_amd_mmconf and idle=nomwait as kernel parameters, as well as the others described here: "GPU has fallen off the bus" while idle, only occurs when all displays powered off - #2 by BCS. I also tried pci=noaer as a GRUB boot option.
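One way to confirm that these parameters actually reached the running kernel is to compare them against /proc/cmdline after a reboot. This is a minimal Python sketch, not a definitive check. The helper name missing_params and the WANTED list are made up for illustration.

```python
from pathlib import Path


def missing_params(cmdline, wanted):
    """Return the wanted parameters that are absent from a kernel command line."""
    active = set(cmdline.split())
    return [param for param in wanted if param not in active]


# Parameters mentioned in this post; adjust to whatever is being tested.
WANTED = ["pci=noaer", "idle=nomwait"]

# /proc/cmdline holds the command line the running kernel was booted with.
proc_cmdline = Path("/proc/cmdline")
if proc_cmdline.exists():
    print("not active:", missing_params(proc_cmdline.read_text(), WANTED))
```

An empty list means every parameter in WANTED is present on the booted command line. Anything listed was not picked up, for example because the GRUB configuration was not regenerated.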
The PSU is a 1000 W Gigabyte UD1000GM PG5 with a PCIe Gen 5.0 (1 x 16-pin) port, and the same setup runs for hours under Windows without any issue.
I also tried uninstalling the driver (sudo apt remove --purge nvidia-*) and reinstalling it, but the issue is the same. Sometimes I have enough time to launch a game, and it runs at full speed on ultra settings at 180 Hz.
I tried a fresh Linux Mint 21 installation with the 520 drivers and Steam: same issue, it freezes.
The same GPU on Windows 11 has no issues.
The thermals are also good: https://i.imgur.com/DA4ER2O.png
I’m torn between a kernel bug and a driver bug. What can I try to find out what the problem is?