Freeze with “GPU has fallen off the bus” on RTX 3080 16GB Laptop (AORUS 15P YD)

Hi,

I have an RTX 3080 Laptop GPU (16GB) in an AORUS 15P YD, on which I reliably get a “GPU has fallen off the bus” error some time after booting whenever the NVIDIA GPU is used for anything related to the X server or DRM. Sometimes it happens at the login screen, sometimes a few minutes after login (usually within 20 minutes).

First of all, I’m (mostly) ruling out a hardware defect, as Windows with the most recent NVIDIA drivers runs stably and I have had no issues there so far.

Ideally, this should run on Ubuntu 20.04 (which uses the 5.13 kernel), although I have also tried going up to Ubuntu 21.10 and POP!OS (which uses 5.15), as well as down to kernels 5.8 and 5.4.

The only change I make to a stock Ubuntu 20.04 install is installing the NVIDIA drivers.

I tried nvidia drivers 460, 470, 495, 510.47 and 510.54.

The nouveau driver seems to be (mostly?) stable, although I’ve not tested it enough to say so with confidence.
Using nouveau is not really an option for me, as this is my work laptop and I need both OpenGL and CUDA to run on the NVIDIA GPU.

I also tried all BIOS versions available for the laptop. All of them work fine in Windows, but the problem persists on Linux.

I also tried disabling a number of power management features, such as D3 power management, pcie_port_pm and pcie_aspm, as well as various ACPI-related options. My suspicion was that the card gets powered down or off while in use, which might cause the driver to crash.
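
For anyone who wants to check whether the kernel is allowed to runtime-suspend the card, the PCI runtime power management state can be read from sysfs. The following is only a minimal Python sketch: the function name `read_runtime_pm` is made up for illustration, and it assumes the GPU sits at PCI address 0000:01:00.0, as in the logs in this post.

```python
from pathlib import Path


def read_runtime_pm(bdf="0000:01:00.0", sysfs_root="/sys/bus/pci/devices"):
    """Read the runtime PM attributes of one PCI device from sysfs.

    bdf is the PCI address (assumed here to be the GPU from this post).
    Returns a dict; a value is None if the attribute could not be read.
    """
    power_dir = Path(sysfs_root) / bdf / "power"
    result = {}
    # "control" is "on" (runtime PM disabled) or "auto" (kernel may suspend).
    # "runtime_status" is e.g. "active" or "suspended".
    for name in ("control", "runtime_status"):
        try:
            result[name] = (power_dir / name).read_text().strip()
        except OSError:
            result[name] = None
    return result
```

Calling `read_runtime_pm()` before a freeze and again over ssh afterwards may show whether the device was marked as suspended at the time, although this alone does not prove that power management is the cause.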

I also noticed that the NVIDIA audio device is bound to an Intel driver (snd_hda_intel), which I found odd, but I don’t know whether that could cause problems:

# Relevant section of the lspci -vvv output after the freeze

01:00.0 VGA compatible controller: NVIDIA Corporation GA104M [GeForce RTX 3080 Mobile / Max-Q 8GB/16GB] (rev ff) (prog-if ff)
	!!! Unknown header type 7f
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

01:00.1 Audio device: NVIDIA Corporation Device 228b (rev ff) (prog-if ff)
	!!! Unknown header type 7f
	Kernel driver in use: snd_hda_intel
	Kernel modules: snd_hda_intel
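
The `rev ff` and `Unknown header type 7f` values are what lspci typically prints when reads of a device's PCI config space return all ones, i.e. the device no longer answers on the bus. As a rough way to spot this in saved `lspci -vvv` output, here is a minimal Python sketch (the function name `find_unresponsive_devices` is made up for illustration):

```python
import re

# First line of an lspci device entry, with or without a PCI domain prefix,
# e.g. "01:00.0 VGA compatible controller: ..."
DEVICE_LINE = re.compile(
    r"^(?P<addr>(?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s"
)


def find_unresponsive_devices(lspci_text):
    """Return addresses of devices whose entry looks like all-ones config reads."""
    flagged = []
    current = None
    for line in lspci_text.splitlines():
        match = DEVICE_LINE.match(line)
        if match:
            # New device entry: check the revision shown on the header line.
            current = match.group("addr")
            suspicious = "(rev ff)" in line
        else:
            # Continuation line of the current device entry.
            suspicious = current is not None and "Unknown header type 7f" in line
        if suspicious and current not in flagged:
            flagged.append(current)
    return flagged
```

Fed the excerpt above, this flags both 01:00.0 and 01:00.1, which matches the picture of the whole card (GPU and its audio function) having dropped off the bus.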

The only “workaround” I have found so far that keeps the NVIDIA GPU from falling off the bus is Compute mode from system76-power, which writes the following lines to /etc/modprobe.d/system76-power.conf:

# Automatically generated by system76-power
blacklist i2c_nvidia_gpu
blacklist nvidia-drm
blacklist nvidia-modeset
alias i2c_nvidia_gpu off
alias nvidia-drm off
alias nvidia-modeset off
options nvidia NVreg_DynamicPowerManagement=0x02
# Preserve video memory through suspend
options nvidia NVreg_PreserveVideoMemoryAllocations=1

Interestingly, in Hybrid mode, which writes the following to /etc/modprobe.d/system76-power.conf:

# Automatically generated by system76-power
blacklist i2c_nvidia_gpu
alias i2c_nvidia_gpu off
options nvidia NVreg_DynamicPowerManagement=0x02
options nvidia-drm modeset=1
# Preserve video memory through suspend
options nvidia NVreg_PreserveVideoMemoryAllocations=1

the GPU still falls off the bus.
Therefore, I suspect that either nvidia-drm or nvidia-modeset is the failing component, although that is more of a guess than anything else.
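
To make the difference between the two modes explicit, the two generated files can be compared directive by directive. This is only a minimal Python sketch with made-up helper names (`parse_modprobe_conf`, `diff_modprobe_conf`). It relies on modprobe treating `-` and `_` in module names as equivalent, and it normalizes names accordingly.

```python
def parse_modprobe_conf(text):
    """Parse modprobe.d-style text into a set of (directive, module, args) tuples.

    Comments and blank lines are ignored. Dashes in module names are
    normalized to underscores so that nvidia-drm and nvidia_drm compare equal.
    """
    entries = set()
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 2)
        directive = parts[0]
        module = parts[1].replace("-", "_") if len(parts) > 1 else ""
        args = parts[2] if len(parts) > 2 else ""
        entries.add((directive, module, args))
    return entries


def diff_modprobe_conf(text_a, text_b):
    """Return (only_in_a, only_in_b) as sorted lists of parsed entries."""
    a = parse_modprobe_conf(text_a)
    b = parse_modprobe_conf(text_b)
    return sorted(a - b), sorted(b - a)
```

Applied to the two files above, the only differences are that Compute mode blacklists and aliases off nvidia-drm and nvidia-modeset, while Hybrid mode loads nvidia-drm with modeset=1. That fits the guess above, but does not tell the two modules apart.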

I have attached one NVIDIA bug report log and can generate more if necessary. I don’t have a reliable method to trigger the freeze immediately, but I can make it happen within a reasonable timeframe.

Sorry for the wall of text, here’s a TLDR:
Error: GPU has fallen off the bus
Steps to reproduce:

  1. Install a fresh Ubuntu 20.04, Ubuntu 21.10 or POP!OS 21.10 on an AORUS 15P YD
  2. Install the NVIDIA drivers (e.g. sudo apt install nvidia-driver-510, or 470, 460 or 495)
  3. Reboot
  4. Wait for the crash
  5. The system freezes and is unresponsive, although ssh still works
  6. Check dmesg and find:
...
[   22.079296] audit: type=1400 audit(1645793344.680:44): apparmor="DENIED" operation="open" profile="snap.snap-store.ubuntu-software" name="/etc/PackageKit/Vendor.conf" pid=2247 comm="snap-store" requested_mask="r" denied_mask="r" fsuid=1000 ouid=0
[   30.848138] NVRM: GPU at PCI:0000:01:00: GPU-e5a2c765-97ab-76de-eaf8-021ea4ed93bc
[   30.848143] NVRM: Xid (PCI:0000:01:00): 79, pid=2835, GPU has fallen off the bus.
[   30.848146] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
[   30.848190] NVRM: GPU 0000:01:00.0: GPU serial number is \xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff.
[   40.884545] Asynchronous wait on fence NVIDIA:nvidia.prime:2e6 timed out (hint:intel_atomic_commit_ready [i915])
[  100.671825] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c67d:0:0:0x0000000f
...
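
Since ssh still works after the freeze, the kernel log can be scanned for Xid events remotely. Below is a minimal Python sketch for pulling them out of saved dmesg text. The function name `find_xid_events` is made up for illustration, and the regular expression is only modelled on the log lines shown above, so other driver versions may format them differently.

```python
import re

# Modelled on lines such as:
# [   30.848143] NVRM: Xid (PCI:0000:01:00): 79, pid=2835, GPU has fallen off the bus.
XID_LINE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM: Xid \((?P<dev>PCI:[0-9a-fA-F:.]+)\): "
    r"(?P<xid>\d+),\s*(?P<msg>.*)$"
)


def find_xid_events(dmesg_text):
    """Return one dict per NVRM Xid line found in the given dmesg text."""
    events = []
    for line in dmesg_text.splitlines():
        match = XID_LINE.match(line.strip())
        if match:
            events.append({
                "timestamp": float(match.group("ts")),  # seconds since boot
                "device": match.group("dev"),
                "xid": int(match.group("xid")),
                "message": match.group("msg"),
            })
    return events
```

For the excerpt above this yields a single event with Xid 79 at about 30.8 seconds after boot.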

I’d really appreciate any suggestions!
Thanks in advance!

nvidia-bug-report.log.gz (1.2 MB)

Based on a recommendation in another post in this forum [Link], I have tried the Liquorix kernel from here:
Liquorix Kernel (version 5.16.0-11.1 is currently the latest)
It has been stable for over 55 minutes now, which I think is a new record for me.

For anyone coming across this post, try that.

I will report back again here if it crashes.
