I have an AORUS 15P YD laptop with an RTX 3080 Laptop GPU (16GB), on which I reliably get a “GPU has fallen off the bus” error some time after booting (sometimes on the login screen, sometimes up to about 20 minutes after login) whenever the NVIDIA GPU is used for anything related to the X server or DRM.
First of all, I’m (mostly) ruling out a hardware defect, as Windows with the most recent NVIDIA drivers runs stably and I have had no issues there so far.
Ideally, this should run on Ubuntu 20.04 (which uses the 5.13 kernel), although I have also tried going up to Ubuntu 21.10 and Pop!_OS (which uses 5.15), as well as down to kernels 5.8 and 5.4.
The only change I make to a stock Ubuntu 20.04 install is installing the NVIDIA drivers.
I tried NVIDIA drivers 460, 470, 495, 510.47 and 510.54.
The nouveau driver seems (mostly?) stable, although I’ve not tested it enough to say so with confidence.
Using nouveau is not really an option for me, as this is my work laptop and I need both OpenGL and CUDA to run on the NVIDIA GPU.
I also tried every BIOS version available for the laptop. All of them work fine in Windows, but the problem persists on Linux.
I also tried disabling a bunch of power management features, such as D3 power management, pcie_port_pm and pcie_aspm, along with various ACPI-related options. My suspicion was that the card is being powered down or off while in use, which might cause the driver to crash.
I also noticed that the GPU’s audio device is bound to an Intel driver, which I found weird, but I don’t know whether that might cause problems:
    # output section from lspci -vvv after freeze
    01:00.0 VGA compatible controller: NVIDIA Corporation GA104M [GeForce RTX 3080 Mobile / Max-Q 8GB/16GB] (rev ff) (prog-if ff)
        !!! Unknown header type 7f
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

    01:00.1 Audio device: NVIDIA Corporation Device 228b (rev ff) (prog-if ff)
        !!! Unknown header type 7f
        Kernel driver in use: snd_hda_intel
        Kernel modules: snd_hda_intel
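For anyone checking for the same symptom: `rev ff` together with `Unknown header type 7f` means that reads from the device’s PCI config space return all ones, i.e. the device no longer responds on the bus. Below is a minimal Python sketch for spotting such devices in saved `lspci` output. The function name `unreachable_devices` is my own, and the first line of the sample is a made-up healthy device added only for contrast.

```python
import re

# Matches the PCI address at the start of an lspci device line,
# with or without a leading domain (e.g. "01:00.0" or "0000:01:00.0").
ADDR_RE = re.compile(r"^((?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s")

def unreachable_devices(lspci_text):
    """Return PCI addresses whose lspci device line reports '(rev ff)'."""
    found = []
    for line in lspci_text.splitlines():
        m = ADDR_RE.match(line)
        if m and "(rev ff)" in line:
            found.append(m.group(1))
    return found

# First line is a hypothetical healthy device, the rest is from the post.
sample = """\
00:00.0 Host bridge: Example healthy device (rev 01)
01:00.0 VGA compatible controller: NVIDIA Corporation GA104M [GeForce RTX 3080 Mobile / Max-Q 8GB/16GB] (rev ff) (prog-if ff)
\t!!! Unknown header type 7f
01:00.1 Audio device: NVIDIA Corporation Device 228b (rev ff) (prog-if ff)
\t!!! Unknown header type 7f
"""

print(unreachable_devices(sample))
```

Both functions of the card (VGA and audio) show up here, which suggests that the whole device dropped off the bus rather than just the display function.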
The only “workaround” I have found so far that keeps the NVIDIA GPU from falling off the bus is Compute mode from the system76-power utility, which writes the following lines to /etc/modprobe.d/system76-power.conf:
    # Automatically generated by system76-power
    blacklist i2c_nvidia_gpu
    blacklist nvidia-drm
    blacklist nvidia-modeset
    alias i2c_nvidia_gpu off
    alias nvidia-drm off
    alias nvidia-modeset off
    options nvidia NVreg_DynamicPowerManagement=0x02
    # Preserve video memory through suspend
    options nvidia NVreg_PreserveVideoMemoryAllocations=1
Interestingly, the GPU still falls off the bus in Hybrid mode, which writes the following to /etc/modprobe.d/system76-power.conf:

    # Automatically generated by system76-power
    blacklist i2c_nvidia_gpu
    alias i2c_nvidia_gpu off
    options nvidia NVreg_DynamicPowerManagement=0x02
    options nvidia-drm modeset=1
    # Preserve video memory through suspend
    options nvidia NVreg_PreserveVideoMemoryAllocations=1

I therefore suspect that either nvidia-drm or nvidia-modeset is the failing component here, although that is more of a guess than anything else.
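To make the difference between the two modes explicit, here is a small Python sketch that diffs the two configurations quoted above. The helper name `directives` is my own and not part of system76-power; the config text is copied from the two files.

```python
def directives(conf_text):
    """Return the set of non-empty, non-comment lines of a modprobe config."""
    return {
        line.strip()
        for line in conf_text.splitlines()
        if line.strip() and not line.strip().startswith("#")
    }

compute = """\
# Automatically generated by system76-power
blacklist i2c_nvidia_gpu
blacklist nvidia-drm
blacklist nvidia-modeset
alias i2c_nvidia_gpu off
alias nvidia-drm off
alias nvidia-modeset off
options nvidia NVreg_DynamicPowerManagement=0x02
# Preserve video memory through suspend
options nvidia NVreg_PreserveVideoMemoryAllocations=1
"""

hybrid = """\
# Automatically generated by system76-power
blacklist i2c_nvidia_gpu
alias i2c_nvidia_gpu off
options nvidia NVreg_DynamicPowerManagement=0x02
options nvidia-drm modeset=1
# Preserve video memory through suspend
options nvidia NVreg_PreserveVideoMemoryAllocations=1
"""

# Directives present in one mode but not the other.
only_compute = sorted(directives(compute) - directives(hybrid))
only_hybrid = sorted(directives(hybrid) - directives(compute))

print("only in Compute mode:", only_compute)
print("only in Hybrid mode: ", only_hybrid)
```

The only differences are the blacklist/alias entries for nvidia-drm and nvidia-modeset in Compute mode and the `options nvidia-drm modeset=1` line in Hybrid mode. The power management options are identical, which is why my suspicion falls on those two modules.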
I have attached one NVIDIA bug report log and can generate more if necessary. I don’t have a reliable way to trigger the freeze immediately, but I can make it happen within a reasonable timeframe.
Sorry for the wall of text. Here’s a TL;DR:
Error: GPU has fallen off the bus
Steps to reproduce:
- Install a fresh Ubuntu 20.04, Ubuntu 21.10 or Pop!_OS 21.10 on an AORUS 15P YD
- Install the NVIDIA drivers (e.g. sudo apt install nvidia-driver-510; the same happens with 460, 470 and 495)
- Wait for the crash
- The system freezes and becomes unresponsive, although SSH still works.
- Check dmesg, find:
... [ 22.079296] audit: type=1400 audit(1645793344.680:44): apparmor="DENIED" operation="open" profile="snap.snap-store.ubuntu-software" name="/etc/PackageKit/Vendor.conf" pid=2247 comm="snap-store" requested_mask="r" denied_mask="r" fsuid=1000 ouid=0 [ 30.848138] NVRM: GPU at PCI:0000:01:00: GPU-e5a2c765-97ab-76de-eaf8-021ea4ed93bc [ 30.848143] NVRM: Xid (PCI:0000:01:00): 79, pid=2835, GPU has fallen off the bus. [ 30.848146] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus. [ 30.848190] NVRM: GPU 0000:01:00.0: GPU serial number is \xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff. [ 40.884545] Asynchronous wait on fence NVIDIA:nvidia.prime:2e6 timed out (hint:intel_atomic_commit_ready [i915]) [ 100.671825] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c67d:0:0:0x0000000f ...
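Since SSH still works after the freeze, the log can also be checked remotely. Here is a minimal Python sketch, assuming the dmesg text has already been captured to a string, that pulls out the Xid events. The names `XID_RE` and `find_xid_events` are my own, and the sample line is copied from the log above.

```python
import re

# Matches lines such as:
# [ 30.848143] NVRM: Xid (PCI:0000:01:00): 79, pid=2835, GPU has fallen off the bus.
XID_RE = re.compile(
    r"\[\s*(\d+\.\d+)\] NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+),"
)

def find_xid_events(dmesg_text):
    """Return (timestamp_seconds, pci_address, xid_code) for each Xid line."""
    return [
        (float(m.group(1)), m.group(2), int(m.group(3)))
        for m in XID_RE.finditer(dmesg_text)
    ]

sample = (
    "[ 30.848138] NVRM: GPU at PCI:0000:01:00: GPU-e5a2c765-97ab-76de-eaf8-021ea4ed93bc\n"
    "[ 30.848143] NVRM: Xid (PCI:0000:01:00): 79, pid=2835, GPU has fallen off the bus.\n"
    "[ 30.848146] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.\n"
)

for ts, addr, code in find_xid_events(sample):
    print(f"{ts:.6f}s  PCI {addr}  Xid {code}")
```

In my log the only Xid reported is 79, about 30 seconds after boot, and the nvidia-modeset error follows roughly 70 seconds later.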
I’d really appreciate any suggestions!
Thanks in advance!
nvidia-bug-report.log.gz (1.2 MB)