NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus - HP Studio G5

vaslion13/Reishook, it’s kind of odd that the c-states control BIOS setting had any effect. Those settings should normally be ignored by the intel_idle driver and only used by the older acpi_idle driver. C-states can be controlled by kernel parameters, e.g.

intel_idle.max_cstate=1

Limits the c-state usage to C0 and C1.
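
For anyone unsure how to set this parameter persistently, here is a minimal sketch. It assumes a GRUB-based distribution that keeps its settings in /etc/default/grub; the function name add_kernel_param is made up for illustration. It only prints the modified file so the result can be reviewed before anything is overwritten.

```shell
# Hypothetical helper: print a copy of a GRUB defaults file with one kernel
# parameter appended to GRUB_CMDLINE_LINUX_DEFAULT. The file itself is not
# modified. Assumes the variable is written on one line in double quotes.
add_kernel_param() {
    param="$1"
    file="${2:-/etc/default/grub}"
    line=$(grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file")
    case "$line" in
        *"$param"*)
            # Simple substring check: parameter already present, print unchanged.
            cat "$file"
            ;;
        *)
            # Insert the parameter just before the closing double quote.
            sed "s|^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"|\1 ${param}\"|" "$file"
            ;;
    esac
}
```

Typical use would be `add_kernel_param intel_idle.max_cstate=1 > /tmp/grub.new`, then reviewing the file, copying it into place as root, and regenerating the GRUB configuration with the distribution's own tool (for example update-grub on Debian/Ubuntu). After a reboot, `cat /proc/cmdline` shows whether the parameter was actually passed to the kernel.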
maurky, your issue looks like a suspend/resume issue, which is completely different. Please try the kernel parameters
acpi_osi=! acpi_osi="Windows 2009"
which work around some suspend/resume issues with the nvidia gpu. If that does not help, please open a new thread.

I tried the acpi_osi parameters, but nothing changed.

I’ve opened a new thread as requested:

Black screen and crash on Debian 10.1, NVd: 418.74, "13, Graphics Exception ESR" (GPU has fallen off the bus) - Linux - NVIDIA Developer Forums

It seems like I have the same problem as vaslion13.
HP studio G5 with Quadro p1000.

Several linux distributions with graphics freeze, regardless of kernel version or nvidia driver version.
I did not get around to disabling the discrete graphics card until I found this post.
I can add that the system is still responding to remote access by ssh when the graphics has frozen.
I am now trying the suggested kernel parameter “intel_idle.max_cstate=1” at startup and have so far not experienced any freeze after a few hours of testing. Before adding this kernel parameter, the freeze normally came within 5 minutes.

I have had the discrete card disabled for some time now and bought a usb-c to hdmi cable in order to have a second monitor. I have not tried the “intel_idle.max_cstate=1” parameter.

Jontis, can you please report back whether using the parameter “intel_idle.max_cstate=1” resolves the issue for you?

As a note, the i7-8750H supports c-states 0,1,3,6,7,8,9,10
If limiting c-states helps with the gpu falling off the bus, it would be interesting to know from which c-state onward this happens.
Ultimately, this would point to a bug in either kernel, cpu or bios.

The system has been running stable now for more than a day and I have so far not seen the freeze since including the kernel parameter “intel_idle.max_cstate=1”.
It had never run for more than 15 minutes without that parameter.

Will try to experiment with higher c-states later to see if I can pinpoint when the problems come.
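
One way to run that experiment without rebooting for every value is the kernel's cpuidle sysfs interface. This is only a sketch: the function name list_cstates is made up, and it assumes the usual layout under /sys/devices/system/cpu/cpuN/cpuidle, where each stateN directory has a name file and a disable file.

```shell
# Hypothetical helper: print "<index> <name> disabled=<flag>" for every idle
# state of one CPU. The directory can be overridden for testing; by default
# the states of cpu0 are listed.
list_cstates() {
    base="${1:-/sys/devices/system/cpu/cpu0/cpuidle}"
    for dir in "$base"/state*; do
        # Skip the unexpanded glob if no state directories exist.
        [ -d "$dir" ] || continue
        printf '%s %s disabled=%s\n' \
            "${dir##*/state}" "$(cat "$dir/name")" "$(cat "$dir/disable")"
    done
}
```

Writing 1 into a state's disable file as root turns that state off at runtime for that CPU, so disabling the deepest states one at a time on all CPUs and waiting for the freeze could narrow down where the problem starts. The change is not persistent across reboots.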

Following, I have exactly the same setup (zbook 15 studio g5 + quadro p1000) and the same problem. I can only run Linux using the onboard graphics card, using the Nvidia card it also crashes within a few minutes…

I also tried different distributions (Arch, Ubuntu and Fedora), but it made no difference. It looks like an issue with this model…

Reishook, since you discovered the workaround on a different system, would you mind providing an nvidia-bug-report.log so similarities between both systems can be found?

Hello.
I have this hardware:

System:    Host: DellBell Kernel: 5.3.5-1-default x86_64 bits: 64 compiler: gcc v: 9.2.1 Desktop: KDE Plasma 5.17.0 
           tk: Qt 5.13.1 wm: kwin_x11 dm: SDDM Distro: openSUSE Tumbleweed 20191016 
Machine:   Type: Laptop System: Dell product: G3 3779 v: N/A serial: <root required> Chassis: type: 10 
           serial: <root required> 
           Mobo: Dell model: 04R93M v: A00 serial: <root required> UEFI: Dell v: 1.4.0 date: 09/05/2018 
CPU:       Topology: 6-Core model: Intel Core i7-8750H bits: 64 type: MT MCP arch: Kaby Lake rev: A L2 cache: 9216 KiB 
           flags: lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx bogomips: 52799 
           Speed: 1119 MHz min/max: 800/4100 MHz Core speeds (MHz): 1: 872 2: 897 3: 867 4: 819 5: 884 6: 858 7: 897 
           8: 819 9: 844 10: 895 11: 912 12: 886 
Graphics:  Device-1: Intel UHD Graphics 630 vendor: Dell driver: i915 v: kernel bus ID: 00:02.0 chip ID: 8086:3e9b 
           Device-2: NVIDIA GP106M [GeForce GTX 1060 Mobile] vendor: Dell driver: nvidia v: 435.21 bus ID: 01:00.0 
           chip ID: 10de:1c20 
           Display: x11 server: X.Org 1.20.5 driver: modesetting,nvidia alternate: intel compositor: kwin_x11 
           resolution: 1920x1080~60Hz 
           OpenGL: renderer: GeForce GTX 1060 with Max-Q Design/PCIe/SSE2 v: 4.6.0 NVIDIA 435.21 direct render: Yes

I have turned on c-states in the BIOS to test the intel_idle.max_cstate parameter.
I tried it with values from 1 to 10, and with all of these values my nvidia card did not fall off the bus. I also tried booting without the intel_idle parameter, and my nvidia card did not fall off the bus then either.
So I can say that with the current kernel and nvidia drivers the problem is resolved on my notebook, perhaps by a kernel update.
Nevertheless, I set intel_idle.max_cstate=6 and am now testing it. With higher values I got periodic problems with OpenGL in KDE after a cold boot, but that is not an nvidia issue; I get these problems on the intel card too.

Thank you, Reishook.
Since you have the same i7-8750H cpu, there seems to be something wrong with that model. It’s rather odd that it fixed itself for you, though: there haven’t been any functional changes to the intel_idle driver for a long time, and your BIOS date is 09/05/2018, so it probably has not been updated in the meantime. Maybe it was changes in the pcie driver. Do you remember when you first discovered the workaround?

Yes, I discovered it around 29 September, after reading this topic and the message saying that the card does not fall off the bus while the battery is charging. At that time I had kernel 5.2.14 and nvidia-435.21, and with c-states enabled my card did not work for more than 5-10 minutes. Before that I used the intel+nouveau drivers because the nvidia driver did not work for me. One more trick, which I got from the Ubuntu pre-installed on my notebook, is the acpi_osi=Linux-Dell-Video kernel parameter.
I only discovered yesterday that my system can work with c-states. ;-)

I have noticed that it’s not freezing for me whenever my laptop is connected to the charger, so it looks like a power issue with the driver.

I discovered this after installing Ubuntu 19.04 using the recommended nvidia drivers. (Didn’t work on older versions for me)

FWIW, another datapoint:

HP ZBook Studio G5
Quadro P1000 Mobile
NVIDIA-Linux-x86_64-430.50

Laptop does not hang when plugged into docking station (which connects two external monitors).
Laptop will ultimately hang when not on the docking station, even when plugged into a regular external power source. There appears to be some difference in when exactly, but really it’s random. Working with an external monitor or power supply seems to help. Continuing to work on the laptop really seems to help (but not always).

Usually it will hang after leaving the laptop idle for a few minutes.
Afaict there is no suspending when plugged into a power source, and it hangs before even blanking the screen, so I doubt this is a suspend problem. Also I’ve tried running with

xset s off
xset -dpms
xset s noblank

with no noticeable difference.

okt 21 18:34:29 ltcmc2019-2 kernel: NVRM: GPU at PCI:0000:01:00: GPU-34790a52-7e95-3466-3b05-0861e2979698
okt 21 18:34:29 ltcmc2019-2 kernel: NVRM: Xid (PCI:0000:01:00): 79, pid=1473, GPU has fallen off the bus.
okt 21 18:34:29 ltcmc2019-2 kernel: NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
okt 21 18:34:29 ltcmc2019-2 kernel: NVRM: A GPU crash dump has been created. If possible, please run
                                    NVRM: nvidia-bug-report.sh as root to collect this data before
                                    NVRM: the NVIDIA kernel module is unloaded.
okt 21 18:34:40 ltcmc2019-2 kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000927c:0:0:0x0000000f
okt 21 18:34:40 ltcmc2019-2 kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000927c:1:0:0x0000000f
okt 21 18:34:40 ltcmc2019-2 kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000927c:0:0:0x0000000f
okt 21 18:34:40 ltcmc2019-2 kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000927c:1:0:0x0000000f
okt 21 18:34:40 ltcmc2019-2 kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000927c:0:0:0x0000000f
okt 21 18:34:40 ltcmc2019-2 kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000927c:1:0:0x0000000f
okt 21 18:34:40 ltcmc2019-2 kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000927c:0:0:0x0000000f
okt 21 18:34:40 ltcmc2019-2 kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000927c:1:0:0x0000000f
okt 21 18:34:43 ltcmc2019-2 /usr/libexec/gdm-x-session[1973]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x02c7, 0x0000f894, 0x0000018c)
okt 21 18:34:50 ltcmc2019-2 /usr/libexec/gdm-x-session[1973]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x02c7, 0x0000f894, 0x0000018c)
okt 21 18:34:53 ltcmc2019-2 /usr/libexec/gdm-x-session[1973]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x02c4, 0x0000f894, 0x0000018c)
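
To compare logs like the one above across machines, the Xid events can be pulled out of the kernel log. This is a rough sketch; the function name extract_xids is made up, and the pattern is based only on the "NVRM: Xid (PCI:...): <code>," line format shown in this thread.

```shell
# Hypothetical helper: read kernel log lines on stdin and print
# "<pci-address> <xid-code>" for every NVRM Xid event found.
extract_xids() {
    sed -n 's/.*NVRM: Xid (PCI:\([0-9a-fA-F:.]*\)): \([0-9][0-9]*\),.*/\1 \2/p'
}
```

It can be fed from the journal or the kernel ring buffer, for example `journalctl -k | extract_xids` or `dmesg | extract_xids`. Lines without an Xid event are dropped.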

$ lspci | grep -Ei "vga|3d"
01:00.0 VGA compatible controller: NVIDIA Corporation GP107GLM [Quadro P1000 Mobile] (rev ff)

nvidia-smi showed no sign of overheating.

System was still usable over ssh, so I ran nvidia-bug-report.sh before rebooting.
Output here: https://…/nvidia-bug-report.log.gz (no longer online)

Rebooting did not work. SSH connection was dropped, but the display didn’t change, so the shutdown process hung too.

Did you check whether using intel_idle.max_cstate=1 fixes the situation? A problem with the power supply would occur under high load; the fact that the GPU falls off the bus when idle points to a problem with c-states on that model. The BIOS sets different c-state levels depending on power source/state.

Ah yes, that does seem to do the trick.
Thanks!

I’ve been fighting with this issue for quite a while: “GPU has fallen off the bus” after a few minutes while on battery power.
It seems that I’ve fixed the problem for myself.
The last thing that I’ve changed was to add the “pcie_aspm=off” grub option and after that change it seems to work on battery pretty well. Btw, my system does not have any ASPM setting in BIOS, I guess it’s enabled by default.
Maybe the fix has something to do with a combination of other things that I changed while trying to resolve this problem. So here is a list of other changes that I made:

  • added this to /etc/modprobe.d/nvidia.conf: options nvidia "NVreg_DynamicPowerManagement=0x02" (and ran update-initramfs -u)
  • blacklisted nouveau: "lsmod|grep nouv" does not show anything
  • enabled modeset by making this change in /etc/modprobe.d/nvidia-drm-nomodeset.conf: options nvidia-drm modeset=1
  • enabled persistence mode with nvidia-smi -i -pm ENABLED
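
To check whether ASPM is actually off on the GPU's PCIe link after such changes, the LnkCtl line in the verbose lspci output can be inspected. This is only a sketch: the function name aspm_state is made up, and it assumes lspci prints the link control register in the common "LnkCtl: ASPM <state>; ..." form.

```shell
# Hypothetical helper: read `lspci -vv` output on stdin and print the ASPM
# state from each LnkCtl line, e.g. "Disabled" or "L1 Enabled".
# LnkCap lines (which list supported ASPM modes) are deliberately not matched.
aspm_state() {
    sed -n 's/.*LnkCtl:[[:space:]]*ASPM \([^;]*\);.*/\1/p'
}
```

For the card in this thread that would be something like `sudo lspci -vv -s 01:00.0 | aspm_state`; root is needed because lspci hides the capability details from unprivileged users.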

Some info about my configuration:
I’m using the Dell G3 with this card:
01:00.0 3D controller: NVIDIA Corporation GP107M [GeForce GTX 1050 Ti Mobile] (rev a1)
I’m running Kali-Rolling kernel 5.3.0-kali3-amd64 with Nvidia Driver Version 440.44

Hi,
I have the same problem… I have a Lenovo P1 Gen 1 with a P1000. I tried Ubuntu, Debian and now Manjaro, and it is still the same. Sometimes everything is OK for a couple of hours, sometimes the gpu crashes after 5-10 minutes…
On Windows everything is OK; on Linux nothing helps.

Please open a new thread, run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post. You will have to rename the file ending to something else since the forum software doesn’t accept .gz files (nifty!).

I’ve been fighting the same issue too. For me pcie_aspm=off didn’t work, but I found a solution that works on my computer by disabling ASPM for the GPU alone. I explained it in this reply. Maybe it can help you.