Ubuntu 18.04 freeze: GPU has fallen off the bus

Configuration:
Dell Precision 7530
Ubuntu 18.04
Kernel version: 5.0.0-36-generic
$ nvidia-smi
Tue Nov 19 10:05:23 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50       Driver Version: 430.50       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro P2000        Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   52C    P8    N/A /  N/A |    273MiB /  4040MiB |      8%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1244      G   /usr/lib/xorg/Xorg                           144MiB |
|    0      2069      G   /usr/bin/gnome-shell                         124MiB |
|    0      3300      G   /usr/lib/firefox/firefox                       1MiB |
+-----------------------------------------------------------------------------+

For several months my computer has been freezing completely (no way to open a terminal). The only thing that still responds is the keyboard backlight shortcut (the similar shortcut for screen brightness does nothing). Every time, the last lines of syslog before the freeze look like this:

Nov 19 09:33:59 atlas kernel: [ 2925.348909] NVRM: GPU at PCI:0000:01:00: GPU-a729e359-5a4d-43ef-e1a3-07447144413f
Nov 19 09:33:59 atlas kernel: [ 2925.348926] NVRM: Xid (PCI:0000:01:00): 79, pid=1473, GPU has fallen off the bus.
Nov 19 09:33:59 atlas kernel: [ 2925.348927] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
Nov 19 09:33:59 atlas kernel: [ 2925.348933] NVRM: A GPU crash dump has been created. If possible, please run
Nov 19 09:33:59 atlas kernel: [ 2925.348933] NVRM: nvidia-bug-report.sh as root to collect this data before
Nov 19 09:33:59 atlas kernel: [ 2925.348933] NVRM: the NVIDIA kernel module is unloaded.
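To compare these events with other logs, the Xid lines can be pulled out of syslog programmatically. What follows is only a minimal Python sketch, assuming the line layout shown in the excerpt above and the default Ubuntu log path /var/log/syslog. The names XidEvent, parse_xid and xid_events are made up for illustration.

```python
import re
from typing import Iterator, NamedTuple, Optional

# Matches NVRM Xid lines in the layout of the syslog excerpt above, e.g.
# "Nov 19 09:33:59 atlas kernel: [ 2925.348926] NVRM: Xid (PCI:0000:01:00): 79, pid=1473, ..."
XID_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+\S+\s+kernel:\s+"
    r"\[\s*(?P<uptime>[\d.]+)\]\s+NVRM: Xid \((?P<pci>[^)]+)\): "
    r"(?P<xid>\d+),\s*(?P<detail>.*)$"
)


class XidEvent(NamedTuple):
    stamp: str       # syslog timestamp, e.g. "Nov 19 09:33:59"
    uptime_s: float  # seconds since boot, from the bracketed kernel timestamp
    pci: str         # PCI address as printed by the driver
    xid: int         # Xid error number (79 in this thread)
    detail: str      # remainder of the message


def parse_xid(line: str) -> Optional[XidEvent]:
    """Return an XidEvent for an NVRM Xid line, or None for any other line."""
    m = XID_RE.match(line.strip())
    if m is None:
        return None
    return XidEvent(m["stamp"], float(m["uptime"]), m["pci"], int(m["xid"]), m["detail"])


def xid_events(path: str = "/var/log/syslog") -> Iterator[XidEvent]:
    """Yield every Xid event found in the given log file (path is an assumption)."""
    with open(path, errors="replace") as fh:
        for line in fh:
            event = parse_xid(line)
            if event is not None:
                yield event
```

Looping over xid_events() and printing stamp and uptime_s gives one line per crash, which can then be matched by time against any other log. Reading /var/log/syslog may require membership in the adm group or root.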

I have not been able to link these random freezes to anything specific.

I have seen that other people have similar issues when PowerMizer changes state. How can I log those state changes so that I can compare them with syslog?

The output of the last run of nvidia-bug-report.sh is attached.
nvidia-bug-report.log.gz (1.03 MB)

XID 79 on a notebook points to a hardware failure rather than a driver problem. You could monitor temperatures in case the cooling is clogged; otherwise, try installing Windows and see if you can replicate the issue, then RMA the machine if possible.

Hi generix,
I doubt it is hardware related, because

  • another person on my team, with the same machine acquired two months apart from mine, has the same issue.
  • the two machines are new (<1 year)
  • the bug usually happens within an hour of starting or waking up the computer

Anyway, how can I monitor temperatures for the graphics card?

I can’t install Windows (no licence). What is an RMA?

RMA = Return Merchandise Authorization, i.e. getting it replaced if it is still under warranty.
There’s one firmware/CPU bug I know of, mainly affecting a specific HP notebook; you could check whether it applies to your hardware, too:
Please set the kernel parameter
intel_idle.max_cstate=1
and check if the gpu still falls off the bus.
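On Ubuntu, a kernel parameter like this is normally made permanent by adding it to the GRUB_CMDLINE_LINUX_DEFAULT line in /etc/default/grub, then running sudo update-grub and rebooting. The Python sketch below only illustrates the edit and the check afterwards. It assumes the stock "quiet splash" default line, and the helper names add_kernel_param and param_is_active are hypothetical.

```python
import re


def add_kernel_param(grub_line: str, param: str) -> str:
    """Return the GRUB_CMDLINE_LINUX_DEFAULT line with `param` set exactly once."""
    m = re.match(r'^(GRUB_CMDLINE_LINUX_DEFAULT=)"(.*)"\s*$', grub_line.strip())
    if m is None:
        raise ValueError("not a GRUB_CMDLINE_LINUX_DEFAULT line")
    key = param.split("=", 1)[0]
    # Drop any earlier setting of the same key so the new value wins.
    args = [a for a in m.group(2).split() if a.split("=", 1)[0] != key]
    args.append(param)
    joined = " ".join(args)
    return f'{m.group(1)}"{joined}"'


def param_is_active(cmdline: str, param: str) -> bool:
    """Check whether `param` appears on a kernel command line (text of /proc/cmdline)."""
    return param in cmdline.split()


# Assumed stock Ubuntu default line; the real file may differ.
print(add_kernel_param('GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"',
                       "intel_idle.max_cstate=1"))
```

After the reboot, passing the contents of /proc/cmdline to param_is_active confirms that the parameter really reached the kernel.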
You can monitor temperatures either using nvidia-smi or nvidia-settings.
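With nvidia-smi, the temperature and performance state can be polled and written to a file, so that the last entries before a freeze survive the reboot. This is a rough Python sketch rather than a finished tool: it assumes the query fields timestamp, pstate, temperature.gpu and clocks.sm are supported on this GPU, and the function names and the log file name gpu_state.log are invented for the example.

```python
import os
import subprocess
import time
from typing import Dict, List, Optional

# Query fields passed to nvidia-smi (assumed to be supported on this GPU).
FIELDS: List[str] = ["timestamp", "pstate", "temperature.gpu", "clocks.sm"]


def smi_query_command(fields: List[str] = FIELDS) -> List[str]:
    """Build the nvidia-smi command line for a single CSV sample."""
    return ["nvidia-smi", "--query-gpu=" + ",".join(fields), "--format=csv,noheader"]


def parse_row(row: str, fields: List[str] = FIELDS) -> Dict[str, str]:
    """Split one CSV row from nvidia-smi into a field -> value mapping."""
    values = [v.strip() for v in row.strip().split(",")]
    return dict(zip(fields, values))


def log_gpu_state(interval_s: float = 5.0,
                  log_path: str = "gpu_state.log",
                  samples: Optional[int] = None) -> None:
    """Append one sample per interval; samples=None means run until interrupted."""
    taken = 0
    with open(log_path, "a") as log:
        while samples is None or taken < samples:
            result = subprocess.run(smi_query_command(), capture_output=True, text=True)
            line = result.stdout.strip() or "nvidia-smi failed: " + result.stderr.strip()
            log.write(line + "\n")
            # Force the line to disk: after a hard freeze, buffered data would be lost.
            log.flush()
            os.fsync(log.fileno())
            taken += 1
            time.sleep(interval_s)
```

Running log_gpu_state() in a terminal and, after the next freeze, comparing the tail of gpu_state.log with the syslog timestamps shows which performance state and temperature the GPU was in just before it disappeared.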

Ok, I’m trying it. I’ll report back the next time it freezes…

Looking for a BIOS update is also worthwhile. Another approach would be to disable “c-state control” in the BIOS.

I had already updated to the latest BIOS, and unfortunately I have already had a freeze with the intel_idle.max_cstate=1 kernel parameter set.
I’m now trying to change the GPU settings, and am currently testing whether enabling Persistence Mode changes anything.

Unfortunately, enabling Persistence Mode didn’t prevent the freeze.

One more freeze this morning after waking from suspend with a low battery: it froze within 2 minutes of waking up, just like on Monday, when the battery was also relatively low. That might be relevant.

I am now trying to set the PowerMizer mode to 1 (Prefer Maximum Performance) by adding
/usr/bin/nvidia-settings -a "[gpu:0]/GpuPowerMizerMode=1"
as a startup application as described in https://rastating.github.io/how-to-permanently-set-nvidia-powermizer-settings-in-ubuntu/
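To check that the startup entry actually took effect, the attribute can be read back with nvidia-settings in query mode. The Python sketch below assumes that nvidia-settings is run inside the X session, that -t prints just the bare value, and that the attribute name GpuPowerMizerMode from the command above can also be queried. The helper names are hypothetical.

```python
import subprocess
from typing import List, Optional


def settings_query_command(attribute: str, gpu: int = 0) -> List[str]:
    """Build an nvidia-settings query (-q) in terse mode (-t) for one GPU attribute."""
    return ["nvidia-settings", "-t", "-q", f"[gpu:{gpu}]/{attribute}"]


def parse_terse(output: str) -> Optional[int]:
    """Convert terse nvidia-settings output to an int, or None if it is not a number."""
    text = output.strip()
    return int(text) if text.isdigit() else None


def read_attribute(attribute: str, gpu: int = 0) -> Optional[int]:
    """Query one integer attribute; returns None when the query yields no number."""
    result = subprocess.run(settings_query_command(attribute, gpu),
                            capture_output=True, text=True)
    return parse_terse(result.stdout)
```

A result of 1 from read_attribute("GpuPowerMizerMode") would indicate that Prefer Maximum Performance is active; any other value suggests the startup command did not run or was overridden.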

I can’t find any NVIDIA-specific logs that would let me check for state changes related to the bug. Are there any that can be activated?

All NVIDIA driver messages are in dmesg. That won’t help you, though, since XID 79 is a low-level hardware issue. The GPU either shut down due to overheating or power issues, or got detached from the bus due to PCIe bus problems. There is nothing the driver can do except report that the GPU is gone. The GPU obviously can’t tell you why either, because it can no longer communicate with the driver.

In any case, completely deactivating the NVIDIA card seems to prevent the bug entirely:
nvidia-settings > PRIME Profiles > Select the GPU you would like to use > Intel (Power Saving Mode)

If you turn the GPU off, of course that prevents it from turning itself off unexpectedly.
Did you try disabling “c-states control” in the BIOS?

I haven’t tried disabling “c-states control” in the BIOS yet, but setting the driver to maximum performance, as described in my post of 11/20/2019 07:34 AM, has worked flawlessly … although it has been draining the battery at an incredible rate!
I’ll try the “c-states control” trick.