Edit 2017-01-10
I have, by running in hybrid mode, collected a proper log. I will attach it after this edit. At this point I just want confirmation that this is, indeed, a problem with the device itself (rather than a software problem) and that I should RMA the laptop.
Discrete mode bug report caught
Taking the cue from [url]https://devtalk.nvidia.com/default/topic/985037/linux/gtx-1070-quot-gpu-has-fallen-off-the-bus-quot-running-3d-games-in-arch-linux-/[/url] I’ve SSHd into the machine and caught a bug report, labeled here as 2017-01-10-discrete-nvidia-bug-report.log.gz.
== ORIGINAL MESSAGE ==
I am running the laptop in discrete mode only, UEFI boot mode. When the laptop is plugged in, and I run a graphically intensive application – such as a game, or just having a lot of browser windows open – the screen goes black, the fans kick up to their highest level, and the machine is unresponsive. Any music that was playing before the freeze continues playing, but no keystroke is registered.
Errors from journalctl -b -1 -t kernel show the following:
Dec 26 18:20:54 phenexa kernel: NVRM: GPU at PCI:0000:01:00: GPU-f7733d99-5bd6-40e8-3a87-98c81f45fb3e
Dec 26 18:20:54 phenexa kernel: NVRM: GPU Board Serial Number:
Dec 26 18:20:54 phenexa kernel: NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
Dec 26 18:20:54 phenexa kernel: NVRM: GPU at 0000:01:00.0 has fallen off the bus.
Dec 26 18:20:54 phenexa kernel: NVRM: GPU is on Board .
Dec 26 18:20:54 phenexa kernel: NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
Dec 26 18:20:58 phenexa kernel: NVRM: RmInitAdapter failed! (0x12:0x45:1819)
Dec 26 18:20:58 phenexa kernel: NVRM: rm_init_adapter failed for device bearing minor number 0
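For anyone who wants to pull just the Xid events out of a long kernel log, here is a minimal sketch in Python. It assumes the default short journalctl line format shown above (timestamp, hostname, "kernel:", message). The function name find_xid_events and the regular expression are my own illustration, not part of any NVIDIA tool.

```python
import re

# Matches lines in the format seen in the log above, e.g.
#   "Dec 26 18:20:54 phenexa kernel: NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus."
# The pattern is an assumption based on those lines only.
XID_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]+)\s+\S+\s+kernel:\s+"
    r"NVRM: Xid \((?P<device>[^)]+)\): (?P<code>\d+), (?P<message>.*)$"
)


def find_xid_events(lines):
    """Return a (timestamp, device, xid_code, message) tuple for each Xid line."""
    events = []
    for line in lines:
        match = XID_RE.match(line.strip())
        if match:
            events.append(
                (match["stamp"], match["device"], int(match["code"]), match["message"])
            )
    return events


# Two lines copied from the log above; only the second is an Xid event.
sample = [
    "Dec 26 18:20:54 phenexa kernel: NVRM: GPU Board Serial Number:",
    "Dec 26 18:20:54 phenexa kernel: NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.",
]
for event in find_xid_events(sample):
    print(event)
```

To run it against a real boot, the lines could be read from the output of journalctl -b -1 -t kernel instead of the sample list.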
Additional notes:
- The thermals before a crash are all nominal (~52C).
- The crash occasionally does not produce the “fallen off the bus” log messages.
- For a time, the crash was correlated with many ACPI warnings (argument #4 mismatches), but subsequent longer runs (unplugged) have since de-correlated them for me.
Unfortunately, I cannot run the bug report tool at that time, due to the system being unresponsive. I will attach a non-crashed version once I figure out how to do so here.
As a workaround: I have not yet experienced the crash while the laptop is unplugged. If I start Factorio while unplugged and play for a while first, I can then plug the laptop back in without a crash (at all, as far as my testing has gone). Skyrim (through WINE) is a different story, though, and will crash almost as soon as I plug the laptop back in after starting it.
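Since the crash seems tied to whether the laptop is plugged in, it may help to record the AC adapter state alongside the kernel log. Below is a rough sketch that reads the Linux power_supply sysfs class. It assumes the adapter shows up under /sys/class/power_supply with type "Mains" and an "online" file, which is the usual layout but may differ on a given machine. The function name mains_online is hypothetical.

```python
from pathlib import Path


def mains_online(power_supply_dir="/sys/class/power_supply"):
    """Report AC adapter state from sysfs.

    Returns True if any supply of type "Mains" reports online,
    False if mains supplies exist but none is online,
    and None if no mains supply (or no such directory) is found.
    """
    base = Path(power_supply_dir)
    if not base.is_dir():
        return None

    found_mains = False
    for supply in base.iterdir():
        type_file = supply / "type"
        online_file = supply / "online"
        if not (type_file.is_file() and online_file.is_file()):
            continue
        if type_file.read_text().strip() != "Mains":
            continue  # skip batteries and other supply types
        found_mains = True
        if online_file.read_text().strip() == "1":
            return True
    return False if found_mains else None


print("AC adapter online:", mains_online())
```

Calling this every few seconds and writing the result with a timestamp to a file would give a record to compare against the time of the last log line before a freeze.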
The problem has grown progressively more frequent over the ~1 month I’ve had the laptop, so I am not ruling out a hardware issue, but I am not yet convinced enough of that to perform an RMA. I have tried the nouveau driver, just to see if I could reproduce it with that, but I’m not sure it is capable of producing a load great enough to cause the issue.
Things I have tried:
- Kernel parameter NVreg_Mobile=3
- Kernel parameter acpi_irq_nobalance (many ACPI warnings usually accompanied the drop, though more recent long runs have not correlated the two)
- Kernel parameter NVreg_RegisterForACPIEvents=0
- Installing intel-ucode into the boot sequence
- Different combinations of the above.
2017-01-10-nvidia-bug-report.log.gz (196 KB)