GTX 1070M on Clevo P650RS (Sager NP8153-S) Falling off the bus

Edit 2017-01-10

I have, by running in hybrid mode, collected a proper log. I will attach it after this edit. At this point I just want confirmation that this is, indeed, a problem with the device itself (rather than a software problem) and that I should RMA the laptop.

Discrete mode bug report caught

Taking the cue from https://devtalk.nvidia.com/default/topic/985037/linux/gtx-1070-quot-gpu-has-fallen-off-the-bus-quot-running-3d-games-in-arch-linux-/ I’ve SSHd into the machine and caught a bug report, labeled here as 2017-01-10-discrete-nvidia-bug-report.log.gz.

== ORIGINAL MESSAGE ==
I am running the laptop in discrete mode only, UEFI boot mode. When the laptop is plugged in, and I run a graphically intensive application – such as a game, or just having a lot of browser windows open – the screen goes black, the fans kick up to their highest level, and the machine is unresponsive. Any music that was playing before the freeze continues playing, but no keystroke is registered.

Errors from journalctl -b -1 -t kernel show the following:

Dec 26 18:20:54 phenexa kernel: NVRM: GPU at PCI:0000:01:00: GPU-f7733d99-5bd6-40e8-3a87-98c81f45fb3e
Dec 26 18:20:54 phenexa kernel: NVRM: GPU Board Serial Number:
Dec 26 18:20:54 phenexa kernel: NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
Dec 26 18:20:54 phenexa kernel: NVRM: GPU at 0000:01:00.0 has fallen off the bus.
Dec 26 18:20:54 phenexa kernel: NVRM: GPU is on Board .
Dec 26 18:20:54 phenexa kernel: NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
Dec 26 18:20:58 phenexa kernel: NVRM: RmInitAdapter failed! (0x12:0x45:1819)
Dec 26 18:20:58 phenexa kernel: NVRM: rm_init_adapter failed for device bearing minor number 0
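
For anyone comparing logs, here is a minimal sketch that pulls just the Xid events out of kernel output like the above. The function name xid_events is made up for illustration, and the pattern assumes the NVRM line format shown in this post; other driver versions may word the line differently.

```shell
#!/bin/sh
# xid_events (hypothetical helper): read kernel log lines on stdin and
# print "<PCI address> <Xid code>" for every NVRM Xid line found.
xid_events() {
    sed -n 's/.*NVRM: Xid (PCI:\([0-9a-fA-F:.]*\)): \([0-9]*\),.*/\1 \2/p'
}

# Example use (previous boot, counting events per code):
#   journalctl -b -1 -t kernel | xid_events | sort | uniq -c
```

Counting events per code across several boots should make it easier to tell whether every crash is Xid 79 or whether other codes show up first.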

Additional notes:

  • The thermals before a crash are all nominal (~52C).
  • The crash occasionally does not produce the “fallen off the bus” log messages.
  • For a time, the crash was correlated with many ACPI warnings (argument #4 mismatches), but subsequent longer runs (unplugged) have since de-correlated the two for me.

Unfortunately, I cannot run the bug report tool at that time, due to the system being unresponsive. I will attach a non-crashed version once I figure out how to do so here.

As a workaround: with the laptop unplugged, I have not yet experienced the crash. If I start Factorio while unplugged and play for a while first, I can then plug the laptop back in without the crash happening (at all, as far as my testing has gone). Skyrim (through Wine) is a different story, though, and crashes almost as soon as I plug the laptop back in after starting it.

The problem has grown progressively more frequent over the ~1 month I’ve had the laptop, so I am not ruling out a hardware issue, but I am not convinced enough of that to go through an RMA. I have tried the nouveau driver, just to see if I could reproduce the crash with it, but I’m not sure it is capable of producing a load great enough to trigger the issue.

Things I have tried:

  • Kernel parameter NVreg_Mobile=3
  • Kernel parameter acpi_irq_nobalance (many ACPI warnings usually accompanied the drop, but more recent long runs have not correlated the two)
  • Kernel parameter NVreg_RegisterForACPIEvents=0
  • Installing intel-ucode into the boot sequence
  • Different combinations of the above.
    2017-01-10-nvidia-bug-report.log.gz (196 KB)

2017-01-10-discrete-nvidia-bug-report.log.gz (260 KB)
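
A small sketch for double-checking that the parameters listed above actually reached the kernel command line. has_param is a hypothetical helper, and the nvidia. prefix in the usage comment is an assumption about how the module options were spelled on this system (NVreg_* options belong to the nvidia module, so on the command line they are normally written with that prefix).

```shell
#!/bin/sh
# has_param CMDLINE NAME (hypothetical helper): succeed if NAME, or NAME=value,
# appears as a whitespace-separated token in CMDLINE.
has_param() {
    # $1 is left unquoted on purpose so the shell splits it into tokens.
    for tok in $1; do
        case "$tok" in
            "$2"|"$2"=*) return 0 ;;
        esac
    done
    return 1
}

# Example use:
#   cmdline=$(cat /proc/cmdline)
#   for p in acpi_irq_nobalance nvidia.NVreg_Mobile nvidia.NVreg_RegisterForACPIEvents; do
#       has_param "$cmdline" "$p" && echo "$p: set" || echo "$p: missing"
#   done
```

If a parameter shows as missing here, the earlier test with it did not exercise what it was meant to.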

It’s been getting more annoying (on vacation, really want to play games), so I wrote a script which follows the logs to obtain a bug report from the crashed state:

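#!/bin/sh
# Follow the kernel log; run the report tool whenever the driver's crash hint appears.
journalctl -b -t kernel -f | grep --line-buffered nvidia-bug-report |
    while read -r line; do nvidia-bug-report.sh; done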

I ran this as root in the background to collect information when the device does fall off the bus. Unfortunately, it seems that everything is locked up at the time, and no bug report has been forthcoming, so the output posted originally will have to do.
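
A variant of the same idea, sketched under assumptions: it triggers on the Xid line itself rather than on the later hint line, runs the command only once, and keeps the command away from the log pipe. on_crash_line is a made-up name, and whether anything still gets written to disk once the machine locks up is exactly what was in doubt above, so this may not help either.

```shell
#!/bin/sh
# on_crash_line CMD... (hypothetical helper): read log lines on stdin and run
# CMD once, at the first "fallen off the bus" Xid line. Returns 1 if the
# input ends without a match.
on_crash_line() {
    while IFS= read -r line; do
        case "$line" in
            *"NVRM: Xid"*"fallen off the bus"*)
                # Detach stdin so CMD cannot consume the remaining log lines.
                "$@" </dev/null
                return 0
                ;;
        esac
    done
    return 1
}

# Example use (as root): collect the report, then flush it to disk.
#   journalctl -b -t kernel -f | on_crash_line sh -c 'nvidia-bug-report.sh; sync'
```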

Some background info for those who are better positioned to aid the OP than I:

Clevo P650RS - ArchWiki
https://wiki.archlinux.org/index.php/Clevo_P650RS

Sager NP8153-S (Clevo P650RS) Full Review - YouTube
https://www.youtube.com/watch?v=vCLmfunjQEk

Page 150 of the English, USA manual seems to indicate that ‘Secure Boot’ (and thus UEFI boot in favour of the CSM-enabled legacy boot?) can be disabled. (IMO *‘Secure Boot’ is a source of superfluous complexity and likely instability promoted by those who are also bent upon imposing permanently baked-in and proprietary out-of-band ‘remote management’ technologies in the design and manufacture of consumer-grade PCs).

CLEVO User Manual Download
http://clevo.com/en/e-services/download/USRManualOut.asp?model=P6xxRS&menual=+GO+

*After all, any sincere concern for consumers’ on-line security would include the industry-wide adoption of BIOS write-protect jumpers (a feature of the Talos Secure Workstation, BTW) and ECC RAM.

"It turns out that non-ECC RAM is actually a security risk, as bit flips can be exploited. “Bit-squatting” from Black Hat 2011:

Mar 15, 2013
Blackhat 2011 - Bit-squatting: DNS Hijacking without exploitation - YouTube
http://www.youtube.com/watch?v=_si0FYl_IOA

Bitsquatting: DNS Hijacking without exploitation
http://dinaburg.org/bitsquatting.html

*Secure Boot hacked
https://duckduckgo.com/?q=Secure+Boot+hacked&t=hu&ia=web

10 Aug 2016
*Bungling Microsoft singlehandedly proves that golden backdoor keys are a terrible idea • The Register
http://www.theregister.co.uk/2016/08/10/microsoft_secure_boot_ms16_100/

/rant

I’m going to have to write this off as a hardware problem – recently, the laptop failed to get past the POST without having the same crash. Also, as a further note, I have switched over to Hybrid mode, and will soon be able to produce the proper bug report using Bumblebee to keep my display available after the kernel module crashes.

Have you learned anything new with this issue?

I have the same laptop and it started doing this exactly as you described over the past couple of days, but only certain games, and not necessarily graphically demanding games.

For example, I’ve been able to play the latest Hitman at max graphics settings for hours without an issue while plugged in, but with Mass Effect 2, which is pretty old, the black screen issue starts almost immediately after plugging in. Eventually, I get stuck on the desktop with the message “application has been blocked from accessing graphics hardware” and have to force quit the game.

As of a couple of days ago, I can’t even get Deus Ex: Mankind Divided to start; it goes to an error message saying “a problem has occurred with your display driver (0x887A0007: DXGI_ERROR_DEVICE_RESET)”.

I’ve had the laptop working perfectly for 5 months so it’s pretty sudden.

Hi Boiler,

No, I’ve learned nothing new. I RMA’d the laptop and got it back about 3 weeks ago with a new motherboard. The problem appeared to have gone away, but it has started to return over the last couple of days. Your problem sounds somewhat different, as you’re actually getting feedback from the system, rather than simply being locked out of all input with the screen staying on but not responding.

I think the difference is that BoilerUp uses Windows (DXGI_ERROR is a DirectX error). So it is probably the same cause, with different symptoms due to the different OS.

Sorry, I didn’t realize that this was a Linux forum! Yes, I’m using Windows 10 Pro. I was just excited to see someone else with this issue, as I’m not having much luck finding any concrete info besides this thread.

Another error to throw into the hat: Grand Theft Auto 5 crashes immediately on start with the error “ERR_GFX_D3D_INIT. Failed Initialization.”

All these games worked perfectly at >75 fps until a few days ago, and they continue to work fine with the laptop unplugged, just at reduced fps since the GPU doesn’t run at full speed on battery.

As far as drivers go there were no changes made by me between when everything worked and the problem starting. I updated my nvidia drivers after it started with no change. I then did a clean reinstall of the latest drivers and everything worked again for about an hour, then the problem started up again.

I wonder if the GPU is on its way out and can’t cope at full voltage? In your case it sounds like, unless you were really unlucky with the replacement part, it’s probably not the mobo.

So you likely know about this forum’s Windows flip-side, eh?

GeForce 1000 Series Board - GeForce Forums
https://forums.geforce.com/default/board/172/geforce-1000-series/

BTW. Browse through the first link in my forum signature for a cross-platform collection of tid-bits including a growing glut of predominantly Windows 10 AU 1607 & Pascal-related info further down.

Oh yeah. There’s this SOP too:

Help others to help you. Please post complete and accurate system specs. (model numbers are good) in your forum signature. (You may have to log out and then log back in to edit said signature.)

It looks like the Pascal series strains the PCIe link so much that even the slightest quality problem leads to the GPU getting detached from the bus. I don’t know whether your BIOS lets you set 2nd-gen speeds instead of 3rd-gen.
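
To see which PCIe generation the link is actually running at, something like the following sketch may help. pcie_gen is a hypothetical helper that parses the LnkSta line printed by lspci -vv; the device address 01:00.0 is taken from the log in the first post, and lspci usually needs root to show this section. Note that the driver may drop the link to a lower speed at idle, so it is worth checking while the GPU is under load.

```shell
#!/bin/sh
# pcie_gen (hypothetical helper): read `lspci -vv` output on stdin and map the
# negotiated link speed from the first LnkSta line to a PCIe generation.
pcie_gen() {
    speed=$(sed -n 's/.*LnkSta:[[:space:]]*Speed \([^ ,]*\).*/\1/p' | head -n 1)
    case "$speed" in
        2.5GT/s) echo "Gen1" ;;
        5GT/s)   echo "Gen2" ;;
        8GT/s)   echo "Gen3" ;;
        *)       echo "unknown"; return 1 ;;
    esac
}

# Example use (as root):
#   lspci -vv -s 01:00.0 | pcie_gen
```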