GTX 1070 "GPU has fallen off the bus" running 3D games in Arch Linux

I’ve been troubleshooting 3D games that cause the GPU to fall off the bus. I’ve run out of avenues to explore and am looking for suggestions on what else to look into before calling this a hardware problem and pursuing an RMA.

Dmesg output:
[ 189.427267] NVRM: GPU at PCI:0000:01:00: GPU-73236338-bf17-442f-b881-d785485aa3bf
[ 189.427287] NVRM: GPU Board Serial Number:
[ 189.427290] NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.

[ 189.427296] NVRM: GPU at 0000:01:00.0 has fallen off the bus.
[ 189.427312] NVRM: GPU is on Board .
[ 189.427325] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
[ 204.377661] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000927c:0:0:0x0000000f
[ 204.378782] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000927c:0:0:0x0000000f
[ 204.379516] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000927c:0:0:0x0000000f
[ 204.380177] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000927c:0:0:0x0000000f
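
For anyone comparing their own logs against this one, the Xid line is the key entry. Below is a minimal sketch for pulling the PCI address and Xid code out of kernel log text, assuming the line format shown above; the function name xid_events is made up for illustration.

```shell
# Hypothetical helper: print "<PCI address> <Xid code>" for each NVRM Xid
# line read from stdin. Assumes the line format shown in the dmesg output.
xid_events() {
    sed -n -E 's/.*NVRM: Xid \(([^)]*)\): ([0-9]+),.*/\1 \2/p'
}

# Usage (reading the kernel log may require root):
#   dmesg | xid_events
```

Lines that do not contain an Xid report are dropped, so an empty result means no Xid was logged in the text that was piped in.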

Background details:
Eurocom Tornado F5 (MSI 16L13), i7-6700 CPU, GTX 1070 GPU

hotbox% uname -a
Linux hotbox 4.8.13-1-ARCH #1 SMP PREEMPT Fri Dec 9 07:24:34 CET 2016 x86_64 GNU/Linux

hotbox% pacman -Ss nvidia | grep installed
extra/libvdpau 1.1.1-2 [installed]
extra/libxnvctrl 375.26-1 [installed]
extra/nvidia 375.26-1 [installed]
extra/nvidia-libgl 375.26-2 [installed]
extra/nvidia-settings 375.26-1 [installed]
extra/nvidia-utils 375.26-2 [installed]
multilib/lib32-nvidia-libgl 375.26-2 [installed]
multilib/lib32-nvidia-utils 375.26-2 [installed]

hotbox% lsmod | grep nvidia
nvidia_drm 49152 1
nvidia_modeset 782336 4 nvidia_drm
nvidia 11870208 65 nvidia_modeset
drm_kms_helper 126976 1 nvidia_drm
drm 294912 4 nvidia_drm,drm_kms_helper

Symptoms:
Running 3D games inevitably causes the GPU to fall off the bus, resulting in a black screen and unresponsive directly connected input devices (keyboard, mouse). Any background music continues to play. GPU temps remain between 40 and 60 °C.

“The Long Dark”, run through the native Linux Steam client, is playable as long as I stay in interior locations. A crash typically occurs within a few minutes of entering an outside location, though on one occasion I was able to start a new game and play for roughly an hour.

Running “Insurgency” through Steam crashes shortly after the map has finished loading, though again there was an occasion where I was able to play longer.

When I run “Drunken Robot Pornography” or “Ziggurat” through Steam, or “Mass Effect” through WINE, I get substantially longer gameplay: up to several hours at a stretch in “Mass Effect.”

I have yet to experience a crash in a 2D game, but I haven’t put a lot of time into testing them. Day-to-day work with office tools, web browsing and media playback is all fine.

Troubleshooting steps:
I am able to start an SSH session, which I’ve used to collect the nvidia bug report and output of dmesg, journalctl -xe and Xorg.0.log immediately after a crash. (All should be attached)
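
For anyone wanting to gather the same data over SSH, the collection step can be sketched roughly as below. This is a minimal sketch and not necessarily the exact commands used here: the function name collect_crash_logs and the Xorg log path are assumptions, and the output file names simply mirror the attachments.

```shell
# Hypothetical helper: save post-crash diagnostics into a directory.
# Each step is allowed to fail so that a missing tool or a permission
# problem does not stop the rest; error text ends up in the output file.
collect_crash_logs() {
    out_dir=${1:-crash-logs}
    mkdir -p "$out_dir" || return 1
    dmesg > "$out_dir/dmesg.txt" 2>&1 || true
    journalctl -xe --no-pager > "$out_dir/journalctl.txt" 2>&1 || true
    # Assumed log location; it differs when X runs without root rights.
    cp /var/log/Xorg.0.log "$out_dir/xorgLog.txt" 2>/dev/null || true
    echo "logs saved in $out_dir"
}

# Usage, from the SSH session right after the crash:
#   collect_crash_logs ~/crash-logs
# The bug report is collected separately, as root, as the kernel message asks:
#   sudo nvidia-bug-report.sh
```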

After a crash, nvidia-smi -r reports that the GPU cannot be reset and that the system must be rebooted.

Using the NVIDIA Settings utility to set performance to maximum and nvidia-smi to toggle persistence mode on/off has not made a difference. It appears I am unable to turn off ECC mode for testing purposes.

Previous logs mentioned ‘irq 16: nobody cared (try booting with the “irqpoll” option)’ immediately before the crash. With the irqpoll option added as suggested, the crash still occurs, and the logs fill with messages about hpet losing large amounts of RTC interrupts leading up to and after the crash. Adding the hpet=disable option gets rid of those messages, but still doesn’t solve the problem.
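
When testing boot options like these, it helps to confirm they actually reached the running kernel. A minimal sketch, assuming the options are read from /proc/cmdline; the function name has_kernel_param is made up for illustration.

```shell
# Hypothetical helper: succeed if kernel parameter $1 appears as a whole
# word in the command line string $2 (defaults to the running kernel's).
has_kernel_param() {
    param=$1
    cmdline=${2:-$(cat /proc/cmdline)}
    case " $cmdline " in
        *" $param "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage:
#   has_kernel_param irqpoll && echo "irqpoll is active"
#   has_kernel_param hpet=disable && echo "hpet=disable is active"
```

Matching on whole words avoids a false positive when one option name happens to be a substring of another.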

Nouveau seems to work, but yields one frame per second in (admittedly not comprehensive) testing, so it’s not a feasible solution.

I found the following thread reporting very similar hardware and symptoms:
https://devtalk.nvidia.com/default/topic/984339/linux/gtx-1070m-on-clevo-p650rs-falling-off-the-bus/

It made the most sense for me to start a new thread, but perhaps the similarities warrant a merge.

Thank you for any help you can offer.
nvidia-bug-report.log.gz (269 KB)
dmesg.txt (89.4 KB)
journalctl.txt (97.8 KB)
xorgLog.txt (31.8 KB)

Please share the output of the dmidecode command. Are you using the Steam client to play games? Make sure there is no thermal or power issue with the GPU or system. Which desktop environment are you running: KDE, GNOME or something else?

Hi Sandip, thanks for your direction.

Output of dmidecode should be attached.

I’ve tried running games both with and without the Steam client. The problem is reproducible both ways. If you need, I can supply the crash logs from running a game without the Steam client.

I’m certain there isn’t a thermal issue, as I’ve monitored temps leading up to the GPU falling off the bus. Power should also be okay. Both PSU and battery are new. ACPI reports the battery was last charged to 98% of capacity.
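
For reference, temperature monitoring of this kind can be captured to a file and summarised afterwards. A minimal sketch, assuming nvidia-smi's CSV query output with a header row and the temperature in the second column; the function name peak_temp and the file name gpu-temp.csv are made up for illustration.

```shell
# Hypothetical helper: print the highest value found in column 2 of a CSV
# read from stdin, skipping the header row.
# Assumes the layout "timestamp, temperature.gpu".
peak_temp() {
    awk -F', *' 'NR > 1 && $2 + 0 > max { max = $2 + 0 } END { print max + 0 }'
}

# Usage: log once per second (locally or over SSH) until the crash,
#   nvidia-smi --query-gpu=timestamp,temperature.gpu --format=csv -l 1 > gpu-temp.csv
# then summarise the log:
#   peak_temp < gpu-temp.csv
```

Keeping the raw log also shows whether the temperature was climbing right before the last sample or had been flat for a while.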

I believe I’ve ruled out the DE as a source of the problem. I typically use Budgie (GNOME). I’ve tested with GNOME Shell, LXQt and TWM. The problem remains reproducible in all three desktop environments, as well as with a window manager only.
dmidecode.txt (24.7 KB)

Hi soseihin,

I had a very similar problem on my setup, which I ultimately determined to be caused by malfunctioning hardware. More specifically, I froze all system updates (kernel, NVIDIA drivers and software, i.e. I did not run pacman at all), and the issue went away after re-seating my graphics card and RAM. For details, see this thread.

There was some other user with a similar problem, though I don’t know if he ever got the problem fixed.

I don’t think this proves that it is a HW issue in your case, but that’s just my 2 cents…

Thanks Wild_Penguin. I’d love to know if Rattlewrench ever figured out his issue in that second post you linked me to. I think I recall having discovered your post fairly early in my troubleshooting. Your post, along with a few I found in the Arch forums, seemed to point to the possibility of it being a hardware problem, which I increasingly believe it is. But it’s still entirely possible that I’m overlooking something obvious.

Looks like some MB issue, have a look at the official owner’s forum:
http://forum.notebookreview.com/threads/the-official-msi-16l13-eurocom-tornado-f5-owners-and-discussions-lounge.797128/page-194
Maybe hook up with those people as one has RMA’d twice and he only got a new gpu which didn’t solve the problem.

Thanks, generix. Yeah, it looks like several people are having issues with the same hardware I am. I’ve started talking to Eurocom support about how to proceed. I’ll update this thread with a solution should they provide one. In the meantime I would still very much appreciate any ideas or information this community has to offer, and thank you all again for your help so far.

Might be interesting to have an output from nvidia-smi while the GPU is still working to see if autoboost is available and enabled. Then maybe disable it and see if the GPU is still falling off the bus.
AFAIK there’s no way to set the GPU to minimum performance thus limiting maximum power draw. There’s a feature request somewhere though.

Edit: seems frequency manipulation can be achieved using CoolBits:
http://www.phoronix.com/scan.php?px=MTY1OTM&page=news_item

Thanks for the new ideas, nvidia-smi gives me this report:

hotbox% nvidia-smi --auto-boost-default=1
Enabling/disabling default auto boosted clocks is not supported for GPU: 0000:01:00.0.
Treating as warning and moving on.
All done.

I tried to follow the directions outlined by Phoronix, but I don’t use an xorg.conf file because it always seems to break X for me. So unfortunately, until I get a working xorg.conf, I’m unable to add the Coolbits option to my X config.

I went ahead and dumped the output of nvidia-smi -qi into the attached file, in case you care to take a look at it.

A small update:
Eurocom got back to me, suggesting that the latest NVIDIA drivers are buggy and that I use the drivers they have available for download. Unfortunately they only provide the Windows driver, but according to a post I found in the Notebook Review forum you had previously directed me to, they’re using the 368.79 drivers. I found that 367.XX and 370.XX drivers are still available for download from NVIDIA, so I’ll try those out and report back.
nvidia-smi-qi.txt (6.09 KB)

Thanks again for the advice. I managed to set up a functioning xorg.conf and enabled Coolbits. Unfortunately, underclocking doesn’t appear to make a difference; symptoms and error logs remain the same.

I was unable to get the 367.44 driver to build, it fails with:

...

/tmp/selfgz647/NVIDIA-Linux-x86_64-367.44/kernel/nvidia-drm/nvidia-drm-modeset.c: In function ‘nvidia_drm_atomic_commit’:
/tmp/selfgz647/NVIDIA-Linux-x86_64-367.44/kernel/nvidia-drm/nvidia-drm-modeset.c:678:34: error: passing argument 1 of ‘drm_atomic_helper_swap_state’ from incompatible pointer type [-Werror=incompatible-pointer-types]
     drm_atomic_helper_swap_state(dev, state);
                                  ^~~
In file included from /tmp/selfgz647/NVIDIA-Linux-x86_64-367.44/kernel/nvidia-drm/nvidia-drm-modeset.c:37:0:
./include/drm/drm_atomic_helper.h:75:6: note: expected ‘struct drm_atomic_state *’ but argument is of type ‘struct drm_device *’
 void drm_atomic_helper_swap_state(struct drm_atomic_state *state,
      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
  LD [M]  /tmp/selfgz647/NVIDIA-Linux-x86_64-367.44/kernel/nvidia-modeset.o
cc1: some warnings being treated as errors
make[2]: *** [scripts/Makefile.build:289: /tmp/selfgz647/NVIDIA-Linux-x86_64-367.44/kernel/nvidia-drm/nvidia-drm-modeset.o] Error 1
  LD [M]  /tmp/selfgz647/NVIDIA-Linux-x86_64-367.44/kernel/nvidia-uvm.o
make[2]: Target '__build' not remade because of errors.
make[1]: *** [Makefile:1473: _module_/tmp/selfgz647/NVIDIA-Linux-x86_64-367.44/kernel] Error 2
make[1]: Target 'modules' not remade because of errors.
make[1]: Leaving directory '/usr/lib/modules/4.8.13-1-ARCH/build'
make: *** [Makefile:81: modules] Error 2
ERROR: The nvidia kernel module was not created.
ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
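
The compiler note above shows that the 4.8.13 headers expect a struct drm_atomic_state * as the first argument while the 367.44 source passes a struct drm_device *, which suggests the driver was written against a different version of that kernel function. A minimal sketch for checking what the installed headers declare, assuming the header path from the build output; the function name show_prototype is made up for illustration.

```shell
# Hypothetical helper: print the declaration of function $1 from C header $2,
# plus the following line in case the parameter list wraps.
show_prototype() {
    func=$1
    header=$2
    grep -A1 "$func(" "$header"
}

# Usage, with the build directory named in the log above:
#   show_prototype drm_atomic_helper_swap_state \
#       "/usr/lib/modules/$(uname -r)/build/include/drm/drm_atomic_helper.h"
```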

I did, however, get the 370.28 driver to install, though with it the GPU still fails in the exact same way, with the exact same errors. The error log is attached. I’ll report all this new information to Eurocom and see what they have to say. Thanks again for your continued assistance.
nvidia-bug-report.txt (562 KB)

To avoid needless shipping fees and to rule out a Linux/driver problem entirely, I installed Windows 10. After several days of testing I seem to be able to run 3D games without any problems. Not sure where to go from here as it doesn’t appear to be a hardware issue after all.

Hi soseihin, we would like to reproduce this issue internally to debug further.
Could you please provide step-by-step reproduction steps?
You mentioned that multiple games crash. Do you get the same error in the log/dmesg for every game?
Can you provide repro steps for one or two games?
How long do you need to play?
Does the issue reproduce on a specific map in the game?
What action triggers this issue? Are all game patches/updates applied?
Please provide a crash dump or backtrace from when the game crashes.
Did you see this issue on any other OS, like Ubuntu/Fedora?
Did you test with the 378.09 driver?
Does any older driver resolve this issue?
Are there any custom settings in Steam or the game?
What are the resolutions of the game and the display?
I think your OS is in UEFI/EFI mode. Please share the output of the dmidecode command.

Please provide as much info as possible about your hardware/software setup and repro details; that will help us replicate the exact same environment here and try to reproduce this issue.
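
The UEFI question can also be answered directly on the affected machine. A minimal sketch, assuming a Linux system where the kernel exposes /sys/firmware/efi only when booted through UEFI; the function name boot_mode is made up for illustration.

```shell
# Hypothetical helper: report whether the system booted via UEFI or legacy
# BIOS by checking for the EFI directory the kernel exposes in sysfs.
# The directory can be overridden with $1, mainly for testing.
boot_mode() {
    efi_dir=${1:-/sys/firmware/efi}
    if [ -d "$efi_dir" ]; then
        echo UEFI
    else
        echo BIOS
    fi
}

# Usage:
#   boot_mode
#   sudo dmidecode > dmidecode.txt
```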

I have the same problem with an EVGA GTX 1080 on Ubuntu 16.10. I’ve ruled out a hardware problem too; so far this only happens to me when playing XCOM2 and Victor Vran. I have played other games like Deus Ex: Mankind Divided and Total War: Warhammer for hours without problems. I tested this with the 375.39 and 378.13 drivers.

I have the same problem with a GTX 1080 on Arch Linux. I use driver version 384.69, and the problem appears after a few minutes in XCOM2/Alien Isolation. XCOM works fine.

I’m experiencing the same problem with a Eurocom Sky X7E2 and GTX 1080. I have tried drivers 384.111, 387.34 and 390.25 (all available from the Ubuntu repositories); it made no difference. I have problems with Unigine Valley, Total War Warhammer, War for the Overlord, and virtually all the more demanding games I’ve tried. Less demanding ones (Minecraft) are fine. Surprisingly, the Unigine Superposition benchmark finished without a problem, but that may only be a coincidence.

Output of dmidecode and nvidia-bug-report attached. Both created after system restart. I probably can create them before restart through ssh, if necessary.
dmidecode.txt (15.8 KB)
nvidia-bug-report.log.gz (154 KB)

I have the same problem: after 5-8 minutes of gaming, the NVIDIA GPU falls off the bus :/
It has happened in 4 different games so far.

Dell XPS 15 7590 (GTX 1650), newest drivers: nvidia-driver-435, ubuntu 18.04 LTS