I’ve been troubleshooting a problem where running 3D games causes the GPU to fall off the bus. I’ve run out of avenues to explore and am looking for suggestions on what else to check before I call this a hardware problem and pursue an RMA.
Dmesg output:
[ 189.427267] NVRM: GPU at PCI:0000:01:00: GPU-73236338-bf17-442f-b881-d785485aa3bf
[ 189.427287] NVRM: GPU Board Serial Number:
[ 189.427290] NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
[ 189.427296] NVRM: GPU at 0000:01:00.0 has fallen off the bus.
[ 189.427312] NVRM: GPU is on Board .
[ 189.427325] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
[ 204.377661] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000927c:0:0:0x0000000f
[ 204.378782] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000927c:0:0:0x0000000f
[ 204.379516] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000927c:0:0:0x0000000f
[ 204.380177] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000927c:0:0:0x0000000f
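For anyone comparing logs, here is a minimal sketch of how the Xid events could be pulled out of a saved dmesg dump. It assumes the line format shown above; the names `XID_RE` and `find_xid_events` are made up for illustration and are not part of any NVIDIA tool.

```python
import re

# Assumed line format, taken from the dmesg output above:
# [ 189.427290] NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
XID_RE = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+NVRM: Xid \((?P<device>[^)]+)\): "
    r"(?P<code>\d+)(?:, (?P<message>.*))?"
)


def find_xid_events(dmesg_text):
    """Return (seconds_since_boot, pci_device, xid_code, message) tuples."""
    events = []
    for line in dmesg_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append((
                float(match.group("time")),
                match.group("device"),
                int(match.group("code")),
                (match.group("message") or "").strip(),
            ))
    return events


# Two lines copied from the output above; only the first is an Xid report.
sample = (
    "[ 189.427290] NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.\n"
    "[ 189.427296] NVRM: GPU at 0000:01:00.0 has fallen off the bus.\n"
)
for event in find_xid_events(sample):
    print(event)
```

Running this over the dmesg dump from each crash would show whether Xid 79 is always the first error or whether another Xid code precedes it.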
Background details:
Eurocom Tornado F5 (MSI 16L13), i7-6700 CPU, GTX 1070 GPU
hotbox% uname -a
Linux hotbox 4.8.13-1-ARCH #1 SMP PREEMPT Fri Dec 9 07:24:34 CET 2016 x86_64 GNU/Linux
hotbox% pacman -Ss nvidia | grep installed
extra/libvdpau 1.1.1-2 [installed]
extra/libxnvctrl 375.26-1 [installed]
extra/nvidia 375.26-1 [installed]
extra/nvidia-libgl 375.26-2 [installed]
extra/nvidia-settings 375.26-1 [installed]
extra/nvidia-utils 375.26-2 [installed]
multilib/lib32-nvidia-libgl 375.26-2 [installed]
multilib/lib32-nvidia-utils 375.26-2 [installed]
hotbox% lsmod | grep nvidia
nvidia_drm 49152 1
nvidia_modeset 782336 4 nvidia_drm
nvidia 11870208 65 nvidia_modeset
drm_kms_helper 126976 1 nvidia_drm
drm 294912 4 nvidia_drm,drm_kms_helper
Symptoms:
Running 3D games inevitably causes the GPU to fall off the bus, resulting in a black screen and the inability to use directly connected input devices (keyboard, mouse). Any background music continues to play. GPU temperatures remain between 40 and 60 °C.
“The Long Dark”, run through the native Linux Steam client, is playable as long as I remain in interior locations. A crash will typically occur within a few minutes of entering an outside location, though on one occasion I was able to start a new game and play for roughly an hour.
Running “Insurgency” through Steam crashes shortly after the map has finished loading, though again there was an occasion where I was able to play longer.
When I run “Drunken Robot Pornography” or “Ziggurat” through Steam and “Mass Effect” through WINE, I get substantially longer gameplay, up to several hours at a stretch in “Mass Effect.”
I have yet to experience a crash in a 2D game, but haven’t put a lot of time into testing them. Day-to-day work with office tools, web browsing and media playback is all fine.
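A minimal sketch of how temperature, power draw and clocks could be logged from the SSH session in the run-up to a crash. It relies on the `--query-gpu`, `--format` and `-l` options of nvidia-smi; the field list is my own choice, some fields may report "[Not Supported]" on this GPU, and the function names and the log path are hypothetical.

```python
import subprocess

# Fields passed to `nvidia-smi --query-gpu`; this selection is an assumption
# and some fields may come back as "[Not Supported]" on a given GPU.
FIELDS = ["timestamp", "temperature.gpu", "power.draw", "clocks.sm", "pstate"]


def build_command(interval_seconds=1):
    """Build an nvidia-smi command that prints one CSV line per interval."""
    return [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
        "-l", str(interval_seconds),
    ]


def parse_sample(line):
    """Map one CSV line of nvidia-smi output onto the field names."""
    values = [value.strip() for value in line.split(",")]
    return dict(zip(FIELDS, values))


def log_samples(path, interval_seconds=1):
    """Append samples to `path`, flushing each line so the readings taken
    just before a crash are already on disk."""
    with open(path, "a") as log, subprocess.Popen(
        build_command(interval_seconds), stdout=subprocess.PIPE, text=True
    ) as proc:
        for line in proc.stdout:
            log.write(line)
            log.flush()


# Hypothetical usage, started over SSH before launching a game:
# log_samples("/tmp/gpu-samples.csv")
```

The last few lines of the log before the Xid 79 message would show whether the crash lines up with a power or clock transition rather than with temperature.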
Troubleshooting steps:
I am able to start an SSH session after a crash, which I’ve used to collect the NVIDIA bug report along with the output of dmesg and journalctl -xe and a copy of Xorg.0.log. (All should be attached.)
After a crash, nvidia-smi -r reports that the GPU cannot be reset and that the system must be rebooted.
Using the NVIDIA Settings utility to set performance to maximum and nvidia-smi to toggle persistence mode on/off has not made a difference. It appears I am unable to turn off ECC mode for testing purposes.
Previous logs mentioned ‘irq 16: nobody cared (try booting with the “irqpoll” option)’ immediately before the crash. Booting with the irqpoll option as suggested does not prevent the crash, and it yields many messages about hpet losing large numbers of RTC interrupts both leading up to and after the crash. Adding the hpet=disable option gets rid of those messages, but still doesn’t solve the problem.
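A minimal sketch for confirming that the boot options actually took effect on a given boot, by reading the kernel command line from /proc/cmdline. The helper names are hypothetical, and the parser splits on whitespace only, so quoted values containing spaces are not handled.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def read_cmdline(path="/proc/cmdline"):
    """Return the command line the running kernel was booted with."""
    with open(path) as handle:
        return handle.read()


def report(cmdline, wanted=("irqpoll", "hpet")):
    """Return {name: (is_present, value)} for each option of interest."""
    params = parse_cmdline(cmdline)
    return {name: (name in params, params.get(name)) for name in wanted}


# Hypothetical usage on the affected machine:
# print(report(read_cmdline()))
```

Saving this result alongside each crash log makes it clear which combination of irqpoll and hpet=disable was active for that run.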
Nouveau seems to work, but yields one frame per second in (admittedly not comprehensive) testing, so it’s not a feasible solution.
I found the following thread reporting very similar hardware and symptoms:
https://devtalk.nvidia.com/default/topic/984339/linux/gtx-1070m-on-clevo-p650rs-falling-off-the-bus/
It made the most sense for me to start a new thread, but perhaps the similarities warrant a merge.
Thank you for any help you can offer.
nvidia-bug-report.log.gz (269 KB)
dmesg.txt (89.4 KB)
journalctl.txt (97.8 KB)
xorgLog.txt (31.8 KB)