GPU has fallen off the bus - GTX 1070 - nvidia-gfxG04-kmp-default-390.87 [Solved - dead GPU]

This suddenly started happening, and it keeps recurring.

Drivers were installed two weeks ago. Crashes started happening today. No HW changes.

Happens with or without load.

[ 7014.746693] tun: Universal TUN/TAP device driver, 1.6
[ 7014.747016] br0: port 2(vnet0) entered blocking state
[ 7014.747017] br0: port 2(vnet0) entered disabled state
[ 7014.747046] device vnet0 entered promiscuous mode
[ 7014.747155] br0: port 2(vnet0) entered blocking state
[ 7014.747156] br0: port 2(vnet0) entered forwarding state
[ 7014.946451] L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/l1tf.html for details.
[ 7151.601927] NVRM: GPU at PCI:0000:01:00: GPU-e9ab817b-191c-2aec-03b4-4d1b3a7883b3
[ 7151.601932] NVRM: GPU Board Serial Number:
[ 7151.601934] NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
[ 7151.601939] NVRM: GPU at 0000:01:00.0 has fallen off the bus.
[ 7151.601940] NVRM: GPU is on Board .
[ 7151.601950] NVRM: A GPU crash dump has been created. If possible, please run
               NVRM: nvidia-bug-report.sh as root to collect this data before
               NVRM: the NVIDIA kernel module is unloaded.
[ 7151.601977] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000927c:0:0:0x0000000f
[ 7292.640054] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000987d:0:0:0x0000000f
[ 7292.640064] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000917e:0:0:0x0000000f
[ 7292.640072] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000927c:0:0:0x0000000f

nvidia-bug-report.log.gz (303 KB)

Are you also hitting this issue while playing a game? See https://devtalk.nvidia.com/default/topic/1039096/linux/gpu-has-fallen-off-the-bus-gpu-crashes-after-a-while-under-load-ie-playing-games-/

This kind of issue is normally due to thermal problems. Make sure your GPU is not overheating and has proper cooling.

I’ll start doing a temp log and see what it says.

nvidia-smi -q -l 3 -d TEMPERATURE >nvtemp.log

I’ll attach that after the next crash.

Thanks.

Crashed again, and it does not appear to be thermal related. It was 60+ C while I was playing a game; I left to do something else and came back to a crash.

==============NVSMI LOG==============

Timestamp                           : Mon Sep 17 09:10:46 2018
Driver Version                      : 390.87

Attached GPUs                       : 1
GPU 00000000:01:00.0
    Temperature
        GPU Current Temp            : 53 C
        GPU Shutdown Temp           : 99 C
        GPU Slowdown Temp           : 96 C
        GPU Max Operating Temp      : N/A
        Memory Current Temp         : N/A
        Memory Max Operating Temp   : N/A

==============NVSMI LOG==============

Timestamp                           : Mon Sep 17 09:10:49 2018
Driver Version                      : 390.87

Attached GPUs                       : 1
GPU 00000000:01:00.0
    Temperature
        GPU Current Temp            : GPU is lost
        GPU Shutdown Temp           : GPU is lost
        GPU Slowdown Temp           : GPU is lost
        GPU Max Operating Temp      : N/A
        Memory Current Temp         : N/A
        Memory Max Operating Temp   : N/A
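
For anyone who wants to pull the same conclusion out of a long nvtemp.log without scrolling, here is a minimal Python sketch. It assumes the log has the same layout as the two samples above ("Timestamp" and "GPU Current Temp" fields, with "GPU is lost" once the card drops). The function name find_gpu_loss is made up for illustration and is not part of any NVIDIA tool.

```python
import re


def find_gpu_loss(log_text):
    """Scan `nvidia-smi -q -l -d TEMPERATURE` output.

    Returns (last_good, lost_at), where last_good is a
    (timestamp, temp_c) tuple for the last sample with a numeric
    temperature, and lost_at is the timestamp of the first sample
    reporting "GPU is lost" (None if the GPU never dropped).
    """
    timestamp = None
    last_good = None
    for line in log_text.splitlines():
        # Split on the first colon only, since timestamps contain colons.
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Timestamp":
            timestamp = value
        elif key == "GPU Current Temp":
            match = re.fullmatch(r"(\d+) C", value)
            if match:
                last_good = (timestamp, int(match.group(1)))
            elif value == "GPU is lost":
                return last_good, timestamp
    return last_good, None


if __name__ == "__main__":
    with open("nvtemp.log") as handle:
        print(find_gpu_loss(handle.read()))
```

If the last good reading is well below the slowdown temperature, as it is here (53 C against 96 C), overheating at the GPU sensor is an unlikely cause.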

dmesg output:

[ 5136.575433] NVRM: GPU at PCI:0000:01:00: GPU-e9ab817b-191c-2aec-03b4-4d1b3a7883b3
[ 5136.575436] NVRM: GPU Board Serial Number:
[ 5136.575437] NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
[ 5136.575440] NVRM: GPU at 0000:01:00.0 has fallen off the bus.
[ 5136.575440] NVRM: GPU is on Board .
[ 5136.575447] NVRM: A GPU crash dump has been created. If possible, please run
               NVRM: nvidia-bug-report.sh as root to collect this data before
               NVRM: the NVIDIA kernel module is unloaded.
[ 5137.519220] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000927c:0:0:0x0000000f
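
To count how often this happens across a day of crashes, the Xid lines can be picked out of saved dmesg output. The following is a rough sketch that assumes the line format shown above ("[ uptime] NVRM: Xid (PCI:...): code, message"). The names XID_RE and find_xid_events are hypothetical helpers, not anything shipped with the driver.

```python
import re

# Matches lines such as:
# [ 5136.575437] NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
XID_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+),\s*(.*)"
)


def find_xid_events(dmesg_text):
    """Return one dict per NVRM Xid line found in dmesg output."""
    events = []
    for line in dmesg_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append(
                {
                    "uptime_s": float(match.group(1)),  # seconds since boot
                    "device": match.group(2),
                    "xid": int(match.group(3)),
                    "message": match.group(4).strip(),
                }
            )
    return events


if __name__ == "__main__":
    import sys

    # Example: dmesg | python3 this_script.py
    for event in find_xid_events(sys.stdin.read()):
        print(event)
```

Every crash in this thread shows the same code, Xid 79, which is the "GPU has fallen off the bus" event.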

Hi dane.buson,
What game are you playing? How long do you need to play to hit this issue? Are there any custom in-game settings? What are the game and desktop resolutions? Which desktop environment are you running: KDE, GNOME, Xfce, or something else? Are desktop effects enabled? Do you have another system to test on? See if you can reproduce it with other GPUs too. Also, does the issue hit on a specific map or during a specific action in the game?

Team Fortress 2. This happens in game, or just sitting at the desktop straight from boot never launching a game.

This is not game related.

The other systems I have are not running the same OS. Desktop effects are enabled.

It has occurred about 8 times today. I can see if I still have a 970 I can swap in next time it crashes.

>> It has occurred about 8 times today. I can see if I still have a 970 I can swap in next time it crashes.
That is a good way to check whether it's the GPU or another hardware issue. Also try a different NVIDIA driver version to check whether it's a driver issue. It would also be good to contact the GPU vendor to check for a GPU hardware fault.

>> This is not game related.
Can you please find out what activities hit this issue?

I’ve replaced it with a GTX 970 - I’ll see if it crashes today - if not I’d say it’s a hardware issue.

Okay. Keep us posted.

This can be closed out - no crashes after replacing with another GPU. I just got my RMA replacement in the mail from EVGA.

Thanks for your help.