GPU has fallen off the bus | GPU crashes after a while under load (e.g. playing games)

I have an issue with a GTX 970 (and most likely with a GTX 660 as well, I need to verify that) on Arch Linux.

When playing games, the GPU crashes after a while. The issue appeared sometime during the summer; I can’t pinpoint exactly when, but it hasn’t been happening for long.

I am able to ssh into my machine after the crash, so it is only the GPU that is crashing. I have also run memtest for hours and kept the CPU under max load for hours with no problems. But playing games for 10 +/- 5 minutes leads to a crash.

Below is the report from nvidia-bug-report after the crash, but it wasn’t run with logverbose 6; I will need to do that to see if it provides more information. After that, for comparison, is the same report run after a clean boot.

Also, at the moment I am running LTS Kernel as a test when trying to fix this.

Hmm… code blocks are not working for some reason. But see that I can attach files as well.
nvidia-bug-report.log_after-crash.gz (228 KB)
nvidia-bug-report.log_after-boot.gz (103 KB)

Check temperature while playing, e.g.
nvidia-smi -q -l 3 -d TEMPERATURE >nvtemp.log
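
A log written this way can get large, so a short script helps pull out the peak and the last reading before a crash. The following is a minimal Python sketch, not an official tool. It assumes each sample contains a line of the form `GPU Current Temp : 71 C`; the function name `summarize_temps` is made up for illustration.

```python
import re

# Assumed line format in the nvidia-smi query output (an assumption,
# check it against your own nvtemp.log):
#     GPU Current Temp            : 71 C
TEMP_RE = re.compile(r"GPU Current Temp\s*:\s*(\d+)\s*C")


def summarize_temps(text):
    """Return (sample_count, peak, last) for the temperature readings in text.

    peak and last are None when no reading is found.
    """
    temps = [int(m.group(1)) for m in TEMP_RE.finditer(text)]
    if not temps:
        return 0, None, None
    return len(temps), max(temps), temps[-1]
```

To use it, read the log and pass the contents in, e.g. `summarize_temps(open("nvtemp.log").read())`. The last value is the reading closest to the crash.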

Checked the temperature. Last reported temp prior to crash was 71 degrees. I am also attaching crash report on logverbose 6.
nvtemp.log (474 KB)
nvidia-bug-report.log.gz (253 KB)

Cooling is perfectly fine. Did you already try to reseat the card? Checking with another PSU would also be an option.

I haven’t tried reseating. Will do that. I will also double check with another card: I am 99% certain I can reproduce it there as well, because I had this issue to a lesser extent before switching to the current card.

Then you should check your PSU. Maybe it is dusty, so it runs hotter than is healthy and can no longer provide enough power.

Actually, I doubt that it is a hardware issue (and this setup was assembled last spring, not enough time for dust to really accumulate).

Today I was able to run this under load for over 4 hours with the LTS kernel (4.14) and the legacy (390) driver installed. Then, after switching both to current, it crashed again under the same load.

The same for me on GF GTX Titan X (maxwell):

the latest nvidia driver 396.54
ubuntu 18.10 (dev version), kernel 4.18.0.7

I randomly get a black screen even without any GPU load (while browsing in Chrome). In the kernel logs:

Sep 12 11:53:51 dad-linux64-nvme kernel: NVRM: GPU Board Serial Number: 0420715036976
Sep 12 11:53:51 dad-linux64-nvme kernel: NVRM: Xid (PCI:0000:02:00): 79, GPU has fallen off the bus.
Sep 12 11:53:51 dad-linux64-nvme kernel: NVRM: GPU at 00000000:02:00.0 has fallen off the bus.
Sep 12 11:53:51 dad-linux64-nvme kernel: NVRM: GPU is on Board 0420715036976.
Sep 12 11:53:51 dad-linux64-nvme kernel: NVRM: A GPU crash dump has been created. If possible, please run
                                            NVRM: nvidia-bug-report.sh as root to collect this data before
                                            NVRM: the NVIDIA kernel module is unloaded.
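
When comparing crashes across kernels and drivers, it can help to pull only the Xid lines out of a saved kernel log. Here is a minimal Python sketch that assumes the `NVRM: Xid (PCI:...): <number>, <message>` layout shown above. The names `parse_xid_events` and `count_by_code` are hypothetical helpers, not part of any NVIDIA tool.

```python
import re
from collections import Counter

# Matches kernel log lines such as:
#   NVRM: Xid (PCI:0000:02:00): 79, GPU has fallen off the bus.
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+), (.*)")


def parse_xid_events(log_text):
    """Return a list of (pci_address, xid_code, message) tuples."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append((match.group(1), int(match.group(2)), match.group(3).strip()))
    return events


def count_by_code(events):
    """Count how often each Xid code occurs in the parsed events."""
    return Counter(code for _, code, _ in events)
```

Feeding it the text of a saved journal (for example one exported with `journalctl -k`) gives a quick view of which Xid codes occurred and how often.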

Temperature is OK under load. Is this some Linux driver bug?

Ah, I was able to crash it again with lts kernel and legacy drivers so it doesn’t seem to be about version either.

Tried reseating the GPU and confirmed that it still crashes. Log attached.

I am all out of ideas.
nvidia-bug-report.log.gz (246 KB)

Did you at some point check if the GTX660 still runs fine? Tried the 970 in another system?

Make sure you have proper cooling. This kind of issue is normally due to GPU temperature.

Hi p.m.kinnari,

>>Then when switching both to current it crashed again on same load.
What change did you make to the system here?

Did you test with the 390.87 driver? What game are you playing? How long do you need to play to hit this issue? Are there any custom in-game settings? What are the game and desktop resolutions? What desktop environment are you running: KDE, GNOME, Xfce, or something else? Are desktop effects enabled? Do you have any other system to test with? See if you can reproduce it with other GPUs too. Also, is this issue hit on a specific map in the game, or with a specific in-game action?

BTW I also reproduced Xid 8, 38

NVRM: Xid (PCI:0000:02:00): 8, Channel 00000036
NVRM: Xid (PCI:0000:02:00): 38, 0008 0000b197 00000000 00000000 00000000 00000000

This happens after 1-2 minutes of running the Steam Linux game ROTR. I guess the 396.54 driver is still not stable. Do I have to switch to a 390.xx release?

Hi p.m.kinnari, any update on the requested information? I think you are facing this issue only while playing a game.

Hi dad_gf, I think you are facing this issue even without playing a game, right? Make sure you have proper cooling. This kind of issue is normally due to GPU temperature.

GPU temp is OK: I played the same title (ROTR) on Windows 10 for 2 hours with no problem. I also monitored my GPU temp, and it’s OK. On Linux I hit the issue (with different Xids, for example 8 and 16) under some kinds of GPU load, even light ones, and not only in games: it can be the Chrome browser or even Google Earth. The screen goes blank during these Xids. But mining loads (near 100%), for example, work for 10 hours without any issue. Could this be an NVIDIA driver bug or an Ubuntu 18.10 (unstable version) bug?
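
For readers keeping track of the Xid numbers reported so far in this thread (8, 16, 38, 79), here is a small Python lookup sketch. The short labels are taken from NVIDIA's public Xid table as best recalled, so treat the wording as an assumption and check the official documentation. `describe_xid` is a hypothetical helper name.

```python
# Short labels for the Xid codes mentioned in this thread. The wording is an
# assumption based on NVIDIA's public Xid table; verify against the docs.
XID_LABELS = {
    8: "GPU stopped processing",
    16: "Display engine hung",
    38: "Driver firmware error",
    79: "GPU has fallen off the bus",
}


def describe_xid(code):
    """Return a one-line description for an Xid code."""
    label = XID_LABELS.get(code, "not in this table, see NVIDIA's Xid documentation")
    return f"Xid {code}: {label}"
```

Note that a label only names the symptom the driver saw. The same Xid can have hardware, power, or driver causes, which is why the thread keeps testing all three.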

Temperatures haven’t gone up.

I went back to current driver and kernel. So I have tested it with 4.14 and 4.18 kernels as well as 390 and 396 drivers. Both kernels with both driver versions.

I wiped my Steam folder recently, and at the moment I have only played Wargame: Red Dragon. Playing time in the single-player campaign is 5-10 minutes until the crash. No custom settings. A crash with similar symptoms also happens with, for example, World of Warships (via Wine).

Resolution is 1920x1080.

I am using Awesome. I do have KDE installed as well, but I haven’t tried with it.

Today I finally swapped my 660 back in, and I was able to play longer. The crash was also different. Instead of the signal going missing and the display going to standby, it froze completely. And the system doesn’t respond to ping (earlier I was able to ssh in and run the bug reporting tool).

I will attach the journal from the boot before the crash, but there isn’t anything there prior to the boot. I also ran the bug reporting tool after the boot, for what it is worth.
nvidia-bug-report.log.gz (109 KB)
journal.txt (135 KB)

Swapped in a new AMD card today. It ran without any issues for longer than I remember being able to before.

But while installing that card, I noticed that my system was using a very old xorg.conf, and I began wondering whether that might be a cause of my problems. So next I will swap the 970 back in to see if the situation is still the same.

I have now swapped the 970 back in, and so far I have not been able to crash it.

Could that old xorg.conf really have caused these crashes?

Aand the crashes are back. Interestingly, I got different errors from Wargame: Red Dragon and Bioshock Infinite. RD still has the old “Fallen off the bus”, but Bioshock has a long list of graphics exceptions. The observable result was the same, though: the screen went to standby, but I was able to ssh in and generate the bug report, which is attached to this message.

I also ran a 1-hour test with this: Multi-GPU CUDA stress test, but wasn’t able to recreate the crash. I think I need to run a longer one, though, as it took longer than that to make Bioshock crash.

As a test, I also ran gpu-burn and http://systester.sourceforge.net/ simultaneously for half an hour, but there was no crash there either. Temps were stable, around 71 for the GPU and 52 for the CPU.
nvidia-bug-report.log.gz (1.08 MB)

You’re getting Xid 79 with three different GPUs, so it’s likely that your PSU is faulty. Have you ever tried replacing it?