I have an issue with a GTX 970 (and most likely with a GTX 660 as well, I need to verify that) on Arch Linux.
When playing games it crashes after a while. This issue was introduced sometime during the summer. I can’t pinpoint the exact time, but it hasn’t been like that for a long time.
I am able to ssh into my machine after the crash, so it is only the GPU that is crashing. I have also run memtest for hours and kept the CPU under max load for hours with no problems. But then playing games for 10 +/- 5 minutes leads to a crash.
Below is the report from nvidia-bug-report after the crash, but it wasn’t run with logverbose 6; I will need to do that to see if it provides more information. After that, for comparison, is the same report run after a clean boot.
Also, at the moment I am running the LTS kernel as a test while trying to fix this.
Checked the temperature. The last reported temperature prior to the crash was 71 degrees. I am also attaching a crash report with logverbose 6. nvtemp.log (474 KB) nvidia-bug-report.log.gz (253 KB)
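For anyone who wants to capture the last temperature before a crash in a similar way, here is a minimal sketch. I don't know how nvtemp.log above was actually produced; this version assumes `nvidia-smi` is on the PATH and that `temperature.gpu` returns a plain integer for every GPU. The function names (`parse_temps`, `read_gpu_temps`, `log_temps`) are made up for illustration.

```python
import subprocess
import time


def parse_temps(output):
    """Parse nvidia-smi output holding one integer temperature per line (one line per GPU)."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]


def read_gpu_temps():
    """Query the current GPU temperatures (degrees C) through nvidia-smi."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.gpu", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,  # raises if nvidia-smi fails, e.g. once the GPU is gone
    )
    return parse_temps(result.stdout)


def log_temps(path, interval_s=1.0):
    """Append timestamped readings to `path` until interrupted or nvidia-smi fails."""
    with open(path, "a") as log:
        while True:
            temps = read_gpu_temps()
            log.write(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {temps}\n")
            log.flush()  # hand each reading to the OS right away instead of buffering
            time.sleep(interval_s)
```

Calling `log_temps("nvtemp.log")` from an ssh session before starting the game should leave the final reading as the last line of the file once the query starts failing.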
I haven’t tried reseating. Will do that. I will also double-check with another card, because I am 99% certain that I can reproduce it with that one as well: I did have this issue to a lesser extent prior to switching to the current card.
Actually, I doubt that it is a hardware issue (and this setup was assembled last spring, not enough time for dust to really accumulate).
Today I was able to run this under load for over 4 hours when I had the LTS kernel (4.14) and the legacy (390) driver installed. Then, when switching both to current, it crashed again on the same load.
I randomly get a black screen even without any GPU load (while browsing in Chrome). In the kernel logs:
сен 12 11:53:51 dad-linux64-nvme kernel: NVRM: GPU Board Serial Number: 0420715036976
сен 12 11:53:51 dad-linux64-nvme kernel: NVRM: Xid (PCI:0000:02:00): 79, GPU has fallen off the bus.
сен 12 11:53:51 dad-linux64-nvme kernel: NVRM: GPU at 00000000:02:00.0 has fallen off the bus.
сен 12 11:53:51 dad-linux64-nvme kernel: NVRM: GPU is on Board 0420715036976.
сен 12 11:53:51 dad-linux64-nvme kernel: NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
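Since different Xid numbers come up in this thread, a quick way to pull them out of a kernel log might help when comparing crashes. This is only a sketch: the regular expression is based on the single Xid line format shown above, and `find_xids` is a name chosen for illustration, not part of any NVIDIA tool.

```python
import re

# Matches lines such as:
#   NVRM: Xid (PCI:0000:02:00): 79, GPU has fallen off the bus.
XID_RE = re.compile(
    r"NVRM: Xid \((?P<device>PCI:[0-9A-Fa-f:.]+)\): (?P<xid>\d+)(?:, (?P<detail>.*))?"
)


def find_xids(log_text):
    """Return (device, xid, detail) for every Xid line found in the log text."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            detail = (match.group("detail") or "").strip()
            events.append((match.group("device"), int(match.group("xid")), detail))
    return events


sample = "kernel: NVRM: Xid (PCI:0000:02:00): 79, GPU has fallen off the bus."
print(find_xids(sample))
# prints [('PCI:0000:02:00', 79, 'GPU has fallen off the bus.')]
```

The text passed in can be saved kernel log output, for example from `journalctl -k`; lines without an Xid are simply skipped.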
Make sure you have proper cooling. This kind of issue is normally due to GPU temperature.
Hi p.m.kinnari,
>>Then when switching both to current it crashed again on same load.
What changes did you make to the system here?
Did you test with the 390.87 driver? What game are you playing? How long do you need to play it to hit this issue? Are there any custom in-game settings? What are the game and desktop resolutions? Which desktop environment are you running: KDE, GNOME, Xfce, or something else? Are the desktop effects enabled? Do you have any other system to test on? See if you can reproduce it with other GPUs too. Also, is this issue hit on a specific map or during a specific action in the game?
Hi p.m.kinnari, any update on the requested information? I think you are facing this issue only while playing a game.
Hi dad_gf, I think you are facing this issue even without playing a game, right? Make sure you have proper cooling. This kind of issue is normally due to GPU temperature.
GPU temp is OK: I played the same title (ROTR) on Windows 10 for 2 hours with no problem. I also monitored my GPU temp and it’s OK. I face the issue (with different Xids, for example 8 and 16) on Linux under some kinds of GPU load, even light ones, and not only in games: it can be the Chrome browser or even Google Earth. The screen goes blank during these Xids. But mining loads (near 100%), for example, work for 10 hours without any issue. Could this be an NVIDIA driver or Ubuntu 18.10 (unstable version) bug?
I went back to the current driver and kernel. So I have tested it with the 4.14 and 4.18 kernels as well as the 390 and 396 drivers: both kernels with both driver versions.
I wiped my Steam folder recently, and at the moment I have only played Wargame: Red Dragon. Playing time in the single-player campaign is 5-10 minutes until the crash. No custom settings. A crash with similar symptoms also happens with, for example, World of Warships (via Wine).
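To keep track of which kernel and driver combinations have been tried, the test matrix can be written down as a small sketch like the one below. Only the outcomes actually mentioned in this thread are filled in; reading "current" as kernel 4.18 with driver 396 is an assumption, and the other two combinations are left unrecorded because their results were not stated.

```python
from itertools import product

kernels = ["4.14", "4.18"]
drivers = ["390", "396"]

# None means "outcome not stated in the thread".
results = {combo: None for combo in product(kernels, drivers)}
results[("4.14", "390")] = "ran over 4 hours under load"
results[("4.18", "396")] = "crashed"  # assumption: "current" means 4.18 + 396

for (kernel, driver), outcome in results.items():
    print(f"kernel {kernel} + driver {driver}: {outcome or 'not recorded'}")
```

Filling in the two missing entries would show whether the crash follows the kernel, the driver, or neither.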
Resolution is 1920x1080.
I am using Awesome. I do have KDE installed as well, but I haven’t tried with it.
Today I finally swapped my 660 back in and I was able to play longer. The crash was also different: instead of the signal going missing and the display going to standby, it froze completely. And the system doesn’t respond to ping (earlier I was able to ssh in and run the bug reporting tool).
I will attach the journal from the boot before the crash, but there isn’t anything prior to the boot there. I also ran the bug reporting tool after the boot, for what it is worth. nvidia-bug-report.log.gz (109 KB) journal.txt (135 KB)
Swapped in a new AMD card today. Ran it for longer than I remember being able to before, without any issues.
But while installing that card I noticed that my system was using a very old xorg.conf, and I began wondering whether that might be a cause of my problems. So next I will swap the 970 back in to see if the situation is still the same.
Aand the crashes are back. Interestingly, I got different errors from Wargame: Red Dragon and Bioshock Infinite. Red Dragon still has the old “Fallen off the bus” error, but Bioshock has a long list of graphics exceptions. The observable result was the same, though: the screen went to standby, but I was able to ssh in and generate the bug report, which is attached to this message.
I also ran a 1-hour test with this: Multi-GPU CUDA stress test, but wasn’t able to recreate the crash. I think I need to run a longer one, though, as it took longer than that to make Bioshock crash.