I have this intermittent instability problem. Sometimes the GPU will crash with Xid error 16 in the logs (see the attached nvidia bug reports).
The hang happens quite intermittently and is a bit difficult to reproduce. If I run no graphically demanding applications, or only lightly demanding ones, it might happen only once every 240h of usage (i.e. maybe once a month). But with XCOM: Enemy Unknown (which is not that demanding, IMO) it happens more frequently, say once every hour.
While trying to rule out defective hardware I have:
- set the fan to a constant, high speed (50%, which is a much higher RPM than it would normally ramp up to in any scenario), and
- underclocked the GPU with a -100MHz offset
This is why I currently have “Coolbits” enabled in xorg.conf, in case you are wondering. I have not overclocked the GPU, and the hangs happened before I enabled it. To put it more precisely, I only enabled it while trying to rule out thermal / HW issues.
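For anyone who wants to repeat the same two tests, here is a rough sketch of how the settings could be applied, assuming nvidia-settings is the tool used (with Coolbits enabled). The helper name is made up for illustration, and the perf level index 3 and the gpu/fan indices 0 are assumptions that may differ per card and driver. It only builds and prints the command, it does not run it.

```python
import shlex


def build_nvidia_settings_cmd(fan_percent=50, clock_offset_mhz=-100,
                              gpu=0, fan=0, perf_level=3):
    """Build (but do not run) an nvidia-settings command line.

    Hypothetical helper for illustration. gpu, fan and perf_level are
    assumed indices; check what nvidia-settings reports on your system.
    """
    if not 0 <= fan_percent <= 100:
        raise ValueError("fan_percent must be between 0 and 100")

    assignments = [
        # Take manual control of the fan (needs Coolbits).
        f"[gpu:{gpu}]/GPUFanControlState=1",
        # Constant fan speed in percent.
        f"[fan:{fan}]/GPUTargetFanSpeed={fan_percent}",
        # Graphics clock offset in MHz for the given performance level.
        f"[gpu:{gpu}]/GPUGraphicsClockOffset[{perf_level}]={clock_offset_mhz}",
    ]

    cmd = ["nvidia-settings"]
    for assignment in assignments:
        cmd += ["-a", assignment]
    return cmd


# Print a shell-quoted command that can be reviewed before running it.
print(shlex.join(build_nvidia_settings_cmd()))
```

Printing the command instead of executing it makes it easy to check the values first and to paste the line into a terminal inside the X session.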
With underclocking (-100MHz) a different error is produced. With the stock clock, I first get Xid error 16 in xorg.log, but with the slight underclock, I get “GPU has fallen off the bus”.
However, the end result (from the user's perspective) is the same in both cases: the GPU hangs and the screen goes black (or sometimes a dark hue, blue or purple, for 2-10 seconds), along with X.org and all processes under X.org. Switching to VCs no longer works, but SysRq still does. After a while (less than a minute) my TV says “not connected”. I can log in via SSH and usually (95% of the time) shut down the system gracefully. The shutdown takes ages because the system waits for the user processes under X.org to stop while they are in a broken state, and even the shutdown is not 100% reliable. If I let the system run after the GPU has hung, I believe the whole system (kernel?) hangs after perhaps 10-120 minutes, after which logging in via SSH is no longer possible.
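Since SSH still works for a while after the hang, the two failure modes can be told apart from the log text. Below is a minimal sketch that counts them, assuming the driver messages look like "NVRM: Xid (...): <number>" and contain the phrase "fallen off the bus". The exact message format varies between driver versions, so the patterns are assumptions, and the function names are made up for illustration.

```python
import re
from collections import Counter

# Assumed message shapes; adjust if your driver logs them differently.
XID_RE = re.compile(r"NVRM: Xid \(([^)]*)\): (\d+)")
BUS_RE = re.compile(r"fallen off the bus", re.IGNORECASE)


def classify(line):
    """Return a short label for an NVIDIA error line, or None."""
    match = XID_RE.search(line)
    if match:
        # group(2) is the Xid number, e.g. "16".
        return f"xid-{match.group(2)}"
    if BUS_RE.search(line):
        return "fallen-off-bus"
    return None


def summarize(lines):
    """Count each kind of NVIDIA error found in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        kind = classify(line)
        if kind is not None:
            counts[kind] += 1
    return counts
```

For example, summarize(open(path)) on a saved kernel log or on one of the bug report outputs (path being wherever the file was saved) gives a quick count per error type, which makes it easier to compare the stock-clock and underclocked runs.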
I do not think I have ruled out defective hardware, but I think it was more stable with an older version of the driver (I don’t know which, since I game only intermittently, and if I do not strain the card, it is quite stable). I believe the card is still under warranty, so if you have any tips on how to determine whether this might be faulty hardware after all, that would be appreciated, too!
I forgot to give my system details:
- Arch Linux, same issue with several different kernels (I have tried stock=4.7.4, zen=4.7.3 and lts=4.4.21 branches in Arch)
- EVGA GTX 970 (04G-P4-3975-KR)
- ASUS Maximus VII Gene + i7-4790k + 16GB RAM
- Nvidia 370.28 - but at least the previous version was affected, too!
EDIT: some minor wording edits, also made the experienced behaviour description more precise
Attached some nvidia-bug-report.sh outputs:
- The one with the "normalclock-hang" prefix is with stock settings. Xid error 16, along with other errors...
- "underclocked-fallofbus" is with the -100MHz underclock. (GPU falls off the bus)
- "ok" is the output from a run of the script while the system seemed to be running normally (before the crash).