Symptoms: Under load, the graphics card crashes and takes the graphics system down with it. The screen pixelates, and the Xorg process starts hogging 100% of one CPU core. The OS is still alive: it’s still possible to ssh in and, for example, run the nvidia-bug-report.sh script, but a reboot is necessary to recover any desktop functionality.
The problem does not seem to be overheating: 57 degrees is the highest I’ve seen before a crash, and it can happen even at temps below 50. It’s also not the PSU, which I first suspected, because the problem remains even with a new Seasonic. I’ve also run memtest and a CPU burn-in to rule out those culprits. The system is rock solid as long as the GPU is not stressed.
The crash is reproducible under both Devuan ascii (driver 284) and Ubuntu bionic (driver 290). The respective bug report logs are here:
Devuan: https://miria.homelinuxserver.org/private/nvidia-bug-report.log.devuan.gz
Ubuntu: https://miria.homelinuxserver.org/private/nvidia-bug-report.log.ubuntu.gz
For these tests I’ve used furmark from the phoronix test suite to stress the GPU, but the source of the load doesn’t really seem to matter, only the stress level. When gaming, it’s pretty unpredictable how long the card holds on before crashing; it depends on the game and how intensive the scenes are. When running a GPU benchmark, it typically succumbs in under 30 seconds.
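Since ssh keeps working after the crash, one way to capture what the card reports in the seconds before it goes down is to log sensor readings from a second session while the benchmark runs. The following is only a rough sketch, not something taken from the logs above: it assumes nvidia-smi is installed alongside the driver and that the card exposes the queried fields (unsupported ones are recorded as None). The log file name and the one-second interval are arbitrary choices.

```python
#!/usr/bin/env python3
"""Log GPU sensor readings at a fixed interval until interrupted (Ctrl-C)."""
import subprocess
import time

# Fields passed to `nvidia-smi --query-gpu`. Which of these a given card
# and driver actually report is an assumption; unsupported ones become None.
FIELDS = ["temperature.gpu", "fan.speed", "power.draw", "clocks.sm"]


def parse_sample(line):
    """Turn one CSV line from nvidia-smi into a dict of floats (None if unavailable)."""
    values = [v.strip() for v in line.split(",")]
    sample = {}
    for name, raw in zip(FIELDS, values):
        try:
            sample[name] = float(raw)
        except ValueError:
            # e.g. "[Not Supported]" for a sensor the card does not expose
            sample[name] = None
    return sample


def read_sample():
    """Query the first GPU once and return the parsed readings."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE,
        universal_newlines=True, timeout=5, check=True,
    )
    return parse_sample(result.stdout.splitlines()[0])


def main(logfile="gpu-sensors.log", interval=1.0):
    """Append one timestamped reading per interval to the (hypothetical) log file."""
    with open(logfile, "a") as log:
        while True:
            try:
                sample = read_sample()
            except (subprocess.SubprocessError, OSError, IndexError) as exc:
                # Once the GPU has crashed, nvidia-smi may hang or fail;
                # record that instead of stopping the logger.
                sample = {"error": str(exc)}
            log.write("{} {}\n".format(time.strftime("%H:%M:%S"), sample))
            log.flush()  # keep the file current in case a reboot follows
            time.sleep(interval)


if __name__ == "__main__":
    main()
```

Started over ssh just before launching the benchmark, this leaves the last temperature, fan, power and clock readings before the crash in gpu-sensors.log, followed by error lines once nvidia-smi stops responding.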
I’ve tried adding an xorg.conf with the Coolbits option to see whether manually setting the fan speed to max would help, but it doesn’t. (The Devuan logs are from this failed experiment.) Any help & suggestions would be appreciated.