A few weeks ago, my system with dual 1070s (nvidia-driver-515) started powering itself off when I ran the same CUDA code I’d been running for months. It is as though I had held the power button down (or flipped the switch on the power supply). I bought a new power supply and installed it. No difference. I took the 1070s out and put in a 1030, and it works fine (except that I can’t run both monitors). I then added a K80 I had lying around and, after reverting to the legacy driver that supports it (nvidia-driver-470), ran the same CUDA code. The power went off. The same CUDA code runs OK (but slowly) on the 1030.
The code also runs OK in CPU-only mode (consuming all processors for hours on end).
I’m guessing it’s a fried PCIe chip on the motherboard, but I can’t rule out an Ubuntu software update to something the CUDA drivers rely on.
System:
Ubuntu 22.04 LTS
AMD Ryzen 7 3700X 8-core processor x 16
PS: It does seem to be getting worse: I can now trigger the power-down just by reinstalling a 1070 and using the X11 graphics capability of the nvidia driver, without even running the CUDA code.