System turning off even after power supply replaced with new one

A few weeks ago, my system with dual 1070s (nvidia-driver-515) started powering itself down when I ran the same CUDA code I’d been running for months. It is as though I had held the power button down (or turned off the switch on the power supply). I bought a new power supply and installed it. No difference. I took the 1070s out and put in a 1030, and it works fine (except that I can’t run both monitors). So I then added a K80 I had lying around and, after reverting to the legacy driver (nvidia-driver-470) that supports it, I ran the same CUDA code. Power went off. The same CUDA code runs OK (but slowly) on the 1030.

The code also runs OK in CPU only mode (consuming all processors for hours on end).

I’m guessing it’s a fried PCI-e chip on the motherboard, but I can’t be certain it wasn’t a software update for Ubuntu upon which the CUDA drivers rely.

Ubuntu 22.04 LTS
AMD Ryzen 7 3700X 8-core processor × 16

PS: It does seem to be getting worse, as I can trigger the power down by reinstalling a 1070 without even running the CUDA code, just by using the X11 graphics capability of the nvidia driver.

An instant shutdown is only triggered by the mainboard, independent of software.

I replaced the motherboard and CPU and it still powers off with anything except the 1030. Since the K80 wasn’t in the prior system, that narrows things down to RAM. However, running the Linux kernel memory test shows no errors.

So, this is getting rather interesting.

Defective RAM shouldn’t trigger the mainboard to power down.
Which brand/model is the new PSU?

Same brand/model as the old PSU as I had no reason to believe there was a problem specific to that brand/model and I wanted to minimize the work to recable the system.

Another data point: I moved one of the 1070s to an old Windows 10 system, updated the drivers and ran the same program that initially had exposed the problem on the Ubuntu system. It runs without any problem.

Another really interesting data point:

The system doesn’t fail if I boot to tty and avoid using any graphics while running the CUDA program.

I even sought to maximize power utilization by changing the program’s parameters to use more of the GPU memory (it had always maxed out at 100% of the CUDA cores) and running it for a lot longer than usual. I thus far have not been able to get the system to power itself down.
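For anyone wanting to record what the GPUs were drawing right up to a power-off, here is a minimal Python sketch of a logger built on `nvidia-smi`. It is not something I ran for the tests above. The log file name and the one-second interval are arbitrary assumptions, and the helper names `parse_smi_line` and `log_gpu_state` are made up for illustration. Each sample is forced to disk so the last reading should survive an abrupt shutdown.

```python
import os
import subprocess
import time

# Properties requested from nvidia-smi via --query-gpu.
FIELDS = ["index", "temperature.gpu", "fan.speed", "power.draw", "utilization.gpu"]


def parse_smi_line(line):
    """Turn one CSV row from nvidia-smi into a dict of floats (None if unavailable)."""
    values = [v.strip() for v in line.split(",")]
    row = {}
    for name, value in zip(FIELDS, values):
        try:
            row[name] = float(value)
        except ValueError:
            # nvidia-smi prints a bracketed marker such as "[N/A]" for
            # sensors a card does not expose.
            row[name] = None
    return row


def log_gpu_state(path="gpu_power.log", interval_s=1.0):
    """Append one timestamped sample per GPU to `path` until interrupted."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    with open(path, "a") as log:
        while True:
            out = subprocess.run(
                cmd, capture_output=True, text=True, check=True
            ).stdout
            for line in out.strip().splitlines():
                log.write(f"{time.time():.1f} {parse_smi_line(line)}\n")
            # Force the sample to disk so it survives an abrupt power-off.
            log.flush()
            os.fsync(log.fileno())
            time.sleep(interval_s)
```

Calling `log_gpu_state()` in a second terminal (or over SSH) while the CUDA program runs would leave the final temperature, fan and power readings in `gpu_power.log` after a crash.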

I then discovered that this hardware configuration works just fine if I do not boot into runlevel 3 (i.e., tty) and instead run the GUI as usual! The 550 can’t run in anything but VGA mode since its supporting legacy driver isn’t selected, so it only participated in the boot process until the GUI started running on both my HDMI monitors.

Just to make sure the problem wasn’t one of the PCI slots, I swapped the GPUs’ PCI slots, rebooted to the GUI, and it still works. However, the 550 now contributes nothing, not even VGA signals during boot, because Ubuntu selects the GPU nearest the CPU for Xorg.

The 550 shares the PCI power cable with the 1070 (as did the other 1070 when it was in use).

So, feeling lucky, I removed the 550 from the system and the system still works. Now I’m wondering if I can make it fail at all anymore. Put the 1030 back into the secondary PCI slot. Still works. Swapped the PCI-e power plugs to see if that had any effect. Still works. Plugged one monitor into the 1030 and one into the 1070. Still works.

Replaced the 1030 with the second 1070, but left only one monitor connected (to the primary PCIe slot) and the program runs on either of the GPUs without crashing.

Hooked both monitors up, one to each 1070. Still works.

Ran two copies of the CUDA code, one assigned to each 1070. All GPU cores in use. Temp 72C and 68C. Fans 40% and 50%. Still works.
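One way to pin a copy of a program to each GPU is the `CUDA_VISIBLE_DEVICES` environment variable, which limits the devices the CUDA runtime exposes to a process. The Python sketch below only illustrates that approach and is not necessarily how the two copies above were assigned. The helper names and the `./my_cuda_program` path in the usage note are placeholders.

```python
import os
import subprocess


def env_for_gpu(gpu_index, base_env=None):
    """Return a copy of the environment in which only one GPU is visible to CUDA."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def launch_on_each_gpu(command, gpu_indices=(0, 1)):
    """Start one copy of `command` per GPU and return their exit codes."""
    procs = [subprocess.Popen(command, env=env_for_gpu(i)) for i in gpu_indices]
    return [p.wait() for p in procs]
```

For example, `launch_on_each_gpu(["./my_cuda_program"])` would start both copies at once, so both cards are under load at the same time.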

Ran the CPU version at the same time, putting CPU usage (all cores) at >80%. Still works.

PS: I was unable to exclude Xorg processes from the 1070 by /etc/X11/xorg.conf etc. I had to modify GRUB to exclude Xorg entirely. Moreover, I had to replace the 1030 with a 550 and hook up a VGA monitor so I could interact with the tty.