“GPU has fallen off the bus” error

Hello there, I’ve been using my 4090 with no problems for around half a year, but suddenly the GPU started throwing the “GPU has fallen off the bus” error. It happens during training [transformers and a bunch of other models, power draw from 200W to 450W depending on the size of the model] and also while idle [at no particular moment, but pretty early, like 2-3 minutes into the launch].

Things I’ve tried:

  1. sudo nvidia-smi -pl 300, and the same with 250: didn’t help

Here are the logs. I’m attaching the monitored temps [before and after the crash, they seem pretty normal to me].
In them we can observe:

  1. the state before the launch of the training (temperatures hanging around 40 degrees)
  2. during training (~60 degrees)
  3. crashed
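
For anyone who wants to capture a similar log, here is a minimal sketch in Python, assuming `nvidia-smi` is on the PATH. It polls temperature, power draw and SM clock and appends them to a file, flushing after every sample so the last readings survive a hard reset. The helper names (`parse_sample`, `log_samples`) and the output file name are made up for illustration, and this is not necessarily how the attached temp.log was produced.

```python
import subprocess
import time

# Field names accepted by `nvidia-smi --query-gpu` (see `nvidia-smi --help-query-gpu`).
FIELDS = ["timestamp", "temperature.gpu", "power.draw", "clocks.sm"]


def parse_sample(line):
    """Turn one CSV line from nvidia-smi into a dict; unreadable values become None."""
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != len(FIELDS):
        return None

    def to_float(text):
        try:
            return float(text)
        except ValueError:
            # e.g. "[N/A]" when the driver cannot read the value
            return None

    return {
        "timestamp": parts[0],
        "temp_c": to_float(parts[1]),
        "power_w": to_float(parts[2]),
        "sm_clock_mhz": to_float(parts[3]),
    }


def log_samples(path="gpu_samples.log", interval_s=1.0):
    """Append one sample per GPU per interval until nvidia-smi stops answering."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    with open(path, "a") as out:
        while True:
            result = subprocess.run(cmd, capture_output=True, text=True)
            if result.returncode != 0:
                # Typically the point where the GPU is no longer visible to the driver.
                out.write("nvidia-smi failed: " + (result.stdout + result.stderr).strip() + "\n")
                out.flush()
                break
            out.write(result.stdout)
            out.flush()  # keep the file current in case the machine needs a hard reset
            time.sleep(interval_s)
```

Calling `log_samples()` in a separate terminal before starting training gives a timeline that can be lined up with the moment of the crash; `parse_sample` can then be used to read the file back line by line.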

I also attached the output of nvidia-bug-report.sh:

  1. taken before the crash [in normal state, the GPU is working fine]
  2. taken after the crash [calling nvtop shows no GPU; memory and utilization are both N/A]

I would like to solve the problem [how do I keep my GPU from ever crashing again?].
I’m running pretty expensive training runs, sometimes spanning days, and I would love to keep using my GPU non-stop :D

If it’s important: when the GPU dies, the fans go to 100% [they’re so loud] and I can’t reboot the system from the terminal. When I type sudo reboot, nothing happens and I can continue to use the system as usual, so I have to use the physical button on my PC.

My power supply is a 1200W Platinum SuperFlower and I’m running a Ryzen 9 5950X in the system, so I can’t imagine it’s a power draw problem.

nvidia-bug-report.log.gz (396.0 KB)
temp.log (112 KB)
nvidia-bug-report-crashed.log.gz (191.2 KB)

The NVIDIA GPU has turned off, which points to a power issue. While the total wattage of the PSU should be sufficient, ML workloads produce heavy power spikes, which some PSUs detect as a short circuit.
To make it more stable, you can try limiting clocks, e.g.
nvidia-smi -lgc 300,1500
or get a different psu model.

Since this is a 4090, please also check power connectors.
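
To confirm what the driver reported at the moment of the crash, one option is to scan the kernel log for NVRM Xid lines; Xid 79 is the code that accompanies the “GPU has fallen off the bus” message. Below is a minimal sketch in Python, assuming the log text comes from `dmesg` or `journalctl -k` (reading it may require root) and that the lines follow the usual `NVRM: Xid (PCI:...): <code>, ...` shape. The function names are made up for illustration, and the example line in the comment is illustrative, not taken from the attached reports.

```python
import re

# Illustrative shape of a matching line (not copied from the attached logs):
#   NVRM: Xid (PCI:0000:01:00): 79, pid=1234, GPU has fallen off the bus.
XID_PATTERN = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+),\s*(.*)")


def find_xid_events(log_text):
    """Return (pci_address, xid_code, message) for every Xid line in the log text."""
    events = []
    for line in log_text.splitlines():
        match = XID_PATTERN.search(line)
        if match:
            events.append((match.group(1), int(match.group(2)), match.group(3).strip()))
    return events


def fell_off_bus(events):
    """True if any event carries Xid 79 (GPU has fallen off the bus)."""
    return any(code == 79 for _, code, _ in events)
```

Feeding it the output of `dmesg` saved right after a crash shows whether Xid 79 was the first error reported or whether other Xid codes preceded it, which can help separate a power or connector problem from other faults.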

Thank you very much for a fast reply!

I actually have a 2000W PSU (same SuperFlower Platinum) lying around, since I wanted to set up a 2x4090 server. I’ll try swapping that in first, since I don’t really want to limit the GPU clocks [but I’ll try that too to see if the issue persists] :D

Regarding the connector: it looks all right visually. I’ve tried to wiggle it a bit, and it seems firmly seated.
I’ll report back once I’ve run my experiments with the clock limits and the power supply swap.

I’ve just tried limiting the clocks as you suggested (300,1500), and it didn’t work.
That was expected, since sometimes the GPU falls off the bus even when idling [power consumption is low].

Probably a PSU issue all the way.

UPD: I removed the left side panel from my case (a Lian Li O11 Dynamic XL, the widest I was able to find on the market), since my power cable was getting bent a bit. Now the system works fine, though I don’t know for how long. Let’s see if it helps.

Has it failed since you “fixed” the cable?