Hello there, I’ve been using my 4090 with no problems for around half a year, but suddenly the GPU started throwing the “GPU has fallen off the bus” error. It happens either during training [transformers and a bunch of other models, power draw from 200W to 450W depending on the size of the model] or while idle [no particular moment, but pretty early, like 2-3 mins into the launch].
Things I’ve tried:
sudo nvidia-smi -pl 300
or the same with 250; neither helped
Here are the logs. I’m attaching the monitored temps [before and after the crash, they seem pretty normal to me]. In them we can observe:
- the state before the launch of the training (temperatures hanging around 40 degrees)
- during training (~60 degrees)
- crashed
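For anyone who wants to collect the same kind of data, a log like this can be gathered roughly as follows. This is only a sketch and not necessarily how temp.log was produced: the function names log_gpu_stats and hot_samples, the file name gpu_stats.csv, the 5 second interval and the 80 degree threshold are placeholders chosen for illustration; the nvidia-smi query fields themselves are real.

```shell
# Sketch only: function names, file name, interval and threshold are illustrative.

# Append one CSV sample (timestamp, temperature in C, power draw in W, fan speed in %)
# every $1 seconds to the file $2, until interrupted or until nvidia-smi stops responding.
log_gpu_stats() {
    nvidia-smi \
        --query-gpu=timestamp,temperature.gpu,power.draw,fan.speed \
        --format=csv,noheader,nounits \
        -l "${1:-5}" | tee -a "${2:-gpu_stats.csv}"
}

# Read such CSV lines on stdin and print the ones whose temperature
# (second column) is at or above the threshold given as $1.
hot_samples() {
    awk -F', ' -v t="${1:-80}" '$2 + 0 >= t'
}
```

Usage would be something like log_gpu_stats 5 gpu_stats.csv in a separate terminal while training, and afterwards hot_samples 80 < gpu_stats.csv to see whether anything ran hot before the crash.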
I also attached the output of nvidia-bug-report.sh:
- run before the crash [in the normal state, the GPU is working fine]
- run after the crash [calling nvtop shows no GPU, memory and utilization are both N/A]
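The “fallen off the bus” message itself comes from the NVIDIA kernel driver and shows up in the kernel log as an NVRM: Xid line (this particular error is reported as Xid 79). Below is a small helper for pulling those lines out after a crash. It is a sketch under assumptions: the function names are made up, reading the kernel log may need root depending on the distro, and looking at the previous boot only works if the systemd journal is persistent.

```shell
# Sketch only: these function names are illustrative, not NVIDIA tools.

# Keep only the kernel log lines in which the NVIDIA driver reports an Xid event.
xid_lines() {
    grep -F 'NVRM: Xid'
}

# Xid events from the current boot.
current_boot_xids() {
    sudo dmesg | xid_lines
}

# Xid events from the previous boot, useful after a hard reset
# (requires a persistent systemd journal).
previous_boot_xids() {
    sudo journalctl -k -b -1 --no-pager | xid_lines
}
```

Since I have to hard-reset the machine after each crash, previous_boot_xids is the variant that would catch the lines written right before the reset.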
I would like to solve the problem [how do I keep my GPU from crashing at all?].
I sometimes run pretty expensive training jobs that span days, so I would love to keep using my GPU non-stop :D
If it’s important: when the GPU dies, the fans go to 100% [they’re so loud] and I can’t reboot the system from the terminal. When I type sudo reboot, nothing happens and I can continue to use the system as usual, so I have to use the physical button on my PC.
My power supply is a 1200W Platinum Super Flower and the system runs a Ryzen 9 5950X, so I can’t imagine it’s a power draw problem.
nvidia-bug-report.log.gz (396.0 KB)
temp.log (112 KB)
nvidia-bug-report-crashed.log.gz (191.2 KB)