I’ve installed NVIDIA-SMI 530.30.02 on Ubuntu 20 and am running 4x RTX 3090 on an HSL12SSL motherboard with an EPYC 7402.
nvidia-bug-report.log.gz (1.2 MB)
PyTorch appears to crash the system continuously. The system runs crypto mining or hashcat without any problem, but PyTorch repeatedly causes a crash/reboot. I’m having a hard time understanding what could be causing this. I’ve attached nvidia-bug-report.log.gz and would highly appreciate any opinions and feedback on this matter.
You’re getting an Xid 79: the GPU has fallen off the bus. The most common causes are overheating or insufficient power. Monitor temperatures, reseat the power connectors and the card in its slot, and check or replace the PSU.
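If it helps to confirm which card is dropping out, here is a minimal Python sketch that pulls Xid events out of kernel log text. It assumes the entries follow the usual `NVRM: Xid (PCI:<address>): <code>, ...` shape; the function name `find_xid_events` is made up for this example, and the exact log wording may differ between driver versions.

```python
import re

# Assumed shape of an NVIDIA kernel log entry:
#   "NVRM: Xid (PCI:<address>): <code>, ..."
XID_PATTERN = re.compile(r"NVRM: Xid \((?P<device>[^)]+)\): (?P<code>\d+)")


def find_xid_events(log_text):
    """Return a list of (device, xid_code) tuples found in log_text."""
    events = []
    for line in log_text.splitlines():
        match = XID_PATTERN.search(line)
        if match:
            events.append((match.group("device"), int(match.group("code"))))
    return events
```

Feed it the text from `sudo dmesg` or `journalctl -k` and look for entries whose code is 79. If the same PCI address shows up every time, that card, its slot, or its power cables are the first things to check.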
To check for power issues, you can use nvidia-smi -lgc to lock the GPU clocks to a fixed range and prevent boost situations, e.g.
nvidia-smi -lgc 300,1500
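To watch temperature, power draw, and SM clock while the PyTorch job runs with the clock lock in place, something like the following Python sketch could be used. It relies on the `nvidia-smi --query-gpu` CSV output; the helper names (`parse_gpu_csv`, `read_gpus`) are invented for this example, and fields the driver cannot report are assumed to come back as non-numeric placeholders such as `[N/A]`.

```python
import csv
import subprocess

# Fields requested from nvidia-smi, in output column order.
QUERY_FIELDS = ["index", "temperature.gpu", "power.draw", "clocks.sm"]


def _to_float(value):
    """Convert a CSV cell to float, or None if it is not numeric (e.g. "[N/A]")."""
    try:
        return float(value)
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Parse 'csv,noheader,nounits' output into one dict per GPU."""
    readings = []
    for row in csv.reader(text.strip().splitlines()):
        if len(row) != len(QUERY_FIELDS):
            continue  # skip blank or malformed lines
        index, temp, power, clock = (cell.strip() for cell in row)
        readings.append({
            "index": int(index),
            "temp_c": _to_float(temp),
            "power_w": _to_float(power),
            "sm_clock_mhz": _to_float(clock),
        })
    return readings


def read_gpus():
    """Run nvidia-smi once and return the parsed readings."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=" + ",".join(QUERY_FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)
```

Calling `read_gpus()` every second or so and logging the result to a file gives you the last readings before a crash. If the system stays up with locked clocks but falls over at stock settings, that points toward power delivery. `nvidia-smi -rgc` removes the clock lock again.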
I really appreciate your response on this issue. I changed the power supply a few hours ago and that appears to have fixed the problem. I am still testing, but so far it is running okay.