uname: Linux h2 5.10.215 #1-NixOS SMP Sat Apr 13 10:59:59 UTC 2024 x86_64 GNU/Linux
NVRM version: NVIDIA UNIX x86_64 Kernel Module 550.78 Sun Apr 14 06:35:45 UTC 2024
GCC version: gcc version 13.2.0 (GCC)
Hi! I have a machine learning workflow I’m running on a 3090.
Sometimes though, the 3090 falls off the bus and the fan starts spinning very loudly. At that point, nvidia-smi returns:
Unable to determine the device handle for GPU0000:01:00.0: Unknown Error
I originally read on the internet that this could be caused by power spikes, so I bought a larger PSU (1000W), but the problem persists.
I’m also fairly confident it’s the card’s fault, because I have a smaller 3070 with which this kind of failure has never happened (even when installed in the same PCIe slot).
I’ve attached my nvidia-bug-report.log.
I would like your suggestions on:
understanding why this happens so that I can mitigate the problem
understanding whether there is a way to restore normal operation of the GPU without a reboot
Sorry, didn’t read your post, just the log.
My previous post still stands: a better PSU doesn’t mean a PSU with more wattage, it means one that can withstand the power spikes you already know about. Rather, check out a PSU with high efficiency; those are more likely to have good capacitors to soften the peaks. If that still doesn’t work, there’s the slim chance of the mainboard’s voltage regulators freaking out or, lastly, of the GPU being broken. That’s the <1% chance.
Thank you generix! I didn’t know efficiency was a characteristic I should have looked at when choosing a PSU. How is it measured (basically, how do I distinguish good PSUs)?
Also, could you explain how you determined that this was the problem from reading the log? I’d like to be able to do that myself.
I noticed (anecdotally) that the freeze seems more likely when GPU utilization has a lot of spikes.
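To turn that from an anecdote into data, I’m thinking of logging utilization and power draw around the crashes with something like this, sampling once per second (the log path is just an example):
nvidia-smi --query-gpu=timestamp,power.draw,utilization.gpu,clocks.sm --format=csv -l 1 > /tmp/gpu_power.csv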
Lastly, by trial and error I determined that this command:
nvidia-smi -lgc 300,1200
seems to stabilize things for now. Maybe I could just find the right max value and put it in a script?
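For reference, this is roughly the script I have in mind; the clock values are just placeholders I’d still have to tune:
#!/usr/bin/env bash
# Sketch of a clock-cap script; MIN_CLOCK/MAX_CLOCK are values I'd still have to tune.
MIN_CLOCK=300
MAX_CLOCK=1200
# Enable persistence mode so the clock limit isn't lost when the driver unloads.
sudo nvidia-smi -pm 1
# Lock the GPU core clock range; this can be undone later with: sudo nvidia-smi -rgc
sudo nvidia-smi -lgc "${MIN_CLOCK},${MAX_CLOCK}"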
I don’t think I reused any cable/adapter, because the old PSU had them bundled, but I will double check just to be sure.
That’s the usual test I recommend to check for PSU issues. It’s a test and a workaround, nothing more.
sudo lspci -xx -d 10de:*
will output a lot of 0xff, meaning the GPU is off the bus, i.e. a power issue occurred.
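If you want to script that check, something like this (untested sketch) should catch it; it just tests whether the first config-space bytes, which normally hold the NVIDIA vendor ID, read back as 0xff:
sudo lspci -xx -d "10de:*" | grep -q "^00: ff ff" && echo "GPU config space reads 0xff - card has fallen off the bus"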
The usual advertised efficiency ratings are 80/85/90%+, sold as 80 PLUS Bronze, Silver, Gold, Platinum (Titanium?). ML workloads are the most demanding; especially with a xx90, go for Platinum. If both your PSUs were the same brand, change brands.