uname: Linux h2 5.10.215 #1-NixOS SMP Sat Apr 13 10:59:59 UTC 2024 x86_64 GNU/Linux
NVRM version: NVIDIA UNIX x86_64 Kernel Module 550.78 Sun Apr 14 06:35:45 UTC 2024
GCC version: gcc version 13.2.0 (GCC)
Hi! I have a machine learning workflow I’m running on a 3090.
Sometimes though, the 3090 falls off the bus and the fan starts spinning very loudly. At that point, nvidia-smi returns:
Unable to determine the device handle for GPU0000:01:00.0: Unknown Error
I originally read on the internet that this could be caused by power spikes, so I bought a larger PSU (1000W), but the problem persists.
I’m also fairly confident it’s the card’s fault, because I have a smaller 3070 with which this kind of failure has never happened (even when installed in the same PCIe slot).
I’ve attached my nvidia-bug-report.log.
I would like your suggestions on:
understanding why this happens so that I can mitigate the problem
understanding whether there is a way to restore normal operation of the GPU without a reboot
Sorry, didn’t read your post, just the log.
My previous post still stands: a better PSU doesn’t mean a PSU with more wattage, it means one that can withstand the power spikes you already know about. Rather, check out a PSU with high efficiency; those are more likely to have good capacitors to soften the peaks. If that still doesn’t work, there’s the slim chance of the mainboard’s voltage regulators freaking out or, lastly, of the GPU being broken. That’s the <1% chance.
Thank you generix! I didn’t know efficiency was a characteristic I should have looked at when choosing a PSU. How is it measured (basically, how do I distinguish good PSUs)?
Also, could you explain how you determined that this was the problem from reading the log? I’d like to be able to do that myself.
I noticed (anecdotally) that the freeze seems more likely when GPU utilization has a lot of spikes.
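To turn that from an anecdote into data, I’m thinking of logging utilization and power draw around the crashes with something like this, sampling once per second (the log path is just an example):
nvidia-smi --query-gpu=timestamp,power.draw,utilization.gpu,clocks.sm --format=csv -l 1 > /tmp/gpu_power.csv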
Lastly, by trial and error I determined that this command:
nvidia-smi -lgc 300,1200
seems to stabilize things for now. Maybe I could just find the right max value and put it in a script?
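For reference, this is roughly the script I have in mind; the clock values are just placeholders I’d still have to tune:
#!/usr/bin/env bash
# Sketch of a clock-cap script; MIN_CLOCK/MAX_CLOCK are values I'd still have to tune.
MIN_CLOCK=300
MAX_CLOCK=1200
# Enable persistence mode so the clock limit isn't lost when the driver unloads.
sudo nvidia-smi -pm 1
# Lock the GPU core clock range; this can be undone later with: sudo nvidia-smi -rgc
sudo nvidia-smi -lgc "${MIN_CLOCK},${MAX_CLOCK}"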
I don’t think I reused any cable/adapter, because the old PSU had them bundled, but I will double check just to be sure.
That’s the usual test I recommend to check for PSU issues. It’s a test and a workaround, nothing more.
sudo lspci -xx -d 10de:*
will output a lot of 0xff, meaning the GPU is off the bus, i.e. a power issue occurred.
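If you want to script that check, something like this (untested sketch) should catch it; it just tests whether the first config-space bytes, which normally hold the NVIDIA vendor ID, read back as 0xff:
sudo lspci -xx -d "10de:*" | grep -q "^00: ff ff" && echo "GPU config space reads 0xff - card has fallen off the bus"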
The usual advertised efficiency ratings are 80/85/90%+, sold as 80 PLUS Bronze, Silver, Gold, Platinum (Titanium?). ML workloads are the most demanding; especially with a xx90, go for Platinum. If both your PSUs were the same brand, change brands.