Unable to determine the device handle for GPU

justforted · October 15, 2022, 4:00pm

I run my torch model, and after about 20 minutes it gives the error below. I don't know what is wrong!
nvidia-bug-report.log (821.2 KB)

Every 1.0s: nvidia-smi zfx: Sat Oct 15 23:59:46 2022
Unable to determine the device handle for GPU 0000:01:00.0: Unknown Error

MarkusHoHo · October 18, 2022, 2:19pm

Hello @justforted, welcome to the NVIDIA developer forums.

Thank you for sharing the log file.

Could you please also share your complete setup? I can see an RTX 3080 Ti, but which CPU and what platform? Desktop or server? And is it correct that you are running Ubuntu 20.04 with an RTX 3080 Ti?

The errors I found in the log suggest that there might be something physically wrong with the GPU.

Oct 9 14:08:58 zfx kernel: [ 8683.941534] NVRM: Xid (PCI:0000:01:00): 79, pid=0, GPU has fallen off the bus.
Oct 9 14:08:58 zfx kernel: [ 8683.941536] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
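For anyone who wants to check their own logs for the same pattern, here is a minimal sketch that pulls Xid events out of kernel log lines. The regular expression is an assumption based only on the format of the two lines quoted above, and `find_xid_events` is a hypothetical helper name, not part of any NVIDIA tool.

```python
import re

# Assumed format, taken from the log excerpt above:
#   NVRM: Xid (PCI:<bus id>): <xid number>, <message>
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<bus>[0-9a-fA-F:.]+)\): (?P<xid>\d+),\s*(?P<msg>.*)"
)


def find_xid_events(lines):
    """Return (bus_id, xid_number, message) for every line that reports an Xid."""
    events = []
    for line in lines:
        match = XID_RE.search(line)
        if match:
            events.append(
                (match.group("bus"), int(match.group("xid")), match.group("msg").strip())
            )
    return events


# The two lines quoted from the bug report log.
sample = [
    "Oct 9 14:08:58 zfx kernel: [ 8683.941534] NVRM: Xid (PCI:0000:01:00): 79, pid=0, GPU has fallen off the bus.",
    "Oct 9 14:08:58 zfx kernel: [ 8683.941536] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.",
]

for bus, xid, msg in find_xid_events(sample):
    print(bus, xid, msg)
```

Only the first line matches, because the second one repeats the message without an Xid number. To scan a real log, the lines could come from a saved `dmesg` or syslog file instead of the `sample` list.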

Did you monitor the GPU temperatures while running your torch models? Could the GPU be running too hot?
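One way to record the temperature during a training run is to poll `nvidia-smi` from a small script and append the readings to a file, so the last values before the failure are preserved. This is only a sketch: the function names, the log file name `gpu_temp.log`, and the one-second interval are assumptions chosen for illustration.

```python
import subprocess
import time

# nvidia-smi query that prints one CSV line per GPU: "<timestamp>, <temperature in C>"
QUERY = [
    "nvidia-smi",
    "--query-gpu=timestamp,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_reading(csv_line):
    """Split one CSV line from the query above into (timestamp, temperature_c)."""
    timestamp, temp = [field.strip() for field in csv_line.split(",")]
    return timestamp, int(temp)


def log_temperatures(path="gpu_temp.log", interval_s=1.0):
    """Append readings to `path` until nvidia-smi fails, then record the failure."""
    with open(path, "a") as log:
        while True:
            result = subprocess.run(QUERY, capture_output=True, text=True)
            if result.returncode != 0:
                # Once the GPU is no longer reachable, nvidia-smi exits with an error.
                log.write(
                    f"nvidia-smi failed: {result.stdout.strip()} {result.stderr.strip()}\n"
                )
                break
            for line in result.stdout.strip().splitlines():
                timestamp, temp_c = parse_reading(line)
                log.write(f"{timestamp} {temp_c}\n")
            log.flush()  # keep the file current in case the machine hangs
            time.sleep(interval_s)
```

Calling `log_temperatures()` in a second terminal while the model trains leaves a timestamped record that can be compared against the time of the error in the kernel log.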

justforted · October 19, 2022, 2:12pm

Yes, Ubuntu 20.04 + driver x86_64-515.76.run + CUDA 11.6 + pytorch12. I didn't monitor temperatures, but if I just reboot the machine, the GPU fails with the error above after I run 3 epochs.
————————————————
After monitoring the temperature, the highest reading is just 63 °C. I don't know why this error occurs!

nvidia-bug-report.log (1.1 MB)
