GPU: 2 × 3090
Hi! I run a machine learning workflow separately on two 3090s, with identical code and configuration on both cards. I had run it several times without any problem until last week: GPU0 fell off the bus after about 80 hours of work, while GPU1 kept running fine. At that point I could no longer use ‘nvidia-smi’ to check the GPU state. After rebooting the problem seemed fixed, but it appeared again on the same GPU when I tried another training run.
Since GPU1 works fine under the same configuration while GPU0 does not, it does not appear to be a PSU problem.
So I logged the temperature before the fall-off occurred. It stays around 75 °C, far below the shutdown temperature, so it does not seem to be an overheating issue either. The logging is just a simple polling loop around nvidia-smi, roughly like the sketch below (the interval and log path there are illustrative, not the exact values I use).
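```python
#!/usr/bin/env python3
"""Minimal sketch of the temperature logging: poll nvidia-smi and append
per-GPU temperature/power/utilization to a file. The 5-second interval and
the log file name are placeholders."""
import subprocess
import time
from datetime import datetime

LOG_FILE = "gpu_temps.log"   # illustrative path
INTERVAL_S = 5               # illustrative polling interval

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,temperature.gpu,power.draw,utilization.gpu",
    "--format=csv,noheader",
]

def poll_once() -> str:
    """Run one nvidia-smi query; raises if the driver has lost the GPU."""
    return subprocess.check_output(QUERY, text=True, timeout=10).strip()

if __name__ == "__main__":
    with open(LOG_FILE, "a") as log:
        while True:
            stamp = datetime.now().isoformat(timespec="seconds")
            try:
                for line in poll_once().splitlines():
                    log.write(f"{stamp} {line}\n")
            except (subprocess.CalledProcessError, subprocess.TimeoutExpired) as exc:
                # After the card falls off the bus, nvidia-smi hangs or errors,
                # so the failure time also ends up recorded in the log.
                log.write(f"{stamp} nvidia-smi failed: {exc}\n")
            log.flush()
            time.sleep(INTERVAL_S)
```
The last entries before the crash still show GPU0 at roughly 75 °C.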
I would like your suggestions on:
- how I can track down the root cause and get rid of it.
Here is the log file.
nvidia-bug-report.log.gz (367.6 KB)