Unable to determine the device handle for GPU xxxxxxxx: Unknown Error

OS: Ubuntu 20.04.1 LTS
Driver Version: 515.65.01
GPUs: 4x RTX 3090
Power Supply: 4000W

Hi,
I keep encountering ‘Unable to determine the device handle for GPU 0000:xx:00.0: Unknown Error’, where xx varies, pointing to different GPU cards.

Here’s more information:

$ nvidia-debugdump --list
Found 4 NVIDIA devices
        Device ID:              0
        Device name:            NVIDIA GeForce RTX 3090
        GPU internal ID:        GPU-ffde8868-a687-26b5-6fa1-511ef8a21e93

Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x1): Unknown Error

Additionally, here’s nvidia-bug-report.sh output:
nvidia-bug-report.log.gz (570.4 KB)

I thought it might be caused by outdated drivers, so I updated them, but that didn’t help; the error keeps occurring.

Could somebody help me? Thanks a lot!

You’re getting an XID 79, “GPU has fallen off the bus”. Since it’s always a different GPU, this might be either overheating or a lack of (peak) power.
Please monitor temperatures and limit clocks using nvidia-smi -lgc to check for PSU issues.
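For example, a rough sketch with standard nvidia-smi options (adjust the interval and clock range to your cards):

# log temperature, power draw and SM clock of all GPUs every 5 seconds
$ nvidia-smi --query-gpu=timestamp,index,temperature.gpu,power.draw,clocks.sm --format=csv -l 5 | tee gpu-monitor.csv

# lock the GPU clocks to a conservative range to cap boost (and thus peak power draw)
$ sudo nvidia-smi -lgc 300,1500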

Hi, thanks for your reply.

It’s actually usually GPU 0 that falls off the bus.
If GPU 0 has no workload, then GPU 1 falls off the bus instead.

A lack of (peak) power seems unlikely, because we have another nearly identical server that runs eight 3090s at the same time stably. The only difference between the two servers is that the GPU cards come from different manufacturers and have different cooling designs.

I checked the temperature logs, and the core temperature is around 80 °C, which doesn’t seem too high.
Could it be a memory overheating problem? There are reports on the Internet that some 3090s have memory overheating issues.


==============NVSMI LOG==============

Timestamp                                 : Thu Oct 13 23:57:01 2022
Driver Version                            : 515.65.01
CUDA Version                              : 11.7

Attached GPUs                             : 4
GPU 00000000:1B:00.0
    Temperature
        GPU Current Temp                  : 78 C
        GPU Shutdown Temp                 : 98 C
        GPU Slowdown Temp                 : 95 C
        GPU Max Operating Temp            : 93 C
        GPU Target Temperature            : 83 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A

GPU 00000000:3E:00.0
    Temperature
        GPU Current Temp                  : 69 C
        GPU Shutdown Temp                 : 98 C
        GPU Slowdown Temp                 : 95 C
        GPU Max Operating Temp            : 93 C
        GPU Target Temperature            : 83 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A

GPU 00000000:89:00.0
    Temperature
        GPU Current Temp                  : 68 C
        GPU Shutdown Temp                 : 98 C
        GPU Slowdown Temp                 : 95 C
        GPU Max Operating Temp            : 93 C
        GPU Target Temperature            : 83 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A

GPU 00000000:B2:00.0
    Temperature
        GPU Current Temp                  : 63 C
        GPU Shutdown Temp                 : 98 C
        GPU Slowdown Temp                 : 95 C
        GPU Max Operating Temp            : 93 C
        GPU Target Temperature            : 83 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A

The problem of overheating memory is very real, but it only results in the GPU going into some error state; it never results in an XID 79, at least from my observations.
On two consecutive boots, two different GPUs were affected:

Xid (PCI:0000:3e:00): 79
Xid (PCI:0000:1b:00): 79

So it’s not only one GPU that is affected.
Since it’s very easy to do, you should check for peak power issues first by preventing boost with nvidia-smi -lgc 300,1500 on all GPUs. If a GPU still falls off the bus, it’s something different.
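Roughly like this (standard nvidia-smi flags; check them against your driver version):

# without -i this applies to all GPUs at once
$ sudo nvidia-smi -lgc 300,1500

# verify the lock took effect
$ nvidia-smi --query-gpu=index,clocks.sm,clocks.max.sm --format=csv

# optional: enable persistence mode so the setting isn't lost when the driver unloads between jobs
$ sudo nvidia-smi -pm 1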

It seems to work.
After setting ‘nvidia-smi -lgc 300,1500’, the machine has run stably for 20 hours.
It does seem to be a peak power issue.
Thanks a lot!
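For anyone hitting the same error: note that the clock lock does not survive a reboot, so it has to be reapplied after each boot. The result can be checked and the limit later removed roughly like this:

# confirm no new XID events have shown up since applying the clock lock
$ sudo dmesg -T | grep -i xid

# once the power delivery is sorted out, the clock limit can be removed again
$ sudo nvidia-smi -rgc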