Unable to determine the device handle for GPU xxxxxxxx: Unknown Error

OS: Ubuntu 20.04.1 LTS
Driver Version: 515.65.01
GPUs: 4x RTX 3090
Power Supply: 4000W

Hi,
I keep encountering ‘Unable to determine the device handle for GPU 0000:xx:00.0: Unknown Error’, where xx varies, pointing to different GPU cards.

Here’s more information:

$ nvidia-debugdump --list
Found 4 NVIDIA devices
        Device ID:              0
        Device name:            NVIDIA GeForce RTX 3090
        GPU internal ID:        GPU-ffde8868-a687-26b5-6fa1-511ef8a21e93

Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x1): Unknown Error

Additionally, here’s nvidia-bug-report.sh output:
nvidia-bug-report.log.gz (570.4 KB)

I thought it might be caused by outdated drivers, so I updated them, but that didn’t help; the error keeps occurring.

Could somebody help me? Thanks a lot!

You’re getting an XID 79, “GPU has fallen off the bus”. Since it’s always a different GPU, this might be either overheating or a lack of (peak) power.
Please monitor temperatures and limit clocks using nvidia-smi -lgc to check for PSU issues.
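For example, a rough sketch with standard nvidia-smi options (adjust the interval and clock range to your cards):

# log temperature, power draw and SM clock of all GPUs every 5 seconds
$ nvidia-smi --query-gpu=timestamp,index,temperature.gpu,power.draw,clocks.sm --format=csv -l 5 | tee gpu-monitor.csv

# lock the GPU clocks to a conservative range to cap boost (and thus peak power draw)
$ sudo nvidia-smi -lgc 300,1500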

Hi, thanks for your reply.

It’s actually usually GPU 0 that falls off the bus.
If GPU 0 has no workload, then GPU 1 falls off the bus instead.

A lack of (peak) power seems unlikely, because we have another nearly identical server that runs eight 3090s at the same time stably. The only difference between the two servers is that the GPU cards come from different manufacturers and have different cooling designs.

I checked the temperature logs, and the core temperature is around 80 °C, which doesn’t seem too high.
Could it be a memory overheating problem? There are reports on the Internet that some 3090s have memory overheating issues.


==============NVSMI LOG==============

Timestamp                                 : Thu Oct 13 23:57:01 2022
Driver Version                            : 515.65.01
CUDA Version                              : 11.7

Attached GPUs                             : 4
GPU 00000000:1B:00.0
    Temperature
        GPU Current Temp                  : 78 C
        GPU Shutdown Temp                 : 98 C
        GPU Slowdown Temp                 : 95 C
        GPU Max Operating Temp            : 93 C
        GPU Target Temperature            : 83 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A

GPU 00000000:3E:00.0
    Temperature
        GPU Current Temp                  : 69 C
        GPU Shutdown Temp                 : 98 C
        GPU Slowdown Temp                 : 95 C
        GPU Max Operating Temp            : 93 C
        GPU Target Temperature            : 83 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A

GPU 00000000:89:00.0
    Temperature
        GPU Current Temp                  : 68 C
        GPU Shutdown Temp                 : 98 C
        GPU Slowdown Temp                 : 95 C
        GPU Max Operating Temp            : 93 C
        GPU Target Temperature            : 83 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A

GPU 00000000:B2:00.0
    Temperature
        GPU Current Temp                  : 63 C
        GPU Shutdown Temp                 : 98 C
        GPU Slowdown Temp                 : 95 C
        GPU Max Operating Temp            : 93 C
        GPU Target Temperature            : 83 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A

The problem of overheating memory is very real, but it only results in the GPU going into some error state; it never results in an XID 79, at least from my observations.
On two consecutive boots, two different GPUs were affected:

Xid (PCI:0000:3e:00): 79
Xid (PCI:0000:1b:00): 79

So it’s not only one GPU that is affected.
Since it’s very easy to do, you should check for peak power issues first by preventing boost with nvidia-smi -lgc 300,1500 on all GPUs. If a GPU still falls off the bus, it’s something different.
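Roughly like this (standard nvidia-smi flags; check them against your driver version):

# without -i this applies to all GPUs at once
$ sudo nvidia-smi -lgc 300,1500

# verify the lock took effect
$ nvidia-smi --query-gpu=index,clocks.sm,clocks.max.sm --format=csv

# optional: enable persistence mode so the setting isn't lost when the driver unloads between jobs
$ sudo nvidia-smi -pm 1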

It seems to work.
After setting ‘nvidia-smi -lgc 300,1500’, the machine has run stably for 20 hours.
It does seem to be a peak power issue.
Thanks a lot!
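For anyone hitting the same error: note that the clock lock does not survive a reboot, so it has to be reapplied after each boot. The result can be checked and the limit later removed roughly like this:

# confirm no new XID events have shown up since applying the clock lock
$ sudo dmesg -T | grep -i xid

# once the power delivery is sorted out, the clock limit can be removed again
$ sudo nvidia-smi -rgc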