Unable to determine the device handle for GPU 0000:02:00.0: Unknown Error

When I ran nvidia-smi, it returned: Unable to determine the device handle for GPU 0000:02:00.0: Unknown Error

I then ran nvidia-debugdump --list. Here is the result:

Found 2 NVIDIA devices
        Device ID:              0
        Device name:            NVIDIA TITAN X (Pascal)   (*PrimaryCard)
        GPU internal ID:        0324416077500

Detailed info from the bug report:
nvidia-bug-report.log (2.2 MB)

I don’t know how to approach this problem, so I am asking for help.

OS: Linux version 4.15.0-142-generic
GPU: 2*NVIDIA TITAN X

nvidia-bug-report.log (3.5 MB)
Hello, I have the same problem. My NVIDIA A2000 is not working with what I believe is the latest driver (520.56.06). I am running Linux kernel 5.15.0-53 with generic headers…

On my side, nvidia-debugdump --list prints the following:

~$ nvidia-debugdump --list
Found 1 NVIDIA devices
Error: nvmlDeviceGetHandleByIndex(): Not Found
FAILED to get details on GPU (0x0): Not Found

Also, here is my nvidia-smi output:

~$ nvidia-smi -L
Unable to determine the device handle for gpu 0000:01:00.0: Not Found

Hello again, I kept searching the forums and I think you can check this: Nvidia-smi outputs “No devices were found” on Ubuntu 22.04 + driver 520 - #2 by generix

On my side, I switched the driver to a non-“open kernel” version and restarted my machine. nvidia-smi works again and I can use tools such as gpustat.
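
In case it is useful, here is a rough sketch for checking which kernel module flavor is currently loaded before and after switching. It assumes the driver exposes its version banner at /proc/driver/nvidia/version and that open builds contain the phrase "Open Kernel Module" in that banner; both are my assumptions, not something confirmed in this thread, and module_flavor is a hypothetical helper name.

```python
from pathlib import Path

# Assumption: the loaded NVIDIA kernel module publishes its banner here.
VERSION_FILE = Path("/proc/driver/nvidia/version")


def module_flavor(banner: str) -> str:
    """Classify a version banner as 'open', 'proprietary', or 'unknown'."""
    # Assumption: open builds mention "Open Kernel Module" in the banner.
    if "Open Kernel Module" in banner:
        return "open"
    # Any other banner that still names a kernel module is treated as proprietary.
    if "Kernel Module" in banner:
        return "proprietary"
    return "unknown"


if __name__ == "__main__":
    if VERSION_FILE.exists():
        print(module_flavor(VERSION_FILE.read_text()))
    else:
        print("nvidia kernel module does not appear to be loaded")
```

If this prints "open" on a card that nvidia-smi cannot see, trying the non-open driver package as described above may be worth a shot.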

Hope this helps!

NVRM: Xid (PCI:0000:02:00): 79, pid=1160, GPU has fallen off the bus.

[15028104.848929] pcieport 0000:00:02.0: AER: Multiple Corrected error received: id=0010
[15028104.848952] pcieport 0000:00:02.0: can't find device of ID0010
[15028104.848955] pcieport 0000:00:02.0: AER: Corrected error received: id=0010
[15028104.848961] pcieport 0000:00:02.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0010(Receiver ID)
[15028104.848966] pcieport 0000:00:02.0:   device [8086:2f04] error status/mask=00000040/00002000
[15028104.848972] pcieport 0000:00:02.0:    [ 6] Bad TLP       

Please reboot. If the GPU still doesn’t show up, it’s probably broken; check whether it works in another system.
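
After rebooting, one way to see whether the GPU dropped off the bus again is to scan the kernel log for NVRM Xid lines like the one quoted earlier in the thread. The sketch below is only a minimal parser: the regular expression is modelled on that single quoted line, and the names XidEvent and parse_xid are made up for illustration. Other Xid messages may be formatted differently.

```python
import re
from typing import NamedTuple, Optional

# Pattern modelled on the one NVRM Xid line quoted in this thread.
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+),\s*(?P<rest>.*)"
)


class XidEvent(NamedTuple):
    pci_address: str  # e.g. "0000:02:00"
    xid: int          # numeric Xid code, e.g. 79
    detail: str       # remainder of the message


def parse_xid(line: str) -> Optional[XidEvent]:
    """Return the Xid event found in a log line, or None if there is none."""
    match = XID_RE.search(line)
    if match is None:
        return None
    return XidEvent(
        match.group("pci"),
        int(match.group("xid")),
        match.group("rest").strip(),
    )


if __name__ == "__main__":
    sample = "NVRM: Xid (PCI:0000:02:00): 79, pid=1160, GPU has fallen off the bus."
    print(parse_xid(sample))
```

Feeding it the lines of a saved kernel log, one at a time, and keeping the non-None results gives a quick list of which PCI address reported which Xid.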