Unable to determine the device handle for GPU0000:18:00.0: Unknown Error

jmlindegger · May 27, 2023, 9:12am

Hi

We have an Ubuntu 20.04.4 LTS server with an A100-40GB and an A100-80GB. We recently added the A100-80GB, and since then the system no longer runs reliably.

After a reboot, everything works fine and both GPUs show up in nvidia-smi. After some time, however, anything GPU-related fails. For example, running nvidia-smi produces “Unable to determine the device handle for GPU0000:18:00.0: Unknown Error”.

Running dmesg | grep GPU gives the following:
[ 7.925206] [drm] [nvidia-drm] [GPU ID 0x00001800] Loading driver
[ 7.925371] [drm] [nvidia-drm] [GPU ID 0x0000af00] Loading driver
[ 2733.970482] NVRM: GPU at PCI:0000:18:00: GPU-658f30fe-173b-5217-b11b-fd2265868f92
[ 2733.970511] NVRM: GPU Board Serial Number: 1655222023683
[ 2733.970516] NVRM: Xid (PCI:0000:18:00): 79, pid='', name=, GPU has fallen off the bus.
[ 2733.970522] NVRM: GPU 0000:18:00.0: GPU has fallen off the bus.
[ 2733.970526] NVRM: GPU 0000:18:00.0: GPU serial number is 1655222023683.
[ 2733.970543] NVRM: A GPU crash dump has been created. If possible, please run
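
In case it helps with triage, here is a minimal Python sketch that pulls Xid events out of dmesg output like the lines above. The regular expression and the function name find_xid_events are my own assumptions based on the log format shown here, not an NVIDIA tool, and other driver versions may format these lines differently.

```python
import re

# Assumed pattern, derived only from the NVRM line quoted in this post.
# Captures the PCI address and the numeric Xid code.
XID_RE = re.compile(r"NVRM: Xid \(PCI:(?P<bus>[0-9a-fA-F:.]+)\): (?P<code>\d+),")


def find_xid_events(dmesg_text):
    """Return a list of (pci_address, xid_code) tuples found in dmesg text."""
    events = []
    for line in dmesg_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append((match.group("bus"), int(match.group("code"))))
    return events


# Sample line taken from the dmesg output in this post.
sample = (
    "[ 2733.970516] NVRM: Xid (PCI:0000:18:00): 79, pid='', name=, "
    "GPU has fallen off the bus."
)
print(find_xid_events(sample))
```

Feeding it the full output of dmesg would list every Xid event with the PCI address it was reported for, which makes it easier to see whether only the GPU at 18:00 is affected.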

Following the suggestions in other posts, we tried re-seating the GPU, but we keep running into the same issue. I’m attaching the output of nvidia-debugdump after re-seating. Any help is greatly appreciated.

nvidia-bug-report-afterreseating.log.gz (208.1 KB)
