Unable to determine the device handle for GPU0000:18:00.0: Unknown Error

Hi

We have an Ubuntu 20.04.4 LTS server with an A100-40GB and an A100-80GB. We recently added the A100-80GB, and since then the system no longer runs reliably.

After a reboot, everything works fine, and both GPUs are shown in nvidia-smi. However, after some time (about 45 minutes after boot in the log below), anything GPU-related fails. For example, running nvidia-smi produces “Unable to determine the device handle for GPU0000:18:00.0: Unknown Error”.

Running dmesg | grep GPU gives the following:
[ 7.925206] [drm] [nvidia-drm] [GPU ID 0x00001800] Loading driver
[ 7.925371] [drm] [nvidia-drm] [GPU ID 0x0000af00] Loading driver
[ 2733.970482] NVRM: GPU at PCI:0000:18:00: GPU-658f30fe-173b-5217-b11b-fd2265868f92
[ 2733.970511] NVRM: GPU Board Serial Number: 1655222023683
[ 2733.970516] NVRM: Xid (PCI:0000:18:00): 79, pid=‘’, name=, GPU has fallen off the bus.
[ 2733.970522] NVRM: GPU 0000:18:00.0: GPU has fallen off the bus.
[ 2733.970526] NVRM: GPU 0000:18:00.0: GPU serial number is 1655222023683.
[ 2733.970543] NVRM: A GPU crash dump has been created. If possible, please run
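
For anyone who wants to check their own logs for the same pattern, here is a minimal sketch of how the Xid events can be pulled out of dmesg text. The names `XID_RE` and `parse_xid_events` are our own, and the regular expression assumes the NVRM line format shown above; other driver versions may print it differently.

```python
import re

# Assumed format, taken from the dmesg output above:
# [ 2733.970516] NVRM: Xid (PCI:0000:18:00): 79, pid=..., GPU has fallen off the bus.
XID_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+),"
)


def parse_xid_events(log_text):
    """Return (seconds_since_boot, pci_address, xid_code) for each Xid line."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append(
                (float(match.group("ts")), match.group("pci"), int(match.group("xid")))
            )
    return events


# Two lines copied from the output above; only the first is an Xid report.
sample = (
    "[ 2733.970516] NVRM: Xid (PCI:0000:18:00): 79, pid=‘’, name=, GPU has fallen off the bus.\n"
    "[ 2733.970522] NVRM: GPU 0000:18:00.0: GPU has fallen off the bus.\n"
)

for ts, pci, xid in parse_xid_events(sample):
    print(f"{ts:.2f}s after boot: PCI {pci} reported Xid {xid}")
```

To use it on a live system, the full dmesg output can be saved to a file and its contents passed to `parse_xid_events` in place of `sample`.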

Based on this, we followed the suggestions in other posts and tried reseating the GPU, but we keep running into the same issue. I’m attaching the nvidia-bug-report log collected after reseating. Any help is greatly appreciated.

nvidia-bug-report-afterreseating.log.gz (208.1 KB)