Kernel: NVRM: Xid: 74, name=<unknown>, NVLink: fatal error detected on link; rmmod: ERROR: could not remove 'nvidia_uvm': Resource temporarily unavail

We have 8 RTX A6000 GPUs in a Supermicro chassis running RHEL 9, and one of the GPUs clearly has an issue.
RuntimeError: ProcessGroupNCCL is only supported with GPUs no GPUs found

kernel: NVRM: Xid (PCI:0000:c1:00): 74, pid='<unknown>', name=<unknown>, NVLink: fatal error detected on link 1(0x10000000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)

According to this page, Xid 74 indicates an NVLink error.
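To make it easier to spot which device is reporting, here is a minimal sketch that pulls the PCI address and the Xid number out of an NVRM kernel log line like the one above. The function name parse_xid is just an illustrative helper, not part of any NVIDIA tool, and the pattern assumes the log format shown in this post.

```shell
# Hypothetical helper: print "<pci-address> <xid>" for each NVRM Xid line on stdin.
# Assumes the "NVRM: Xid (PCI:<addr>): <number>," format seen in this post.
parse_xid() {
    sed -n -E 's/.*NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): ([0-9]+),.*/\1 \2/p'
}

# The log line quoted above, used as sample input.
line="kernel: NVRM: Xid (PCI:0000:c1:00): 74, pid='<unknown>', name=<unknown>, NVLink: fatal error detected on link 1(0x10000000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)"
printf '%s\n' "$line" | parse_xid   # prints: 0000:c1:00 74
```

On a live system the same function could presumably be fed from the kernel log, for example `journalctl -k | parse_xid`.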

Here’s another error:

Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x7): Unknown Error

nvidia-bug-report.log.gz (633.2 KB)

I’ve tried everything I can think of, including stopping the cuda-driver, nvidia-persistenced and dcgm services, but I still can’t unload the last 2 modules, and I’m trying to avoid a reboot.

rmmod -f -v nvidia_uvm
rmmod: ERROR: could not remove 'nvidia_uvm': Resource temporarily unavailable
rmmod: ERROR: could not remove module nvidia_uvm: Resource temporarily unavailable

lsof /dev/nvidia*

Is there a way to kill the processes that still hold these device files open?
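As an alternative to lsof, here is a rough sketch that walks /proc to list the PIDs holding any file under a given path prefix open. The name holders_of is a hypothetical helper. It assumes a Linux /proc filesystem and needs to run as root to see other users' processes.

```shell
# Hypothetical helper: list PIDs that have an open file whose path starts with $1.
# Reads the /proc/<pid>/fd symlinks; entries that cannot be read are skipped.
holders_of() {
    prefix=$1
    for fd in /proc/[0-9]*/fd/*; do
        target=$(readlink "$fd" 2>/dev/null) || continue
        case $target in
            "$prefix"*)
                pid=${fd#/proc/}      # strip the leading "/proc/"
                echo "${pid%%/*}"     # keep only the PID component
                ;;
        esac
    done | sort -un
}

# Covers /dev/nvidia0, /dev/nvidiactl, /dev/nvidia-uvm and so on.
holders_of /dev/nvidia
```

Each PID it prints could then be checked with `ps -p <pid> -o pid,comm,args` before deciding whether to kill it.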

lsmod | grep nvidia
nvidia_uvm           1552384  8
nvidia              56696832  747 nvidia_uvm,nvidia_peermem,nvidia_modeset
May 10 16:15:05 server cuda-driver[3392144]: modprobe: FATAL: Module nvidia_uvm is in use.
May 10 16:15:05 server cuda-driver[3392145]: modprobe: FATAL: Module nvidia is in use.
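To read the lsmod output above more easily, here is a small sketch that prints each nvidia module's use count and the dependent modules listed for it. The name module_users is a hypothetical helper, and it assumes the usual lsmod column layout (module, size, use count, users). As far as I understand, a non-zero count with no modules listed means the references come from something other than a kernel module, such as open device files.

```shell
# Hypothetical helper: summarize nvidia* rows from lsmod-style input on stdin.
module_users() {
    awk '$1 ~ /^nvidia/ {
        users = ($4 == "") ? "(no dependent modules listed)" : $4
        printf "%s refcount=%s users=%s\n", $1, $3, users
    }'
}

# Sample input copied from the lsmod output in this post.
module_users <<'EOF'
nvidia_uvm           1552384  8
nvidia              56696832  747 nvidia_uvm,nvidia_peermem,nvidia_modeset
EOF
```

On the affected server this would be run as `lsmod | module_users`.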

Attached is the bug report.