Unable to determine the device handle for GPU0000:DA:00.0: Unknown Error

ax06-nvidia-bug-report.log (7.2 MB)
We have 2 nodes where one of the GeForce RTX 2000s gets this error when running nvidia-smi:

nvidia-smi
Unable to determine the device handle for GPU0000:DA:00.0: Unknown Error

I’ve perused various related threads, such as 1, 2, and 3.

nvidia-debugdump --list ends with:

Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x6): Unknown Error

Draining the failing GPU at least allows the other GPUs to appear in nvidia-smi:

nvidia-smi drain -p GPU0000:DA:00.0 -m 1
Successfully set GPU 00000000:DA:00.0 drain state to: draining.
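
In case it helps anyone applying the same workaround across several nodes, here is a minimal sketch that pulls the PCI bus ID out of the nvidia-smi error line and builds the matching drain command. The helper names (failing_bus_ids, drain_commands) are hypothetical, and I'm assuming the error text always has the form shown above. It emits the bus ID without the "GPU" prefix (domain:bus:device.function); it only prints the command and does not run it.

```python
import re
from typing import List

# Matches the PCI bus ID in nvidia-smi's error line, e.g. "GPU0000:DA:00.0".
# Assumption: the message format is the one quoted in this post.
BUS_ID_RE = re.compile(
    r"Unable to determine the device handle for GPU\s*"
    r"([0-9A-Fa-f]{4,8}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-9A-Fa-f])"
)


def failing_bus_ids(smi_output: str) -> List[str]:
    """Return the PCI bus IDs named in 'Unable to determine the device handle' lines."""
    return BUS_ID_RE.findall(smi_output)


def drain_commands(smi_output: str) -> List[str]:
    """Build one drain command per failing GPU, mirroring the command used above."""
    return [
        f"nvidia-smi drain -p {bus_id} -m 1"
        for bus_id in failing_bus_ids(smi_output)
    ]


if __name__ == "__main__":
    sample = "Unable to determine the device handle for GPU0000:DA:00.0: Unknown Error"
    for cmd in drain_commands(sample):
        print(cmd)  # review the command before running it as root
```

On a node, the captured output of nvidia-smi (stdout and stderr) would be passed in place of the sample string.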

The debug log is attached.

Edit: I also see this error in the logs: "unbindLock does not exist". Is that a red herring, or is it helpful? After a reboot the GPU works for a little while, and then the error recurs.