ax06-nvidia-bug-report.log (7.2 MB)
We have 2 nodes where one of the GeForce RTX 2000s gets this error from nvidia-smi:
nvidia-smi
Unable to determine the device handle for GPU0000:DA:00.0: Unknown Error
I’ve perused the various related threads, such as 1, 2, 3.
nvidia-debugdump --list
ends with:
Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x6): Unknown Error
Draining the failing GPU at least allows the other GPUs to appear in nvidia-smi:
nvidia-smi drain -p GPU0000:DA:00.0 -m 1
Successfully set GPU 00000000:DA:00.0 drain state to: draining.
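In case it helps anyone scripting the same workaround across several nodes, here is a minimal Python sketch of it. It only parses the error text quoted above and builds the drain command in the same form used in this post; it does not run anything. The function names failing_bus_ids and drain_command are hypothetical, and the regex assumes the error line looks exactly like the one shown here, which may differ between driver versions.

```python
import re
from typing import List

# Assumption: the error line has the form quoted in this post, e.g.
# "Unable to determine the device handle for GPU0000:DA:00.0: Unknown Error"
BUS_ID_RE = re.compile(
    r"device handle for GPU\s*"
    r"([0-9A-Fa-f]{4,8}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-9A-Fa-f])"
)


def failing_bus_ids(nvidia_smi_output: str) -> List[str]:
    """Return the PCI bus IDs named in 'Unable to determine the device handle' lines."""
    return BUS_ID_RE.findall(nvidia_smi_output)


def drain_command(bus_id: str) -> List[str]:
    """Build the drain command in the same argument form used in this post."""
    return ["nvidia-smi", "drain", "-p", f"GPU{bus_id}", "-m", "1"]
```

The returned list can be passed to subprocess.run as root once you have checked that the bus ID is the one you expect. Building the command without executing it keeps the sketch safe to try on a healthy node.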
The debug log is attached.
Edit: I also see this error in the logs: unbindLock does not exist
Is that a red herring or helpful? After a reboot the GPU works for a little while, and then the error recurs.