Unable to determine the device handle for GPU 0000:02:00.0: Unknown Error

Hello,

I get an error message when I run nvidia-smi on a machine with two NVIDIA GeForce RTX 2080 Ti GPUs:

  • Unable to determine the device handle for GPU 0000:0A:00.0: Unknown Error

I also get the following error when I try to start a container with Docker:

  • docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #1:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: detection error: nvml error: unknown error: unknown.

I rebooted and restarted training a deep learning model, but after two epochs the same error came back. The log file from nvidia-bug-report is attached.

Thank you in advance for your help!
nvidia-bug-report.log (1).gz (313.1 KB)

Xid (PCI:0000:0a:00): 79, pid=423231, GPU has fallen off the bus.
This typically points to overheating or an insufficient power supply. Please check the GPU temperatures and the PSU.
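
To check this, it may help to log temperature and power draw while training and to scan the kernel log for further Xid events. The following is only a minimal sketch: the helper names (find_xid_events, parse_gpu_sample, sample_gpus) are made up for illustration, and it assumes nvidia-smi is on the PATH and that the driver reports the queried fields for your cards.

```python
import re
import subprocess

# Matches NVIDIA driver Xid lines such as the one quoted above.
XID_RE = re.compile(r"Xid \(PCI:(?P<bus>[0-9a-fA-F:.]+)\): (?P<code>\d+),")

# Fields requested from nvidia-smi, in this order.
QUERY_FIELDS = "index,temperature.gpu,power.draw,power.limit"


def find_xid_events(log_text):
    """Return (pci_bus_id, xid_code) pairs found in kernel log text."""
    return [(m.group("bus"), int(m.group("code")))
            for m in XID_RE.finditer(log_text)]


def _to_float(field):
    """Convert a CSV field to float; None if the driver reports no value."""
    try:
        return float(field)
    except ValueError:
        return None  # e.g. "[N/A]"


def parse_gpu_sample(csv_line):
    """Parse one line of nvidia-smi CSV output (noheader, nounits)."""
    index, temp, draw, limit = [f.strip() for f in csv_line.split(",")]
    return {
        "index": int(index),
        "temp_c": _to_float(temp),
        "power_w": _to_float(draw),
        "limit_w": _to_float(limit),
    }


def sample_gpus():
    """Query all GPUs once.

    Raises subprocess.CalledProcessError if nvidia-smi exits with an error,
    which is expected once a GPU has fallen off the bus.
    """
    result = subprocess.run(
        ["nvidia-smi",
         f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return [parse_gpu_sample(line)
            for line in result.stdout.splitlines() if line.strip()]
```

Calling sample_gpus() every few seconds during training and writing the results to a file should show whether temperature or power draw climbs just before the failure. Passing the output of dmesg to find_xid_events() lists which GPU reported which Xid code.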