Unable to determine the device handle for GPU 0000:89:00.0: Unknown Error

I was using a GPU in Docker when it suddenly crashed. My Docker container became unhealthy, and stopping it has no effect.
Please help me.

Unable to determine the device handle for GPU 0000:89:00.0: Unknown Error
root@agent-192-168-1-65:~# lspci | grep NV | grep VGA
1a:00.0 VGA compatible controller: NVIDIA Corporation Device 2204 (rev a1)
1b:00.0 VGA compatible controller: NVIDIA Corporation Device 2204 (rev a1)
3d:00.0 VGA compatible controller: NVIDIA Corporation Device 2204 (rev a1)
3e:00.0 VGA compatible controller: NVIDIA Corporation Device 2204 (rev a1)
88:00.0 VGA compatible controller: NVIDIA Corporation Device 2204 (rev a1)
89:00.0 VGA compatible controller: NVIDIA Corporation Device 2204 (rev ff)
b1:00.0 VGA compatible controller: NVIDIA Corporation Device 2204 (rev a1)
b2:00.0 VGA compatible controller: NVIDIA Corporation Device 2204 (rev a1)
root@agent-192-168-1-65:~# nvidia-smi -i 0000:89:00.0 -pm 0
Unable to determine the device handle for GPU 0000:89:00.0: Unknown Error
root@agent-192-168-1-65:~# nvidia-smi drain -p 0000:89:00.0 -m 1
Successfully set GPU 00000000:89:00.0 drain state to: draining.
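
In the lspci listing above, 89:00.0 is the only device reporting (rev ff) instead of (rev a1). That usually means the device no longer answers PCI configuration reads, which fits a GPU that has dropped off the bus. A minimal sketch for spotting such devices, assuming the default lspci output format shown above (the function name dead_pci_devices is hypothetical):

```shell
# Read lspci output on stdin and print the PCI address (first field)
# of every line that reports "(rev ff)".
dead_pci_devices() {
    awk '/\(rev ff\)/ { print $1 }'
}
```

It could be used as `lspci | grep NV | dead_pci_devices`; an empty result would mean no device currently shows rev ff.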

nvidia-bug-report.log.gz (3.7 MB)

I found Xid 79 in nvidia-bug-report.log:

[2406993.511530] NVRM: GPU at PCI:0000:89:00: GPU-e2b5c237-4750-e614-2c55-cb411e38d637
[2406993.511533] NVRM: Xid (PCI:0000:89:00): 79, pid=2051, GPU has fallen off the bus.
[2406993.511535] NVRM: GPU 0000:89:00.0: GPU has fallen off the bus.
[2406993.511560] NVRM: Xid (PCI:0000:89:00): 79, pid=2054, GPU has fallen off the bus.
[2406993.511562] NVRM: GPU 0000:89:00.0: GPU has fallen off the bus.
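
To see whether the same GPU keeps raising this Xid, the kernel log can be summarized per PCI address. This is only a sketch that assumes the NVRM line format shown above; the function name xid_events is hypothetical.

```shell
# Read kernel log lines on stdin and print "<pci-address> <xid-code>"
# for every line of the form "NVRM: Xid (PCI:<address>): <code>, ...".
xid_events() {
    sed -n 's/.*NVRM: Xid (PCI:\([0-9a-fA-F:.]*\)): \([0-9]*\),.*/\1 \2/p'
}
```

For example, `dmesg | xid_events | sort | uniq -c` would count events per GPU and Xid code (reading dmesg may require root).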

This might be caused by overheating or insufficient power. Please check the airflow and monitor temperatures, and check or swap the power cables. Since it’s always the same GPU, the card itself might also be faulty.
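
One way to monitor temperatures is to poll nvidia-smi and flag hot cards. The sketch below assumes CSV rows of the form "bus id, temperature, power draw" without header or units; the function name hot_gpus and the 85 °C limit in the usage line are illustrative assumptions, not NVIDIA recommendations.

```shell
# Read CSV rows "bus_id, temperature, power" on stdin and print the rows
# whose temperature (second field) is at or above the limit given as $1.
hot_gpus() {
    awk -F',' -v limit="$1" '$2 + 0 >= limit + 0 { print }'
}
```

It could be fed with `nvidia-smi --query-gpu=pci.bus_id,temperature.gpu,power.draw --format=csv,noheader,nounits | hot_gpus 85`. A GPU that has already fallen off the bus will likely report errors instead of values, so this is mainly useful for watching the card under load after a reboot.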