Unable to determine the device handle for GPU 0000:0B:00.0: Unknown Error, while using GPU in docker

OS: Ubuntu 18.04
Driver Version: 510.54
GPUs: 2 GTX 1080

Hello. I was running PyTorch code inside Docker on one of my GPUs (device 0, selected in the Python code) when an error interrupted the process. My docker command is “docker run --runtime=nvidia --shm-size 24G -it --rm -v /ssd2T/HR/Documents/0_opensoure:/workspace mot:v1”, and I run the Python code from the bash shell inside the container. After the error, I ran nvidia-smi, which reported “Unable to determine the device handle for GPU 0000:0B:00.0: Unknown Error”. I hit the same error many times last week. Normally I shut the system down and wait a few minutes before starting it again. (If I reboot directly without waiting, the error may not go away.) After that the GPUs work for a short time, maybe a few hours, until the error comes back.

Here is the output of nvidia-debugdump --list:

Found 2 NVIDIA devices
Device ID: 0
Device name: NVIDIA GeForce GTX 1080 (*PrimaryCard)
GPU internal ID: GPU-97101ecd-fc14-b59b-6805-6a04fb177598

Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x1): Unknown Error

I also saved the nvidia-smi output to gpu.log every minute and have uploaded that file. According to gpu.log, the error occurred at “Sat Mar 19 17:33:14 2022”.
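For anyone who wants to reproduce this kind of per-minute logging, here is a minimal sketch in Python. It is not the exact script used for the attached gpu.log. The function names, the chosen query fields, and the 60-second interval are assumptions for illustration; the `--query-gpu` and `--format` options are standard nvidia-smi flags. Temperature and fan speed are included because they help when checking for overheating.

```python
import subprocess
import time
from datetime import datetime

# Fields to request from nvidia-smi (an illustrative selection).
QUERY_FIELDS = ["index", "name", "temperature.gpu", "fan.speed", "power.draw"]


def build_command(fields=QUERY_FIELDS):
    """Build the nvidia-smi command line for a CSV query of the given fields."""
    return ["nvidia-smi", "--query-gpu=" + ",".join(fields), "--format=csv,noheader"]


def sample(run=subprocess.run):
    """Run nvidia-smi once and return a timestamped log entry.

    A non-zero exit code is recorded instead of raised, so the log also
    captures the moment nvidia-smi starts reporting an error.
    """
    result = run(build_command(), capture_output=True, text=True)
    stamp = datetime.now().strftime("%a %b %d %H:%M:%S %Y")
    # Keep whatever nvidia-smi printed, on stdout or stderr.
    body = result.stdout.strip() or result.stderr.strip()
    return f"{stamp} rc={result.returncode}\n{body}\n"


def log_forever(path="gpu.log", interval_s=60):
    """Append one sample to `path` every `interval_s` seconds until interrupted."""
    while True:
        with open(path, "a") as f:
            f.write(sample())
        time.sleep(interval_s)
```

Calling `log_forever()` on the host (outside the container) keeps logging even if the container is stopped. The file is reopened for every sample so that entries are already on disk if the machine has to be powered off.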

Could you help me check this error? Thanks a lot.
gpu.log (3.4 MB)
nvidia-bug-report.log.gz (1.7 MB)

NVRM: Xid (PCI:0000:0b:00): 79, pid=9043, GPU has fallen off the bus.
Please check for overheating and monitor the temperatures. The fans may be broken or clogged.
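To see how often this happens and on which GPU, it can help to pull every Xid line out of the kernel log. Below is a rough sketch in Python that parses lines in the format quoted above. The regular expression is inferred from that single line, so it is an assumption that other Xid messages follow the same layout, and the function name is made up for this example.

```python
import re

# Pattern inferred from the quoted line:
#   NVRM: Xid (PCI:0000:0b:00): 79, pid=9043, GPU has fallen off the bus.
# The pid field is treated as optional in case some messages omit it.
XID_PATTERN = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+), "
    r"(?:pid=(?P<pid>[^,]+), )?(?P<message>.*)"
)


def find_xid_events(lines):
    """Return one dict per Xid line found in an iterable of log lines."""
    events = []
    for line in lines:
        match = XID_PATTERN.search(line)  # search() tolerates timestamp prefixes
        if match is None:
            continue
        events.append(
            {
                "pci": match.group("pci"),
                "xid": int(match.group("xid")),
                "pid": match.group("pid"),
                "message": match.group("message").strip(),
            }
        )
    return events
```

The function takes any iterable of text lines, for example the saved output of `dmesg` or `journalctl -k`. Comparing the timestamps of the matching lines with the temperatures logged just before them should show whether the failures line up with high temperatures.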

Thanks, I will check the GPU fans.