OS: Ubuntu 18.04
Driver Version: 510.54
GPUs: 2 GTX 1080
Hello, I was running PyTorch code in Docker on one GPU (device 0, specified in my Python code) when an error interrupted the process. My docker command is “docker run --runtime=nvidia --shm-size 24G -it --rm -v /ssd2T/HR/Documents/0_opensoure:/workspace mot:v1”, and I run my Python code from the bash shell inside the container. After the error occurred, I ran nvidia-smi, which reported “Unable to determine the device handle for GPU 0000:0B:00.0: Unknown Error”. I hit the same error many times last week. Usually I shut the system down and wait a few minutes before powering it back on. (If I reboot directly without waiting, the error may not go away.) After that, my GPUs work normally for a while, perhaps a few hours, until the error occurs again.
This is the output of nvidia-debugdump --list:
Found 2 NVIDIA devices
Device ID: 0
Device name: NVIDIA GeForce GTX 1080 (*PrimaryCard)
GPU internal ID: GPU-97101ecd-fc14-b59b-6805-6a04fb177598
Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x1): Unknown Error
I also saved the nvidia-smi output to gpu.log every minute and have uploaded that file. According to gpu.log, the error occurred at “Sat Mar 19 17:33:14 2022”.
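For anyone who wants to collect a similar per-minute log, here is a minimal Python sketch. It is not necessarily the script used to produce the uploaded gpu.log. The function names `snapshot` and `log_forever`, the header format, and the 60-second interval are assumptions chosen for illustration.

```python
import subprocess
import time
from datetime import datetime


def snapshot(cmd=("nvidia-smi",), timeout=30):
    """Run cmd once and return a timestamped text block with its output.

    Failures (missing binary, hang) are recorded instead of raised, so the
    log keeps growing even after the GPU stops responding.
    """
    stamp = datetime.now().strftime("%a %b %d %H:%M:%S %Y")
    try:
        result = subprocess.run(
            list(cmd), capture_output=True, text=True, timeout=timeout
        )
        # Keep stderr too: error messages may be printed there.
        body = result.stdout + result.stderr
        status = f"exit code {result.returncode}"
    except FileNotFoundError:
        body, status = "", "command not found"
    except subprocess.TimeoutExpired:
        body, status = "", f"timed out after {timeout}s"
    return f"==== {stamp} ({status}) ====\n{body}\n"


def log_forever(path="gpu.log", interval=60):
    """Append one snapshot to path every interval seconds (hypothetical helper)."""
    while True:
        # Reopen the file each time so every entry is flushed to disk
        # before a possible system hang or forced shutdown.
        with open(path, "a") as f:
            f.write(snapshot())
        time.sleep(interval)
```

Calling `log_forever()` on the host (outside the container) and leaving it running appends one timestamped entry per minute, so the first entry that shows the error gives an approximate failure time.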