I started running some CUDA jobs on a machine with 10 * RTX3090. A few hours later, when I checked on progress with the nvidia-smi command, the only output was this error: Unable to determine the device handle for GPU 0000:1E:00.0: GPU is lost. Reboot the system to recover this GPU.
GPUs: 10 * RTX3090
Driver Version: 455.23.05
CUDA Version: 11.1
Max Output Power: 8000 W
nvidia-bug-report.sh log: nvidia-bug-report.log.gz (4.6 MB)
Does anyone know why the GPU was lost?
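One way to gather more data before the next failure is to log per-GPU temperature and power draw at a fixed interval, so the last readings before the GPU drops out are on record. Below is a minimal sketch in Python, not a definitive tool. It assumes nvidia-smi is on the PATH and that it exits with a non-zero status once a GPU is lost (I have not verified that on this driver version). The function names, the 5-second interval, and the gpu_log.csv path are hypothetical choices made for illustration.

```python
import subprocess
import time

# Properties requested via nvidia-smi --query-gpu.
FIELDS = ["index", "pci.bus_id", "temperature.gpu", "power.draw"]


def parse_row(line):
    """Split one CSV row printed by nvidia-smi into a dict keyed by field name."""
    values = [v.strip() for v in line.split(",")]
    return dict(zip(FIELDS, values))


def sample():
    """Query all GPUs once. Raises CalledProcessError if nvidia-smi exits non-zero."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(FIELDS),
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return [parse_row(line) for line in result.stdout.splitlines() if line.strip()]


def main(interval_s=5.0, log_path="gpu_log.csv"):
    """Append one timestamped line per GPU every interval_s seconds (hypothetical defaults)."""
    with open(log_path, "a") as log:
        while True:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            try:
                for row in sample():
                    log.write(",".join([stamp] + [row[k] for k in FIELDS]) + "\n")
            except subprocess.CalledProcessError as err:
                # Assumption: a lost GPU makes nvidia-smi fail; record its message and stop.
                message = (err.stderr or err.stdout or "").strip()
                log.write(stamp + ",nvidia-smi failed," + message + "\n")
                break
            log.flush()  # keep the file current in case the machine hangs
            time.sleep(interval_s)
```

Calling main() in a separate terminal while the CUDA jobs run leaves a log whose last lines show the temperature and power draw of each GPU shortly before the failure. That may help tell a thermal or power-supply problem apart from something else.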