OS: Ubuntu 18.04.4 LTS
Driver Version: 515.57
GPUs: 3 x RTX8000
Over the last week, while I was using my machine for deep learning experiments, the GPUs crashed frequently, even though the temperature stayed below 81 degrees Celsius during training. When I then run nvidia-smi, it reports the error "Unable to determine the device handle for GPU 0000:19:00.0: Unknown Error". This is the output of nvidia-debugdump --list:
Found 3 NVIDIA devices
Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x0): Unknown Error
The full bug report is attached:
nvidia-bug-report.log.gz (312.0 KB)
I have no idea how to solve this problem. Can somebody help me? Thanks a lot!