OS: Ubuntu 20.04
Driver Version: 470.82.00
GPUs: 2 x RTX3090
When I use my new machine for deep learning experiments, the GPUs often crash. When I then run nvidia-smi, it reports the error "Unable to determine the device handle for GPU 0000:01:00.0: Unknown Error". This is the output of nvidia-debugdump --list:
Found 2 NVIDIA devices
Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x0): Unknown Error
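In case it helps anyone catch the moment a GPU drops out, here is a minimal sketch (not something taken from my actual setup) that scans nvidia-smi output for the handle error quoted above. The names find_lost_gpus and check_nvidia_smi are made up for illustration, and it assumes nvidia-smi is on the PATH.

```python
import re
import subprocess

# Pattern for the error text quoted in this post; the exact wording may
# differ between driver versions, so treat this as an assumption.
HANDLE_ERROR = re.compile(
    r"Unable to determine the device handle for GPU\s*([0-9A-Fa-f:.]+): (.+)"
)


def find_lost_gpus(text):
    """Return (pci_bus_id, reason) pairs for each handle error found in text."""
    return [(m.group(1), m.group(2).strip()) for m in HANDLE_ERROR.finditer(text)]


def check_nvidia_smi():
    """Run nvidia-smi once and return any handle errors it reports."""
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    # The error may land on stdout or stderr, so scan both.
    return find_lost_gpus(result.stdout + result.stderr)
```

Calling check_nvidia_smi() periodically during a training run and logging the timestamp of the first non-empty result could help narrow down when the failure happens.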
The full bug report is attached:
nvidia-bug-report.log.gz (555.3 KB)
I have no idea how to solve this problem. Can somebody help me? Thanks a lot!