OS: Ubuntu 20.04
Driver Version: 470.82.00
GPUs: 2 x RTX3090
When I run deep learning experiments on my new machine, the GPUs often crash. After a crash, running nvidia-smi gives the error: Unable to determine the device handle for GPU 0000:01:00.0: Unknown Error. This is the output of nvidia-debugdump --list:
Found 2 NVIDIA devices
Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x0): Unknown Error
And these are the matching lines from the kernel log:
[ 2385.777236] NVRM: Xid (PCI:0000:01:00): 79, pid=1420, GPU has fallen off the bus.
[ 2385.777238] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
Please monitor the GPU temperature to rule out overheating, and try limiting the clocks with nvidia-smi -lgc to check for PSU issues under GPU boost.
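
The monitoring part of that advice could be scripted roughly as follows. This is only a sketch: it polls nvidia-smi once per second using its --query-gpu CSV output and flags any GPU at or above a temperature limit. The helper names (parse_gpu_csv, hot_gpus, sample) and the 85 C limit are my own assumptions, not anything from the driver or from NVIDIA; adjust the limit to your cards.

```python
import subprocess
import time

# Fields requested from nvidia-smi; all four are standard --query-gpu properties.
QUERY_FIELDS = "index,temperature.gpu,power.draw,clocks.sm"


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into a list of dicts.

    Lines that do not have four numeric fields (for example the
    "Unable to determine the device handle" error) are skipped.
    """
    rows = []
    for line in text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 4:
            continue
        try:
            rows.append({
                "index": int(parts[0]),
                "temp_c": float(parts[1]),
                "power_w": float(parts[2]),
                "sm_mhz": float(parts[3]),
            })
        except ValueError:
            continue  # e.g. a field reported as [N/A]
    return rows


def hot_gpus(rows, limit_c=85.0):
    """Return indices of GPUs at or above limit_c (85 C is an assumed limit)."""
    return [r["index"] for r in rows if r["temp_c"] >= limit_c]


def sample():
    """Run nvidia-smi once and return the parsed readings (empty on failure)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    # Leave this running in a second terminal while the training job runs,
    # so the last readings before the crash are visible.
    while True:
        rows = sample()
        print(time.strftime("%H:%M:%S"), rows, "hot:", hot_gpus(rows))
        time.sleep(1)
```

If the temperatures look normal right up to the crash, the clock-limiting test is the next step: nvidia-smi -lgc takes a min,max pair in MHz (it needs root), and nvidia-smi -rgc resets the lock afterwards. Which clock range to use is something you would have to pick for your cards; if the crashes stop with a lower maximum clock, that points toward the power supply rather than heat.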