Unable to determine the device handle for GPU 0000:01:00.0: Unknown Error

OS: Ubuntu 20.04
Driver Version: 470.82.00
GPUs: 2 x RTX3090

When I use my new machine for deep learning experiments, the GPUs often crash. Running nvidia-smi afterwards fails with the error "Unable to determine the device handle for GPU 0000:01:00.0: Unknown Error". This is the output of nvidia-debugdump --list:

Found 2 NVIDIA devices
Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x0): Unknown Error

Here is the full bug report:
nvidia-bug-report.log.gz (555.3 KB)

I have no idea how to solve this problem. Can somebody help me? Thanks a lot!

[ 2385.777236] NVRM: Xid (PCI:0000:01:00): 79, pid=1420, GPU has fallen off the bus.
[ 2385.777238] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
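
The two kernel log lines above are the relevant part of the bug report: an Xid 79 event for the device at PCI address 0000:01:00. To check how often this happens, the kernel log can be scanned for Xid lines. The following is a minimal sketch, assuming the log lines have the same layout as the ones quoted here; the function name parse_xid_lines is made up for illustration.

```python
import re

# Matches kernel log lines shaped like the one quoted in this thread:
# [ 2385.777236] NVRM: Xid (PCI:0000:01:00): 79, pid=1420, GPU has fallen off the bus.
XID_PATTERN = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): "
    r"(?P<xid>\d+),\s*(?P<detail>.*)"
)


def parse_xid_lines(log_text):
    """Return one dict per Xid event found in the given kernel log text."""
    events = []
    for line in log_text.splitlines():
        match = XID_PATTERN.search(line)
        if match is None:
            continue  # not an Xid line (e.g. the follow-up "NVRM: GPU ..." line)
        events.append({
            "uptime_s": float(match.group("time")),
            "pci": match.group("pci"),
            "xid": int(match.group("xid")),
            "detail": match.group("detail").strip(),
        })
    return events


# Demo on the two lines from this thread; only the first one is an Xid event.
sample_log = (
    "[ 2385.777236] NVRM: Xid (PCI:0000:01:00): 79, pid=1420, GPU has fallen off the bus.\n"
    "[ 2385.777238] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.\n"
)
for event in parse_xid_lines(sample_log):
    print(event)
```

To use it on a live system, pass the text of the kernel log (for example the output of dmesg) to parse_xid_lines instead of the sample string.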

Please monitor the GPU temperatures to rule out overheating. Also try limiting the clocks with nvidia-smi -lgc to check whether the PSU has problems when the GPUs boost.
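
Both suggestions can be scripted. The sketch below is only an illustration: it parses the CSV output of nvidia-smi --query-gpu for temperature, power draw and SM clock, and builds (but does not run) an nvidia-smi -lgc command. The 85 C limit and the 210,1400 MHz clock range are placeholder assumptions, not recommended values for the RTX 3090, and all function names are made up for this example.

```python
import subprocess

# Fields requested from nvidia-smi; the output has one CSV row per GPU.
QUERY_FIELDS = "index,temperature.gpu,power.draw,clocks.sm"


def _to_float(text):
    """Convert a CSV field to float, or None for values such as "[N/A]"."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_gpu_csv(csv_text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    readings = []
    for line in csv_text.strip().splitlines():
        index, temp, power, sm_clock = [field.strip() for field in line.split(",")]
        readings.append({
            "index": int(index),
            "temp_c": _to_float(temp),
            "power_w": _to_float(power),
            "sm_clock_mhz": _to_float(sm_clock),
        })
    return readings


def read_gpus():
    """Query all GPUs once; raises CalledProcessError if nvidia-smi fails."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)


def hot_gpus(readings, limit_c=85.0):
    """Return indices of GPUs at or above limit_c (placeholder threshold)."""
    return [r["index"] for r in readings
            if r["temp_c"] is not None and r["temp_c"] >= limit_c]


def lock_clock_command(gpu_index, min_mhz, max_mhz):
    """Build the nvidia-smi command that locks the GPU clock range (needs root)."""
    return ["nvidia-smi", "-i", str(gpu_index), "-lgc", f"{min_mhz},{max_mhz}"]
```

Calling read_gpus() in a loop with time.sleep between calls while a training job runs gives a temperature and power log leading up to a crash. The list returned by lock_clock_command(0, 210, 1400) can be passed to subprocess.run as root; nvidia-smi -rgc resets the clocks afterwards. If the crashes stop with the clocks capped, that points toward the power supply rather than the card.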