Unable to determine the device handle for GPU 0000:01:00.0: Unknown Error

OS: Ubuntu 20.04
Driver Version: 470.82.00
GPUs: 2 x RTX3090

When I use my new machine for deep learning experiments, the GPUs often crash. Afterwards, running nvidia-smi reports the error "Unable to determine the device handle for GPU 0000:01:00.0: Unknown Error". This is the output of nvidia-debugdump --list:

Found 2 NVIDIA devices
Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x0): Unknown Error

The full bug report is attached:
nvidia-bug-report.log.gz (555.3 KB)

I have no idea how to solve the problem. Can somebody help me? Thanks a lot!

[ 2385.777236] NVRM: Xid (PCI:0000:01:00): 79, pid=1420, GPU has fallen off the bus.
[ 2385.777238] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
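To spot lines like the two above without reading the whole bug report, a small script can scan the kernel log for NVRM Xid messages. This is only a minimal sketch: the regular expression is written against the single line format quoted here, and the name find_xid_events is hypothetical, not part of any NVIDIA tool.

```python
import re

# Written against lines such as (format taken from the log quoted in this thread):
# [ 2385.777236] NVRM: Xid (PCI:0000:01:00): 79, pid=1420, GPU has fallen off the bus.
XID_PATTERN = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+),\s*(.*)")


def find_xid_events(log_text):
    """Return a list of (pci_address, xid_code, detail) for each Xid line found."""
    events = []
    for line in log_text.splitlines():
        match = XID_PATTERN.search(line)
        if match:
            pci_address = match.group(1)
            xid_code = int(match.group(2))
            detail = match.group(3).strip()
            events.append((pci_address, xid_code, detail))
    return events
```

To use it, pass in the text of the kernel log, for example the output of dmesg (which may need root on some systems) or the contents of the extracted nvidia-bug-report.log. For the log above it reports Xid 79 on PCI 0000:01:00.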

Please monitor the temperature to rule out overheating. Also try limiting the clocks with nvidia-smi -lgc to check whether the PSU has trouble when the GPU boosts.
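One way to follow this advice is to log temperature, power draw and SM clock while a training job runs, then repeat the run with the clocks capped. The sketch below only builds the nvidia-smi command lines and parses the CSV output. The function names are hypothetical, and the clock values you pass to lock_clocks_cmd are your own choice, not a recommendation from this thread.

```python
import subprocess

# Fields requested from nvidia-smi's --query-gpu interface.
QUERY_FIELDS = "index,temperature.gpu,power.draw,clocks.sm"


def query_cmd():
    """Command that prints one CSV line per GPU, without header or units."""
    return ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"]


def lock_clocks_cmd(gpu_index, min_mhz, max_mhz):
    """Command that locks the GPU clock range (usually needs root)."""
    return ["nvidia-smi", "-i", str(gpu_index), "-lgc", f"{min_mhz},{max_mhz}"]


def reset_clocks_cmd(gpu_index):
    """Command that undoes a previous -lgc lock."""
    return ["nvidia-smi", "-i", str(gpu_index), "-rgc"]


def _to_float(text):
    # nvidia-smi prints placeholders such as [N/A] for unsupported fields.
    try:
        return float(text)
    except ValueError:
        return None


def parse_readings(csv_text):
    """Turn the CSV output of query_cmd() into a list of dicts, one per GPU."""
    readings = []
    for line in csv_text.splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 4:
            continue  # skip blank or unexpected lines
        readings.append({
            "index": int(fields[0]),
            "temp_c": _to_float(fields[1]),
            "power_w": _to_float(fields[2]),
            "sm_mhz": _to_float(fields[3]),
        })
    return readings


def read_once():
    """Run nvidia-smi once; return None if it fails (e.g. the GPU dropped off the bus)."""
    result = subprocess.run(query_cmd(), capture_output=True, text=True, check=False)
    if result.returncode != 0:
        return None
    return parse_readings(result.stdout)
```

Calling read_once() in a loop every few seconds during training shows how hot the cards get and how much power they draw just before a crash. If the crashes stop after running the command from lock_clocks_cmd with a reduced maximum clock, that points toward a power delivery problem rather than overheating. The command from reset_clocks_cmd restores the default behaviour.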

OS: Ubuntu 22.04.1 LTS
Driver Version: nvidia/520.56.06
GPUs: GeForce RTX 2070 SUPER

When I type nvidia-smi in the terminal, it shows the following error:
“Unable to determine the device handle for GPU 0000:01:00.0: Not Found”

And I get the following results when I run:

$ nvidia-debugdump -l

Found 1 NVIDIA devices
Error: nvmlDeviceGetHandleByIndex(): Not Found
FAILED to get details on GPU (0x0): Not Found

$ nvidia-debugdump -z -D

nvmlInit succeeded
Using ALL devices
Dumping all components.
nvdZip_Open(dump.zip) for writing succeeded
System: Dumping component: system_info.
GetCaptureBufferSize succeeded, bufSize: 0xfc
GetCaptureBuffer succeeded, bufSize: 0xc7
nvdZip_AddFile succeeded
internal_dumpSystemComponent() succeeded
System: Dumping component: error_data.
GetCaptureBufferSize succeeded, bufSize: 0x57
GetCaptureBuffer succeeded, bufSize: 0x32
nvdZip_AddFile succeeded
internal_dumpSystemComponent() succeeded
ERROR: internal_dumpNvLogComponent() failed, return code: 0x3e7
ERROR: internal_dumpGpuComponent() failed, return code: 0x6
ERROR: internal_dumpGpuComponent() failed, return code: 0x6
ERROR: internal_dumpNvLogComponent() failed, return code: 0x3e7
nvdZip_Close() succeeded

I’m new to Linux, so any help would be appreciated.