How do I completely solve "NVRM: GPU 0000:01:00.0: GPU has fallen off the bus"?

I was training a DL model on the GPU and it got stuck after about 20 minutes; when the error occurs I also cannot access the GPU through nvidia-smi. After rebooting, I can access the GPU with nvidia-smi without any error, but when I run the training program the problem happens again after about 20 minutes of training. I have used the same program to train DL models for many hours without errors before, so this is annoying and weird.

Possible root causes and remedies for the error:

  1. Overheating
  2. Insufficient/unstable power supply
  3. Reseating the card or moving it to a different PCIe slot
  4. System BIOS updates

I monitored the temperatures during training, and they were always below 50 °C, so it is not overheating. I also tried enabling persistence mode, but that did not help. Finally, I used the solution from "Unable to determine the device handle for GPU xxxxxxxx: Unknown Error" and the command below to temporarily work around the error; the training program then ran stably for 30 minutes before the error happened again. How can I fix the root cause for good?

Temporarily worked around the error with:

nvidia-smi -lgc 300,1500
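
For reference, a minimal sketch of standard nvidia-smi commands for the temperature monitoring and persistence-mode steps mentioned above (the exact flags here are illustrative, not copied from my session; only the -lgc line above is the actual workaround I ran):

nvidia-smi -pm 1                      # enable persistence mode (needs root)
nvidia-smi dmon -s pct                # continuously log power/temperature, clocks and PCIe throughput while training
nvidia-smi -q -d TEMPERATURE,POWER    # one-shot dump of thermal and power readings
nvidia-smi -rgc                       # reset the locked graphics clocks back to default afterwards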

Some useful information:

  • GPU and driver:
    • driver: 515.76
    • GPU: NVIDIA GeForce RTX 3060
    • CUDA: 11.7
    • TensorFlow: 2.10.0
  • dmesg -T
    [Sun Oct 16 11:45:40 2022] NVRM: GPU at PCI:0000:01:00: GPU-903cc954-07f3-f490-e3d4-7e79bffaa22f
    [Sun Oct 16 11:45:40 2022] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
    [Sun Oct 16 11:45:40 2022] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
    [Sun Oct 16 11:45:40 2022] NVRM: A GPU crash dump has been created. If possible, please run
                               NVRM: nvidia-bug-report.sh as root to collect this data before
                               NVRM: the NVIDIA kernel module is unloaded.
    
  • nvidia-debugdump --list
    Error: nvmlDeviceGetHandleByIndex(): Unknown Error
    FAILED to get details on GPU (0x0): Unknown Error
    
  • nvidia-smi
    Unable to determine the device handle for GPU 0000:01:00.0: Unknown Error
    
  • bug report
    nvidia-bug-report.log.gz (127.3 KB)

I have the exact same error with a GTX 970:

[ 1799.393003] pcieport 0000:00:01.0: AER: Multiple Corrected error received: 0000:00:01.0
[ 1799.442780] pcieport 0000:00:01.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[ 1799.442781] pcieport 0000:00:01.0:   device [8086:460d] error status/mask=00002001/00002000
[ 1799.442782] pcieport 0000:00:01.0:    [ 0] RxErr                 

Your PCIe bus is breaking down. If you are using a riser, please remove it. Another possible reason is the PCIe chipset on the mainboard overheating; please check whether it is actively cooled, the fan is working, and there is no dust. You can try to work around it by lowering the PCIe generation in the BIOS, if that option is available.
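
If you want to confirm the diagnosis before changing anything in the BIOS, here is a minimal sketch of the checks (it assumes the GPU sits at 0000:01:00.0 as in the logs above; the pcie_aspm=off kernel parameter is a common workaround for marginal links, not a guaranteed fix):

sudo lspci -vv -s 01:00.0 | grep -iE 'lnkcap|lnksta'   # compare the negotiated link speed/width (LnkSta) with the card's capability (LnkCap)
sudo dmesg -T | grep -iE 'aer|pcie bus error'          # check whether corrected/uncorrected PCIe errors keep accumulating while training
# If the BIOS offers no PCIe-generation setting, disabling ASPM sometimes stabilizes a flaky link:
# add pcie_aspm=off to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then run update-grub
# (Debian/Ubuntu; other distros use grub2-mkconfig) and reboot.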