A6000 GPU has fallen off the bus

We have four A6000 GPUs in Dell 750XA servers, which have run well for the past year. Recently we have started seeing errors such as "GPU has fallen off the bus". The NVIDIA bug report log file is attached. Please help identify the issue.

dmesg output:

[245533.041105] NVRM: GPU at PCI:0000:65:00: GPU-f3b02492-9e55-7fd2-ba39-a26df0bf8a2e
[245533.041113] NVRM: GPU Board Serial Number: 1320921021664
[245533.041114] NVRM: Xid (PCI:0000:65:00): 79, pid='', name=, GPU has fallen off the bus.
[245533.041118] NVRM: GPU 0000:65:00.0: GPU has fallen off the bus.
[245533.041120] NVRM: GPU 0000:65:00.0: GPU serial number is 1320921021664.
[245533.041120] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
[245533.041126] {2}[Hardware Error]: event severity: recoverable
[245533.041129] {2}[Hardware Error]: Error 0, type: fatal
[245533.041131] {2}[Hardware Error]: section_type: PCIe error
[245533.041133] {2}[Hardware Error]: port_type: 4, root port
[245533.041135] {2}[Hardware Error]: version: 3.0
[245533.041136] {2}[Hardware Error]: command: 0x0547, status: 0x4010
[245533.041138] {2}[Hardware Error]: device_id: 0000:64:02.0
[245533.041140] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
[245533.041140] {2}[Hardware Error]: slot: 33
[245533.041142] {2}[Hardware Error]: secondary_bus: 0x65
[245533.041143] {2}[Hardware Error]: vendor_id: 0x8086, device_id: 0x347a
[245533.041145] {2}[Hardware Error]: class_code: 060400
[245533.041147] {2}[Hardware Error]: bridge: secondary_status: 0x2000, control: 0x0003
[245533.041149] {2}[Hardware Error]: aer_uncor_status: 0x00000020, aer_uncor_mask: 0x01310000
[245533.041151] {2}[Hardware Error]: aer_uncor_severity: 0x044ef030
[245533.041152] {2}[Hardware Error]: TLP Header: ffffffff ffffffff ffffffff ffffffff
[245533.041205] pcieport 0000:64:02.0: AER: aer_status: 0x00000020, aer_mask: 0x01310000
[245533.041209] pcieport 0000:64:02.0: [ 5] SDES (First)
[245533.041213] pcieport 0000:64:02.0: AER: aer_layer=Transaction Layer, aer_agent=Receiver ID
[245533.041215] pcieport 0000:64:02.0: AER: aer_uncor_severity: 0x044ef030
[245533.041218] nvidia 0000:65:00.0: AER: can't recover (no error_detected callback)
[245533.041221] snd_hda_intel 0000:65:00.1: AER: can't recover (no error_detected callback)
[245535.078904] pcieport 0000:64:02.0: Data Link Layer Link Active not set in 1000 msec
[245535.078911] pcieport 0000:64:02.0: AER: Root Port link has been reset (-25)
[245535.078914] pcieport 0000:64:02.0: AER: subordinate device reset failed
[245535.078946] pcieport 0000:64:02.0: AER: device recovery failed
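
For anyone reading the AER fields in the log above, here is a minimal sketch of how the `aer_uncor_status` and `aer_uncor_mask` values break down into named bits. The bit positions follow the PCIe AER Uncorrectable Error Status register layout; the table and the function name `decode_aer_uncor` are my own, for illustration only, and any bit missing from the table is reported by number rather than guessed.

```python
# Bit positions of the PCIe AER Uncorrectable Error Status register.
# The same layout applies to the matching mask and severity registers.
AER_UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
    22: "Uncorrectable Internal Error",
    23: "MC Blocked TLP",
    24: "AtomicOp Egress Blocked",
    25: "TLP Prefix Blocked",
    26: "Poisoned TLP Egress Blocked",
}


def decode_aer_uncor(value):
    """Return the names of all bits set in a 32-bit AER uncorrectable value."""
    names = []
    for bit in range(32):
        if (value >> bit) & 1:
            # Fall back to the bit number for positions not in the table.
            names.append(AER_UNCOR_BITS.get(bit, f"bit {bit}"))
    return names


# Values copied from the log above.
print(decode_aer_uncor(0x00000020))  # aer_uncor_status
print(decode_aer_uncor(0x01310000))  # aer_uncor_mask
```

The status value has only bit 5 set, which matches the `[ 5] SDES (First)` line that the kernel prints for the root port at 0000:64:02.0, i.e. the link to the GPU went down unexpectedly.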

nvidia-bug-report.log.gz (2.6 MB)

The two GPUs are turned off, which points to a power issue. Please check the power connectors and the PSU.
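
To see which GPUs have logged Xid events, and how often, before and after reseating the power connectors, something like the following may help. It is only a sketch: the regular expression is based on the `NVRM: Xid (PCI:...)` line format in the log above, the function name `xid_by_gpu` is hypothetical, and in practice you would feed it saved `dmesg` output instead of the sample string.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   NVRM: Xid (PCI:0000:65:00): 79, pid='', name=, GPU has fallen off the bus.
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+),")


def xid_by_gpu(dmesg_text):
    """Group Xid codes by the PCI address of the GPU that reported them."""
    events = defaultdict(list)
    for match in XID_RE.finditer(dmesg_text):
        pci_address = match.group(1)
        xid_code = int(match.group(2))
        events[pci_address].append(xid_code)
    return dict(events)


# Sample input taken from the log in the original post.
sample = (
    "[245533.041114] NVRM: Xid (PCI:0000:65:00): 79, pid='', name=, "
    "GPU has fallen off the bus.\n"
    "[245533.041118] NVRM: GPU 0000:65:00.0: GPU has fallen off the bus.\n"
)
print(xid_by_gpu(sample))
```

If the same PCI addresses keep reporting Xid 79 after the power cabling has been checked, that narrows the problem down to those specific slots or cards.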