We have four A6000 GPUs in Dell 750XA servers, which have run well for the past year. Recently we have started seeing errors such as "GPU has fallen off the bus". The NVIDIA bug report log file is attached. Please help identify the issue.
dmesg output:
[245533.041105] NVRM: GPU at PCI:0000:65:00: GPU-f3b02492-9e55-7fd2-ba39-a26df0bf8a2e
[245533.041113] NVRM: GPU Board Serial Number: 1320921021664
[245533.041114] NVRM: Xid (PCI:0000:65:00): 79, pid='', name=, GPU has fallen off the bus.
[245533.041118] NVRM: GPU 0000:65:00.0: GPU has fallen off the bus.
[245533.041120] NVRM: GPU 0000:65:00.0: GPU serial number is 1320921021664.
[245533.041120] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
[245533.041126] {2}[Hardware Error]: event severity: recoverable
[245533.041129] {2}[Hardware Error]: Error 0, type: fatal
[245533.041131] {2}[Hardware Error]: section_type: PCIe error
[245533.041133] {2}[Hardware Error]: port_type: 4, root port
[245533.041135] {2}[Hardware Error]: version: 3.0
[245533.041136] {2}[Hardware Error]: command: 0x0547, status: 0x4010
[245533.041138] {2}[Hardware Error]: device_id: 0000:64:02.0
[245533.041140] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
[245533.041140] {2}[Hardware Error]: slot: 33
[245533.041142] {2}[Hardware Error]: secondary_bus: 0x65
[245533.041143] {2}[Hardware Error]: vendor_id: 0x8086, device_id: 0x347a
[245533.041145] {2}[Hardware Error]: class_code: 060400
[245533.041147] {2}[Hardware Error]: bridge: secondary_status: 0x2000, control: 0x0003
[245533.041149] {2}[Hardware Error]: aer_uncor_status: 0x00000020, aer_uncor_mask: 0x01310000
[245533.041151] {2}[Hardware Error]: aer_uncor_severity: 0x044ef030
[245533.041152] {2}[Hardware Error]: TLP Header: ffffffff ffffffff ffffffff ffffffff
[245533.041205] pcieport 0000:64:02.0: AER: aer_status: 0x00000020, aer_mask: 0x01310000
[245533.041209] pcieport 0000:64:02.0: [ 5] SDES (First)
[245533.041213] pcieport 0000:64:02.0: AER: aer_layer=Transaction Layer, aer_agent=Receiver ID
[245533.041215] pcieport 0000:64:02.0: AER: aer_uncor_severity: 0x044ef030
[245533.041218] nvidia 0000:65:00.0: AER: can't recover (no error_detected callback)
[245533.041221] snd_hda_intel 0000:65:00.1: AER: can't recover (no error_detected callback)
[245535.078904] pcieport 0000:64:02.0: Data Link Layer Link Active not set in 1000 msec
[245535.078911] pcieport 0000:64:02.0: AER: Root Port link has been reset (-25)
[245535.078914] pcieport 0000:64:02.0: AER: subordinate device reset failed
[245535.078946] pcieport 0000:64:02.0: AER: device recovery failed
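For reference, the AER values in the log above can be decoded by hand. The following is only a minimal sketch in Python: the function name `decode_aer_uncor` is hypothetical, and the bit-to-name table is my reading of the PCIe AER Uncorrectable Error Status register layout, so it should be checked against the PCIe specification before relying on it. It at least agrees with the kernel's own `[ 5] SDES` line.

```python
# Assumed bit layout of the PCIe AER Uncorrectable Error Status register.
# The table is an illustration, not an authoritative copy of the spec.
AER_UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
    22: "Uncorrectable Internal Error",
}


def decode_aer_uncor(status, mask=0, severity=0):
    """Hypothetical helper: list the error bits set in `status`.

    For each set bit, report whether it is masked (bit set in `mask`)
    and whether it is classed as fatal (bit set in `severity`).
    """
    findings = []
    for bit, name in sorted(AER_UNCOR_BITS.items()):
        flag = 1 << bit
        if status & flag:
            findings.append({
                "bit": bit,
                "name": name,
                "masked": bool(mask & flag),
                "fatal": bool(severity & flag),
            })
    return findings


if __name__ == "__main__":
    # Values copied from the dmesg lines for root port 0000:64:02.0.
    for f in decode_aer_uncor(0x00000020, 0x01310000, 0x044EF030):
        print(f)
```

With the values from this log, only bit 5 is set, it is not masked, and it is marked fatal in the severity register, which matches the "type: fatal" line reported by the root port.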
nvidia-bug-report.log.gz (2.6 MB)