Something goes wrong with PCIe, and Ubuntu freezes several times a day on a DGX Station V100.

nvidia-bug-report.log.gz (517.0 KB)

I have uploaded nvidia-bug-report.log.gz. Below is an excerpt from syslog for reference:

Jan 4 16:12:07 ovsdl-DGX-Station kernel: [ 1111.429467] NVRM: GPU at PCI:0000:0f:00: GPU-1ca07879-ec16-c356-5781-e1227dc3491d
Jan 4 16:12:07 ovsdl-DGX-Station kernel: [ 1111.429470] NVRM: GPU Board Serial Number: 0324518120193
Jan 4 16:12:07 ovsdl-DGX-Station kernel: [ 1111.429473] NVRM: Xid (PCI:0000:0f:00): 74, pid=‘’, name=, NVLink: fatal error detected on link 2(0x10000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
Jan 4 16:12:07 ovsdl-DGX-Station kernel: [ 1111.429563] NVRM: GPU at PCI:0000:0e:00: GPU-63ba7c03-5e88-1809-6e8a-afea17ceb012
Jan 4 16:12:07 ovsdl-DGX-Station kernel: [ 1111.429565] NVRM: GPU Board Serial Number: 0324518120442
Jan 4 16:12:07 ovsdl-DGX-Station kernel: [ 1111.429569] NVRM: Xid (PCI:0000:0e:00): 74, pid=‘’, name=, NVLink: fatal error detected on link 0(0x10000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
Jan 4 16:12:07 ovsdl-DGX-Station kernel: [ 1111.433338] NVRM: GPU at PCI:0000:07:00: GPU-220cb645-a1dc-eaa9-5196-f663bd43df01
Jan 4 16:12:07 ovsdl-DGX-Station kernel: [ 1111.433341] NVRM: GPU Board Serial Number: 0324418141545
Jan 4 16:12:07 ovsdl-DGX-Station kernel: [ 1111.433344] NVRM: Xid (PCI:0000:07:00): 74, pid=‘’, name=, NVLink: fatal error detected on link 3(0x10000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
Jan 4 16:12:08 ovsdl-DGX-Station kernel: [ 1112.438695] NVRM: Xid (PCI:0000:0e:00): 61, pid=‘’, name=, 0a76(29f0) 00000000 00000000
Jan 4 16:12:08 ovsdl-DGX-Station kernel: [ 1112.438852] NVRM: Xid (PCI:0000:0f:00): 61, pid=‘’, name=, 0a76(29f0) 00000000 00000000
Jan 4 16:12:08 ovsdl-DGX-Station kernel: [ 1112.442057] NVRM: Xid (PCI:0000:07:00): 61, pid=‘’, name=, 0a76(29f0) 00000000 00000000
Jan 4 16:12:11 ovsdl-DGX-Station kernel: [ 1115.427570] NVRM: GPU at PCI:0000:08:00: GPU-ceb60853-2618-02ad-a2a8-d4c72f186f3d
Jan 4 16:12:11 ovsdl-DGX-Station kernel: [ 1115.427572] NVRM: GPU Board Serial Number: 0324418141428
Jan 4 16:12:11 ovsdl-DGX-Station kernel: [ 1115.427574] NVRM: Xid (PCI:0000:08:00): 79, pid=‘’, name=, GPU has fallen off the bus.
Jan 4 16:12:11 ovsdl-DGX-Station kernel: [ 1115.427576] NVRM: GPU 0000:08:00.0: GPU has fallen off the bus.
Jan 4 16:12:11 ovsdl-DGX-Station kernel: [ 1115.427577] NVRM: GPU 0000:08:00.0: GPU serial number is 0324418141428.
Jan 4 16:12:11 ovsdl-DGX-Station kernel: [ 1115.428045] NVRM: A GPU crash dump has been created. If possible, please run
Jan 4 16:12:11 ovsdl-DGX-Station kernel: [ 1115.428045] NVRM: nvidia-bug-report.sh as root to collect this data before
Jan 4 16:12:11 ovsdl-DGX-Station kernel: [ 1115.428045] NVRM: the NVIDIA kernel module is unloaded.
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1124.448571] NVRM: Xid (PCI:0000:07:00): 8, pid=3896, name=msedge, Channel 00000038
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.963866] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: id=0010
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.963873] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0010(Requester ID)
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.963879] pcieport 0000:00:02.0: device [8086:6f04] error status/mask=00004000/00000000
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.963883] pcieport 0000:00:02.0: [14] Completion Timeout (First)
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.963887] pcieport 0000:00:02.0: broadcast error_detected message
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.963891] pcieport 0000:00:02.0: AER: Device recovery failed
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.981359] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: id=0010
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.981364] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0010(Requester ID)
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.981368] pcieport 0000:00:02.0: device [8086:6f04] error status/mask=00004000/00000000
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.981371] pcieport 0000:00:02.0: [14] Completion Timeout (First)
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.981374] pcieport 0000:00:02.0: broadcast error_detected message
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.981377] pcieport 0000:00:02.0: AER: Device recovery failed
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.998868] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: id=0010
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.998873] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0010(Requester ID)
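
To make an excerpt like the one above easier to read, the Xid events can be grouped by GPU PCI address. The following is only a rough sketch: the function name `summarize_xids` and the `SAMPLE` list are made up for illustration, the regular expression assumes the `NVRM: Xid (PCI:...): <code>,` layout seen in these lines, and the sample lines are copied from the log above.

```python
import re
from collections import defaultdict

# Assumed layout, taken from the syslog lines above:
#   "... NVRM: Xid (PCI:0000:0f:00): 74, pid=..., name=..., <message>"
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+),")


def summarize_xids(lines):
    """Return {pci_address: [Xid codes in order of appearance]}.

    Hypothetical helper for illustration; not part of any NVIDIA tool.
    """
    by_gpu = defaultdict(list)
    for line in lines:
        match = XID_RE.search(line)
        if match:
            by_gpu[match.group(1)].append(int(match.group(2)))
    return dict(by_gpu)


# A few lines copied from the excerpt above (the last one is an AER line
# and is expected to be ignored by the Xid pattern).
SAMPLE = [
    "Jan 4 16:12:07 ovsdl-DGX-Station kernel: [ 1111.429473] NVRM: Xid (PCI:0000:0f:00): 74, pid=‘’, name=, NVLink: fatal error detected on link 2(0x10000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)",
    "Jan 4 16:12:08 ovsdl-DGX-Station kernel: [ 1112.438695] NVRM: Xid (PCI:0000:0e:00): 61, pid=‘’, name=, 0a76(29f0) 00000000 00000000",
    "Jan 4 16:12:08 ovsdl-DGX-Station kernel: [ 1112.438852] NVRM: Xid (PCI:0000:0f:00): 61, pid=‘’, name=, 0a76(29f0) 00000000 00000000",
    "Jan 4 16:12:11 ovsdl-DGX-Station kernel: [ 1115.427574] NVRM: Xid (PCI:0000:08:00): 79, pid=‘’, name=, GPU has fallen off the bus.",
    "Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1124.448571] NVRM: Xid (PCI:0000:07:00): 8, pid=3896, name=msedge, Channel 00000038",
    "Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.963866] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: id=0010",
]

if __name__ == "__main__":
    for address, codes in sorted(summarize_xids(SAMPLE).items()):
        print(address, codes)
```

To run it against the full log instead of `SAMPLE`, something like `summarize_xids(open("/var/log/syslog", errors="replace"))` should work, assuming syslog is at that path on this system.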