PCIe errors freeze Ubuntu several times a day on a DGX Station V100 (the mouse pointer still moves, but clicks do nothing)

This happened several times today.
The syslog output is below.

Jan 4 16:12:11 ovsdl-DGX-Station kernel: [ 1115.427570] NVRM: GPU at PCI:0000:08:00: GPU-ceb60853-2618-02ad-a2a8-d4c72f186f3d
Jan 4 16:12:11 ovsdl-DGX-Station kernel: [ 1115.427572] NVRM: GPU Board Serial Number: 0324418141428
Jan 4 16:12:11 ovsdl-DGX-Station kernel: [ 1115.427574] NVRM: Xid (PCI:0000:08:00): 79, pid=‘’, name=, GPU has fallen off the bus.
Jan 4 16:12:11 ovsdl-DGX-Station kernel: [ 1115.427576] NVRM: GPU 0000:08:00.0: GPU has fallen off the bus.
Jan 4 16:12:11 ovsdl-DGX-Station kernel: [ 1115.427577] NVRM: GPU 0000:08:00.0: GPU serial number is 0324418141428.
Jan 4 16:12:11 ovsdl-DGX-Station kernel: [ 1115.428045] NVRM: A GPU crash dump has been created. If possible, please run
Jan 4 16:12:11 ovsdl-DGX-Station kernel: [ 1115.428045] NVRM: nvidia-bug-report.sh as root to collect this data before
Jan 4 16:12:11 ovsdl-DGX-Station kernel: [ 1115.428045] NVRM: the NVIDIA kernel module is unloaded.
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1124.448571] NVRM: Xid (PCI:0000:07:00): 8, pid=3896, name=msedge, Channel 00000038
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.963866] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: id=0010
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.963873] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0010(Requester ID)
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.963879] pcieport 0000:00:02.0: device [8086:6f04] error status/mask=00004000/00000000
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.963883] pcieport 0000:00:02.0: [14] Completion Timeout (First)
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.963887] pcieport 0000:00:02.0: broadcast error_detected message
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.963891] pcieport 0000:00:02.0: AER: Device recovery failed
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.981359] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: id=0010
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.981364] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0010(Requester ID)
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.981368] pcieport 0000:00:02.0: device [8086:6f04] error status/mask=00004000/00000000
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.981371] pcieport 0000:00:02.0: [14] Completion Timeout (First)
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.981374] pcieport 0000:00:02.0: broadcast error_detected message
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.981377] pcieport 0000:00:02.0: AER: Device recovery failed
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.998868] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: id=0010
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.998873] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0010(Requester ID)
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.998878] pcieport 0000:00:02.0: device [8086:6f04] error status/mask=00004000/00000000
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.998881] pcieport 0000:00:02.0: [14] Completion Timeout (First)
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.998885] pcieport 0000:00:02.0: broadcast error_detected message
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.998887] pcieport 0000:00:02.0: AER: Device recovery failed
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.016377] NVRM: Xid (PCI:0000:07:00): 79, pid=‘’, name=, GPU has fallen off the bus.
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.016379] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: id=0010
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.016381] NVRM: GPU 0000:07:00.0: GPU has fallen off the bus.
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.016383] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0010(Requester ID)
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.016385] NVRM: GPU 0000:07:00.0: GPU serial number is 0324418141545.
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.016388] pcieport 0000:00:02.0: device [8086:6f04] error status/mask=00004000/00000000
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.016391] pcieport 0000:00:02.0: [14] Completion Timeout (First)
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.016394] pcieport 0000:00:02.0: broadcast error_detected message
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.016397] pcieport 0000:00:02.0: AER: Device recovery failed
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.033892] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: id=0010
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.033897] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0010(Requester ID)
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.033900] pcieport 0000:00:02.0: device [8086:6f04] error status/mask=00004000/00000000
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.033903] pcieport 0000:00:02.0: [14] Completion Timeout (First)
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.033909] pcieport 0000:00:02.0: broadcast error_detected message
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.033911] pcieport 0000:00:02.0: AER: Device recovery failed
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.034432] NVRM: A GPU crash dump has been created. If possible, please run
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.034432] NVRM: nvidia-bug-report.sh as root to collect this data before
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.034432] NVRM: the NVIDIA kernel module is unloaded.
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.051401] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: id=0010
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.051406] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0010(Requester ID)
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.051409] pcieport 0000:00:02.0: device [8086:6f04] error status/mask=00004000/00000000
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.051412] pcieport 0000:00:02.0: [14] Completion Timeout (First)
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.051416] pcieport 0000:00:02.0: broadcast error_detected message
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.051418] pcieport 0000:00:02.0: AER: Device recovery failed
Jan 4 16:12:25 ovsdl-DGX-Station /usr/lib/gdm3/gdm-x-session[3042]: (EE) client bug: timer event2 debounce: offset negative (-923ms)
Jan 4 16:12:25 ovsdl-DGX-Station /usr/lib/gdm3/gdm-x-session[3042]: (EE) client bug: timer event2 debounce short: offset negative (-936ms)
Jan 4 16:12:25 ovsdl-DGX-Station kernel: [ 1129.949400] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c37e:0:0:0x0000000f
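
To see at a glance which GPUs are dropping off, it can help to tally the Xid codes per PCI address and count the AER reports per root port. Below is a minimal Python sketch that does this for a few lines copied from the log above. The regular expressions and the `summarize` helper are my own assumptions about the line format, not an NVIDIA tool.

```python
import re
from collections import defaultdict

# Matches NVRM Xid lines, e.g. "NVRM: Xid (PCI:0000:08:00): 79, ..."
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:]+)\): (\d+),")
# Matches the AER summary line reported by a PCIe port.
AER_RE = re.compile(r"pcieport ([0-9a-fA-F:.]+): AER: .*error received")


def summarize(lines):
    """Return (xid codes per GPU PCI address, AER report count per port)."""
    xids = defaultdict(list)
    aer = defaultdict(int)
    for line in lines:
        m = XID_RE.search(line)
        if m:
            xids[m.group(1)].append(int(m.group(2)))
            continue
        m = AER_RE.search(line)
        if m:
            aer[m.group(1)] += 1
    return dict(xids), dict(aer)


# A few lines taken from the syslog excerpt in this thread.
SAMPLE = [
    "Jan 4 16:12:11 ovsdl-DGX-Station kernel: [ 1115.427574] NVRM: Xid (PCI:0000:08:00): 79, pid=‘’, name=, GPU has fallen off the bus.",
    "Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1124.448571] NVRM: Xid (PCI:0000:07:00): 8, pid=3896, name=msedge, Channel 00000038",
    "Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.963866] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: id=0010",
    "Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.016377] NVRM: Xid (PCI:0000:07:00): 79, pid=‘’, name=, GPU has fallen off the bus.",
]

xid_summary, aer_summary = summarize(SAMPLE)
print(xid_summary)
print(aer_summary)
```

To run it against the real log, pass `open("/var/log/syslog", errors="replace")` to `summarize` instead of `SAMPLE` (the path is the Ubuntu default and may differ on your system).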

Since it’s a DGX Station, I guess the GPU is broken, or it came loose when the box was moved around recently.

Thanks for your help.
The log now looks like this:

Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.989180] INFO: task gnome-shell:2722 blocked for more than 120 seconds.
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.989186] Tainted: P OEL 4.15.0-65-generic #74-Ubuntu
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.989188] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.989191] gnome-shell D 0 2722 2456 0x00000000
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.989195] Call Trace:
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.989203] __schedule+0x24e/0x880
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.989208] ? ttwu_do_wakeup+0x1e/0x140
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.989211] schedule+0x2c/0x80
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.989214] rwsem_down_write_failed+0x1ea/0x360
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.989220] ? __wake_up_common+0x73/0x130
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.989224] call_rwsem_down_write_failed+0x17/0x30
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.989227] ? call_rwsem_down_write_failed+0x17/0x30
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.989231] down_write+0x2d/0x40
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.989457] os_acquire_rwlock_write+0x3b/0x50 [nvidia]
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.989795] _nv038381rm+0xc/0x30 [nvidia]
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.990051] ? _nv039329rm+0x18d/0x1d0 [nvidia]
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.990297] ? _nv041056rm+0x45/0xd0 [nvidia]
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.990633] ? _nv041001rm+0x142/0x2b0 [nvidia]
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.990883] ? _nv039291rm+0x15a/0x2e0 [nvidia]
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.991128] ? _nv039292rm+0x5b/0x90 [nvidia]
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.991370] ? _nv039292rm+0x31/0x90 [nvidia]
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.991616] ? _nv012677rm+0x1d/0x30 [nvidia]
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.991862] ? _nv039307rm+0xb0/0xb0 [nvidia]
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.992117] ? _nv012699rm+0x54/0x70 [nvidia]
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.992389] ? _nv011412rm+0xc4/0x120 [nvidia]
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.992644] ? _nv000657rm+0x63/0x70 [nvidia]
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.992890] ? _nv000580rm+0x2c/0x40 [nvidia]
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.993245] ? _nv000694rm+0x86c/0xc80 [nvidia]
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.993616] ? rm_ioctl+0x54/0xb0 [nvidia]
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.993805] ? nvidia_ioctl+0x2dc/0x840 [nvidia]
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.993998] ? nvidia_frontend_unlocked_ioctl+0x42/0x50 [nvidia]
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.994007] ? do_vfs_ioctl+0xa8/0x630
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.994014] ? __sys_recvmsg+0x80/0x90
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.994020] ? SyS_ioctl+0x79/0x90
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.994028] ? do_syscall_64+0x73/0x130
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.994035] ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2
Jan 6 11:58:18 ovsdl-DGX-Station kernel: [ 1584.341120] watchdog: BUG: soft lockup - CPU#14 stuck for 22s! [gnome-shell:3201]
Jan 6 11:58:18 ovsdl-DGX-Station kernel: [ 1584.341123] Modules linked in: ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc aufs overlay iptable_filter nls_iso8859_1 binfmt_misc snd_hda_codec_hdmi intel_rapl sb_edac x86_pkg_temp_thermal coretemp kvm snd_hda_codec_realtek eeepc_wmi irqbypass crct10dif_pclmul snd_hda_codec_generic snd_seq_midi asus_wmi crc32_pclmul snd_seq_midi_event sparse_keymap ghash_clmulni_intel video intel_wmi_thunderbolt wmi_bmof mxm_wmi pcbc input_leds joydev snd_rawmidi snd_hda_intel aesni_intel aes_x86_64 crypto_simd snd_seq glue_helper snd_hda_codec cryptd snd_hda_core intel_cstate snd_hwdep intel_rapl_perf snd_pcm snd_seq_device snd_timer mei_me

GPU 1 overheats even when I run my code on the other GPUs, and then it falls off the bus. Do you have any advice for this situation?
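
To confirm which card is running hot before it drops off the bus, the temperatures can be polled with `nvidia-smi --query-gpu`. Below is a rough Python sketch that parses that CSV output. The sample text, the 85 °C limit, and the helper names are made up for illustration; the two bus IDs are borrowed from the log, but which index they map to on this machine is an assumption.

```python
import subprocess

# nvidia-smi query that prints one "index, bus id, temperature" row per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,pci.bus_id,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_temps(csv_text):
    """Parse the CSV rows into (index, bus_id, temp_c) tuples.

    A temperature field that is not an integer (for example an error
    marker from a failing GPU) is stored as None.
    """
    rows = []
    for line in csv_text.strip().splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 3:
            continue
        index, bus_id, temp = fields
        try:
            temp_c = int(temp)
        except ValueError:
            temp_c = None
        rows.append((index, bus_id, temp_c))
    return rows


def read_temps():
    """Run nvidia-smi and parse its output (not called in this sketch)."""
    result = subprocess.run(
        QUERY, capture_output=True, text=True, timeout=30, check=True
    )
    return parse_temps(result.stdout)


def over_limit(rows, limit_c):
    """Return the rows whose temperature is known and at or above limit_c."""
    return [r for r in rows if r[2] is not None and r[2] >= limit_c]


# Made-up sample output, for illustration only.
SAMPLE = """0, 00000000:07:00.0, 41
1, 00000000:08:00.0, 88
"""

hot = over_limit(parse_temps(SAMPLE), limit_c=85)
print(hot)
```

On the real machine, calling `read_temps()` in a loop with a short sleep and logging the rows should show whether one GPU climbs while the others stay cool. Note that `nvidia-smi` itself may hang or fail once a GPU has fallen off the bus, hence the timeout.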

Sounds like a heat sink came loose. I can’t really help with that, since the DGX Stations are custom designs by NVIDIA. Better to check with the vendor you got it from about sending it in for repair.

Thanks for your help. Now, when GPU 1 overheats, the whole system freezes, but the other three GPUs are fine. How can I use just the other three GPUs so that the system at least works normally?

You can use nvidia-smi drain to disable that one GPU:
https://unix.stackexchange.com/questions/654075/how-can-i-disable-and-later-re-enable-one-of-my-nvidia-gpus
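
As a rough sketch of what that looks like in practice, the Python snippet below builds the two `nvidia-smi` calls described in the linked answer: turn persistence mode off for the GPU, then put it into drain state. The bus ID `0000:08:00.0` is taken from the log above purely as an example; check which card is actually the faulty one first (for instance with `nvidia-smi --query-gpu=index,pci.bus_id --format=csv`). The helper names and the dry-run default are my own choices, not part of any NVIDIA tooling.

```python
import shlex
import subprocess


def drain_commands(pci_bus_id):
    """Build the nvidia-smi calls that take one GPU out of service."""
    return [
        # Disable persistence mode on the target GPU first.
        ["nvidia-smi", "-i", pci_bus_id, "-pm", "0"],
        # Put the GPU into drain state so no new work is accepted.
        ["nvidia-smi", "drain", "-p", pci_bus_id, "-m", "1"],
    ]


def apply(commands, dry_run=True):
    """Print each command; run it only when dry_run is False (needs root)."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# Example bus ID from the log in this thread; adjust to the faulty GPU.
commands = drain_commands("0000:08:00.0")
apply(commands)  # dry run: only prints the commands
```

Draining generally needs root and no processes using the GPU, and the state is typically not kept across reboots, so it would have to be reapplied at boot. Setting the same GPU back with `-m 0` re-enables it.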