PCIe errors on DGX Station V100: Ubuntu freezes several times a day (mouse moves but clicks do nothing)

huxy7913 · January 4, 2023, 8:33am

This has happened several times today.
The syslog is below:

Jan 4 16:12:11 ovsdl-DGX-Station kernel: [ 1115.427570] NVRM: GPU at PCI:0000:08:00: GPU-ceb60853-2618-02ad-a2a8-d4c72f186f3d
Jan 4 16:12:11 ovsdl-DGX-Station kernel: [ 1115.427572] NVRM: GPU Board Serial Number: 0324418141428
Jan 4 16:12:11 ovsdl-DGX-Station kernel: [ 1115.427574] NVRM: Xid (PCI:0000:08:00): 79, pid=‘’, name=, GPU has fallen off the bus.
Jan 4 16:12:11 ovsdl-DGX-Station kernel: [ 1115.427576] NVRM: GPU 0000:08:00.0: GPU has fallen off the bus.
Jan 4 16:12:11 ovsdl-DGX-Station kernel: [ 1115.427577] NVRM: GPU 0000:08:00.0: GPU serial number is 0324418141428.
Jan 4 16:12:11 ovsdl-DGX-Station kernel: [ 1115.428045] NVRM: A GPU crash dump has been created. If possible, please run
Jan 4 16:12:11 ovsdl-DGX-Station kernel: [ 1115.428045] NVRM: nvidia-bug-report.sh as root to collect this data before
Jan 4 16:12:11 ovsdl-DGX-Station kernel: [ 1115.428045] NVRM: the NVIDIA kernel module is unloaded.
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1124.448571] NVRM: Xid (PCI:0000:07:00): 8, pid=3896, name=msedge, Channel 00000038
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.963866] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: id=0010
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.963873] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0010(Requester ID)
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.963879] pcieport 0000:00:02.0: device [8086:6f04] error status/mask=00004000/00000000
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.963883] pcieport 0000:00:02.0: [14] Completion Timeout (First)
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.963887] pcieport 0000:00:02.0: broadcast error_detected message
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.963891] pcieport 0000:00:02.0: AER: Device recovery failed
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.981359] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: id=0010
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.981364] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0010(Requester ID)
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.981368] pcieport 0000:00:02.0: device [8086:6f04] error status/mask=00004000/00000000
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.981371] pcieport 0000:00:02.0: [14] Completion Timeout (First)
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.981374] pcieport 0000:00:02.0: broadcast error_detected message
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.981377] pcieport 0000:00:02.0: AER: Device recovery failed
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.998868] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: id=0010
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.998873] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0010(Requester ID)
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.998878] pcieport 0000:00:02.0: device [8086:6f04] error status/mask=00004000/00000000
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.998881] pcieport 0000:00:02.0: [14] Completion Timeout (First)
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.998885] pcieport 0000:00:02.0: broadcast error_detected message
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.998887] pcieport 0000:00:02.0: AER: Device recovery failed
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.016377] NVRM: Xid (PCI:0000:07:00): 79, pid=‘’, name=, GPU has fallen off the bus.
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.016379] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: id=0010
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.016381] NVRM: GPU 0000:07:00.0: GPU has fallen off the bus.
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.016383] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0010(Requester ID)
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.016385] NVRM: GPU 0000:07:00.0: GPU serial number is 0324418141545.
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.016388] pcieport 0000:00:02.0: device [8086:6f04] error status/mask=00004000/00000000
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.016391] pcieport 0000:00:02.0: [14] Completion Timeout (First)
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.016394] pcieport 0000:00:02.0: broadcast error_detected message
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.016397] pcieport 0000:00:02.0: AER: Device recovery failed
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.033892] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: id=0010
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.033897] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0010(Requester ID)
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.033900] pcieport 0000:00:02.0: device [8086:6f04] error status/mask=00004000/00000000
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.033903] pcieport 0000:00:02.0: [14] Completion Timeout (First)
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.033909] pcieport 0000:00:02.0: broadcast error_detected message
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.033911] pcieport 0000:00:02.0: AER: Device recovery failed
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.034432] NVRM: A GPU crash dump has been created. If possible, please run
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.034432] NVRM: nvidia-bug-report.sh as root to collect this data before
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.034432] NVRM: the NVIDIA kernel module is unloaded.
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.051401] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: id=0010
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.051406] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0010(Requester ID)
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.051409] pcieport 0000:00:02.0: device [8086:6f04] error status/mask=00004000/00000000
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.051412] pcieport 0000:00:02.0: [14] Completion Timeout (First)
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.051416] pcieport 0000:00:02.0: broadcast error_detected message
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.051418] pcieport 0000:00:02.0: AER: Device recovery failed
Jan 4 16:12:25 ovsdl-DGX-Station /usr/lib/gdm3/gdm-x-session[3042]: (EE) client bug: timer event2 debounce: offset negative (-923ms)
Jan 4 16:12:25 ovsdl-DGX-Station /usr/lib/gdm3/gdm-x-session[3042]: (EE) client bug: timer event2 debounce short: offset negative (-936ms)
Jan 4 16:12:25 ovsdl-DGX-Station kernel: [ 1129.949400] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c37e:0:0:0x0000000f
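
To see at a glance which GPU reported which Xid, the NVRM lines can be pulled out of the syslog with a small script. This is only a rough sketch: the function name `xids_by_gpu` and the regular expression are my own, written against the line format shown above, and other driver versions may log a different format.

```python
import re
from collections import defaultdict

# Matches NVRM Xid lines in the format seen in the log above, e.g.
#   NVRM: Xid (PCI:0000:08:00): 79, ... GPU has fallen off the bus.
# The pattern is an assumption based on these lines only.
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:]+)\): (\d+),")


def xids_by_gpu(lines):
    """Return {pci_address: [Xid codes in order of appearance]}."""
    found = defaultdict(list)
    for line in lines:
        match = XID_RE.search(line)
        if match:
            found[match.group(1)].append(int(match.group(2)))
    return dict(found)


# Four lines copied from the log above (three Xid lines, one AER line).
sample = [
    "Jan 4 16:12:11 ovsdl-DGX-Station kernel: [ 1115.427574] NVRM: Xid (PCI:0000:08:00): 79, pid=‘’, name=, GPU has fallen off the bus.",
    "Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1124.448571] NVRM: Xid (PCI:0000:07:00): 8, pid=3896, name=msedge, Channel 00000038",
    "Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.963883] pcieport 0000:00:02.0: [14] Completion Timeout (First)",
    "Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.016377] NVRM: Xid (PCI:0000:07:00): 79, pid=‘’, name=, GPU has fallen off the bus.",
]

print(xids_by_gpu(sample))
```

On a real system the same function can be fed an open file, for example `xids_by_gpu(open("/var/log/syslog"))`, assuming the kernel messages are logged there.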

generix · January 5, 2023, 11:57am

Since it’s a DGX Station, I’d guess the GPU is broken, or it came loose if the box was moved around recently.

huxy7913 · January 6, 2023, 4:06am

Thanks for your help.
The log now looks like this:

Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.989180] INFO: task gnome-shell:2722 blocked for more than 120 seconds.
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.989186] Tainted: P OEL 4.15.0-65-generic #74-Ubuntu
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.989188] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.989191] gnome-shell D 0 2722 2456 0x00000000
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.989195] Call Trace:
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.989203] __schedule+0x24e/0x880
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.989208] ? ttwu_do_wakeup+0x1e/0x140
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.989211] schedule+0x2c/0x80
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.989214] rwsem_down_write_failed+0x1ea/0x360
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.989220] ? __wake_up_common+0x73/0x130
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.989224] call_rwsem_down_write_failed+0x17/0x30
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.989227] ? call_rwsem_down_write_failed+0x17/0x30
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.989231] down_write+0x2d/0x40
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.989457] os_acquire_rwlock_write+0x3b/0x50 [nvidia]
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.989795] _nv038381rm+0xc/0x30 [nvidia]
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.990051] ? _nv039329rm+0x18d/0x1d0 [nvidia]
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.990297] ? _nv041056rm+0x45/0xd0 [nvidia]
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.990633] ? _nv041001rm+0x142/0x2b0 [nvidia]
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.990883] ? _nv039291rm+0x15a/0x2e0 [nvidia]
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.991128] ? _nv039292rm+0x5b/0x90 [nvidia]
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.991370] ? _nv039292rm+0x31/0x90 [nvidia]
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.991616] ? _nv012677rm+0x1d/0x30 [nvidia]
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.991862] ? _nv039307rm+0xb0/0xb0 [nvidia]
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.992117] ? _nv012699rm+0x54/0x70 [nvidia]
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.992389] ? _nv011412rm+0xc4/0x120 [nvidia]
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.992644] ? _nv000657rm+0x63/0x70 [nvidia]
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.992890] ? _nv000580rm+0x2c/0x40 [nvidia]
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.993245] ? _nv000694rm+0x86c/0xc80 [nvidia]
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.993616] ? rm_ioctl+0x54/0xb0 [nvidia]
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.993805] ? nvidia_ioctl+0x2dc/0x840 [nvidia]
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.993998] ? nvidia_frontend_unlocked_ioctl+0x42/0x50 [nvidia]
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.994007] ? do_vfs_ioctl+0xa8/0x630
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.994014] ? __sys_recvmsg+0x80/0x90
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.994020] ? SyS_ioctl+0x79/0x90
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.994028] ? do_syscall_64+0x73/0x130
Jan 6 11:58:06 ovsdl-DGX-Station kernel: [ 1571.994035] ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2
Jan 6 11:58:18 ovsdl-DGX-Station kernel: [ 1584.341120] watchdog: BUG: soft lockup - CPU#14 stuck for 22s! [gnome-shell:3201]
Jan 6 11:58:18 ovsdl-DGX-Station kernel: [ 1584.341123] Modules linked in: ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc aufs overlay iptable_filter nls_iso8859_1 binfmt_misc snd_hda_codec_hdmi intel_rapl sb_edac x86_pkg_temp_thermal coretemp kvm snd_hda_codec_realtek eeepc_wmi irqbypass crct10dif_pclmul snd_hda_codec_generic snd_seq_midi asus_wmi crc32_pclmul snd_seq_midi_event sparse_keymap ghash_clmulni_intel video intel_wmi_thunderbolt wmi_bmof mxm_wmi pcbc input_leds joydev snd_rawmidi snd_hda_intel aesni_intel aes_x86_64 crypto_simd snd_seq glue_helper snd_hda_codec cryptd snd_hda_core intel_cstate snd_hwdep intel_rapl_perf snd_pcm snd_seq_device snd_timer mei_me

huxy7913 · January 6, 2023, 6:27am

GPU 1 overheats even when I run my code on the other GPUs, and then it falls off the bus. Do you have any advice for this situation?
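
For reference, one way to watch the temperatures and see which PCI address belongs to which GPU index is to poll `nvidia-smi --query-gpu`. The sketch below is an assumption-laden example, not something from this system: the helper names are made up, the sample values in the last lines are invented for illustration, and a GPU that has already fallen off the bus may make `nvidia-smi` itself fail or report a non-numeric value.

```python
import subprocess

# Query fields are standard nvidia-smi --query-gpu properties.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,pci.bus_id,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_temperatures(csv_text):
    """Turn 'index, bus_id, temp' rows into (int, str, int or None) tuples."""
    readings = []
    for row in csv_text.strip().splitlines():
        index, bus_id, temp = [field.strip() for field in row.split(",")]
        # A lost GPU may report something non-numeric; keep it as None.
        readings.append((int(index), bus_id, int(temp) if temp.isdigit() else None))
    return readings


def read_temperatures():
    """Run nvidia-smi once and return the parsed readings."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_temperatures(result.stdout)


# Invented sample output, only to show the parsing (not real readings).
example = "0, 00000000:07:00.0, 45\n1, 00000000:08:00.0, [N/A]\n"
print(parse_temperatures(example))
```

Calling `read_temperatures()` in a loop with a short sleep, and logging the result to a file, would show whether the temperature climbs before the freeze.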

generix · January 6, 2023, 11:38am

Sounds like a heat sink came loose. I can’t really help with that, since the DGX Stations are custom designs by NVIDIA. Better to check with the vendor you got it from about sending it in for repair.

huxy7913 · January 6, 2023, 1:43pm

Thanks for your help. Now when GPU 1 overheats, the system freezes, but the other three GPUs are fine. How can I use only the other three GPUs so that the system at least works normally?

generix · January 6, 2023, 1:44pm

You can use nvidia-smi drain:
https://unix.stackexchange.com/questions/654075/how-can-i-disable-and-later-re-enable-one-of-my-nvidia-gpus
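
As a rough sketch of what that looks like, the script below only builds the commands and does not run them. The PCI address `0000:08:00.0` is taken from the first "fallen off the bus" message in the log and is an assumption about which GPU is the faulty one; the helper names are made up. The commands need root, draining normally requires that nothing (including the X server) is using that GPU, and persistence mode has to be off for it, which may also mean stopping `nvidia-persistenced` if it is running.

```python
def drain_commands(pci_bus_id):
    """Return the nvidia-smi invocations to take one GPU out of service."""
    return [
        # Turn persistence mode off for this GPU first.
        ["nvidia-smi", "-i", pci_bus_id, "-pm", "0"],
        # Put the GPU into drain state (-m 1 = draining, -m 0 = not draining).
        ["nvidia-smi", "drain", "-p", pci_bus_id, "-m", "1"],
    ]


def visible_devices(all_indices, excluded):
    """Build a CUDA_VISIBLE_DEVICES value that skips the excluded indices."""
    return ",".join(str(i) for i in all_indices if i not in excluded)


# Assumed faulty GPU: 0000:08:00.0 (from the log), called index 1 here.
for command in drain_commands("0000:08:00.0"):
    print("sudo " + " ".join(command))

print("CUDA_VISIBLE_DEVICES=" + visible_devices(range(4), {1}))
```

Note that `CUDA_VISIBLE_DEVICES` only hides a GPU from CUDA applications; the driver still talks to it, so on its own it may not prevent the freeze. CUDA's device numbering can also differ from `nvidia-smi`'s unless `CUDA_DEVICE_ORDER=PCI_BUS_ID` is set.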
