$ inxi -b
System: Host: RuRo-Desktop Kernel: 4.14.78-1-MANJARO x86_64 bits: 64 Desktop: N/A Distro: Manjaro Linux
Machine: Type: Desktop System: ASUS product: All Series v: N/A serial: N/A
Mobo: ASUSTeK model: MAXIMUS VI FORMULA v: Rev 1.xx serial: 130915507100123
BIOS: American Megatrends v: 0714 date: 07/09/2013
CPU: Quad Core: Intel Core i7-4770K type: MT MCP speed: 850 MHz min/max: 800/3900 MHz
Graphics: Device-1: NVIDIA GP102 [GeForce GTX 1080 Ti] driver: nvidia v: 410.66
Display: server: X.Org 1.20.2 driver: nvidia resolution: 1920x1080~60Hz
OpenGL: renderer: GeForce GTX 1080 Ti/PCIe/SSE2 v: 4.6.0 NVIDIA 410.66
Network: Device-1: Intel Ethernet I217-V driver: e1000e
Device-2: Broadcom Limited BCM4352 802.11ac Wireless Network Adapter driver: wl
Drives: Local Storage: total: 2.04 TiB used: 103.15 GiB (4.9%)
Info: Processes: 231 Uptime: 10m Memory: 15.60 GiB used: 2.29 GiB (14.7%) Shell: zsh inxi: 3.0.26
Sometimes, when the GPU is under heavy load, the system freezes and errors like these show up in the journal:
Oct 22 00:09:33 RuRo-Desktop kernel: NVRM: GPU at PCI:0000:01:00: GPU-be978d5d-1916-4dde-78ab-6bbd52c29779
Oct 22 00:09:33 RuRo-Desktop kernel: NVRM: GPU Board Serial Number:
Oct 22 00:09:33 RuRo-Desktop kernel: NVRM: Xid (PCI:0000:01:00): 8, Channel 0000003c
Oct 22 00:09:34 RuRo-Desktop kernel: NVRM: Xid (PCI:0000:01:00): 38, 0008 0000902d 00000000 00000000 00000000 00000000
Oct 22 00:09:34 RuRo-Desktop kernel: NVRM: Xid (PCI:0000:01:00): 38, 0008 0000902d 00000000 00000000 00000000 00000000
Oct 22 00:09:35 RuRo-Desktop kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Oct 22 00:09:57 RuRo-Desktop kernel: watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [swapper/1:0]
Oct 22 00:09:57 RuRo-Desktop kernel: Modules linked in: rfcomm fuse input_leds bnep nct6775 hwmon_vid btusb btrtl btbcm btintel bluetooth intel_rapl razerkbd(O) ecdh_generic x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_realtek kvm_intel snd_hda_codec_generic joydev snd_hda_codec_hdmi mousedev kvm wl(PO) irqbypass crct10dif_pclmul crc32_pclmul eeepc_wmi ghash_clmulni_intel asus_wmi pcbc iTCO_wdt sparse_keymap aesni_intel iTCO_vendor_support led_class evdev aes_x86_64 wmi_bmof mxm_wmi crypto_simd mac_hid glue_helper cryptd snd_hda_intel snd_hda_codec intel_cstate cfg80211 intel_rapl_perf snd_hda_core snd_hwdep pcspkr snd_pcm i2c_i801 snd_timer eeprom rfkill e1000e snd soundcore mei_me lpc_ich mei ptp shpchp pps_core thermal fan video wmi intel_smartconnect pcc_cpufreq button sch_fq_codel uinput coretemp msr pci_stub
Oct 22 00:09:57 RuRo-Desktop kernel: vboxpci(O) vboxnetflt(O) vboxnetadp(O) vboxdrv(O) sg crypto_user ip_tables x_tables hid_generic usbhid hid ext4 crc32c_generic crc16 mbcache jbd2 fscrypto sd_mod ahci libahci xhci_pci libata ehci_pci xhci_hcd ehci_hcd crc32c_intel scsi_mod usbcore usb_common nvidia_drm(PO) drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm agpgart nvidia_uvm(PO) nvidia_modeset(PO) nvidia(PO) ipmi_devintf ipmi_msghandler
Oct 22 00:09:57 RuRo-Desktop kernel: CPU: 1 PID: 0 Comm: swapper/1 Tainted: P O 4.14.77-1-MANJARO #1
Oct 22 00:09:57 RuRo-Desktop kernel: Hardware name: ASUS All Series/MAXIMUS VI FORMULA, BIOS 0714 07/09/2013
Oct 22 00:09:57 RuRo-Desktop kernel: task: ffff8dbe4c65e580 task.stack: ffffacd001918000
Oct 22 00:09:57 RuRo-Desktop kernel: RIP: 0010:_nv030757rm+0x13/0x30 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel: RSP: 0018:ffff8dbe5ec43a70 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
Oct 22 00:09:57 RuRo-Desktop kernel: RAX: 0000000000000000 RBX: 00000000132000a1 RCX: 0000000000000000
Oct 22 00:09:57 RuRo-Desktop kernel: RDX: ffffacd009000000 RSI: ffff8dbe48648008 RDI: ffff8dbe4c35c808
Oct 22 00:09:57 RuRo-Desktop kernel: RBP: ffff8dbe47702a18 R08: ffff8dbe47a1cb48 R09: ffff8dbe47702a24
Oct 22 00:09:57 RuRo-Desktop kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffc04caf5e
Oct 22 00:09:57 RuRo-Desktop kernel: R13: ffff8dbe48648f60 R14: 0000000000000000 R15: 0000000000000000
Oct 22 00:09:57 RuRo-Desktop kernel: FS: 0000000000000000(0000) GS:ffff8dbe5ec40000(0000) knlGS:0000000000000000
Oct 22 00:09:57 RuRo-Desktop kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 22 00:09:57 RuRo-Desktop kernel: CR2: 000016e8ea7d8000 CR3: 000000005400a002 CR4: 00000000001606e0
Oct 22 00:09:57 RuRo-Desktop kernel: Call Trace:
Oct 22 00:09:57 RuRo-Desktop kernel: <IRQ>
Oct 22 00:09:57 RuRo-Desktop kernel: ? _nv021513rm+0xf8/0x130 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel: ? _nv026844rm+0x54/0x340 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel: ? _nv026842rm+0xfb/0x100 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel: ? _nv019479rm+0x57/0xb0 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel: ? _nv007069rm+0x1bc/0x220 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel: ? _nv019290rm+0x91/0xb0 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel: ? _nv018856rm+0xba/0x100 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel: ? _nv017137rm+0x1c6/0x230 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel: ? _nv018083rm+0xdc/0x120 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel: ? _nv017880rm+0xe4/0x150 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel: ? _nv017882rm+0x2a6/0x4a0 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel: ? _nv022862rm+0xc66/0x10d0 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel: ? _nv022669rm+0x1b7/0x310 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel: ? _nv033641rm+0x22a/0x2f0 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel: ? _nv033594rm+0x267/0x470 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel: ? _nv033594rm+0x238/0x470 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel: ? _nv033595rm+0x6de/0x880 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel: ? _nv033672rm+0x11d/0x150 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel: ? _nv033674rm+0x49c/0x650 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel: ? _nv033673rm+0x51/0x1c0 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel: ? _nv030987rm+0x1c0/0x1d0 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel: ? rm_run_rc_callback+0x8b/0xe0 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel: ? nv_pci_register_driver+0x20/0x20 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel: ? nvidia_rc_timer_callback+0x6f/0x90 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel: ? call_timer_fn+0x30/0x130
Oct 22 00:09:57 RuRo-Desktop kernel: ? nv_pci_register_driver+0x20/0x20 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel: ? run_timer_softirq+0x40b/0x440
Oct 22 00:09:57 RuRo-Desktop kernel: ? tick_sched_handle+0x23/0x60
Oct 22 00:09:57 RuRo-Desktop kernel: ? tick_sched_timer+0x34/0x70
Oct 22 00:09:57 RuRo-Desktop kernel: ? recalibrate_cpu_khz+0x10/0x10
Oct 22 00:09:57 RuRo-Desktop kernel: ? __do_softirq+0xdf/0x2f7
Oct 22 00:09:57 RuRo-Desktop kernel: ? irq_exit+0xb1/0xc0
Oct 22 00:09:57 RuRo-Desktop kernel: ? smp_apic_timer_interrupt+0x78/0x160
Oct 22 00:09:57 RuRo-Desktop kernel: ? apic_timer_interrupt+0x7d/0x90
Oct 22 00:09:57 RuRo-Desktop kernel: </IRQ>
Oct 22 00:09:57 RuRo-Desktop kernel: ? cpuidle_enter_state+0xb9/0x300
Oct 22 00:09:57 RuRo-Desktop kernel: ? cpuidle_enter_state+0x94/0x300
Oct 22 00:09:57 RuRo-Desktop kernel: ? do_idle+0x1a6/0x1d0
Oct 22 00:09:57 RuRo-Desktop kernel: ? cpu_startup_entry+0x6f/0x80
Oct 22 00:09:57 RuRo-Desktop kernel: ? start_secondary+0x1b5/0x210
Oct 22 00:09:57 RuRo-Desktop kernel: ? secondary_startup_64+0xa5/0xb0
Oct 22 00:09:57 RuRo-Desktop kernel: Code: 31 ff e8 d1 14 00 00 48 89 c7 e8 e9 01 f9 ff 0f b7 c3 5b c3 0f 1f 40 00 53 31 db 39 4a 10 76 0f 48 8b 12 c1 e9 02 89 c8 8b 1c 82 <89> d8 5b c3 31 ff e8 a2 14 00 00 48 89 c7 e8 ba 01 f9 ff 89 d8
The same stack trace then repeats about 20 more times. I SSHed into the machine and tried running nvidia-bug-report.sh, but it froze as well, so I had to kill it.
At first I thought this was a hardware problem, but “Xid 38” is documented as “Driver firmware error” here: https://docs.nvidia.com/deploy/xid-errors/index.html
I get this crash when running a machine learning application with tensorflow-gpu. I tried to reproduce it with the gputest stress tests, but no crash occurred, even at higher GPU usage, power draw and temperature. The tensorflow-gpu application, by contrast, crashes consistently within 10-20 minutes. So maybe the type of load matters?
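In case it helps anyone compare logs, here is a minimal sketch of how the Xid events can be pulled out of journal text and counted per code. The function names (parse_xid_events, summarize) and the XID_NAMES table are my own and hypothetical, not part of any NVIDIA tool; the regular expression assumes the "NVRM: Xid (PCI:...): <code>, <details>" layout seen in the excerpt above, and only the one code description quoted in this post is filled in.

```python
import re
from collections import Counter

# Assumed line layout, taken from the journal excerpt in this post:
#   "... kernel: NVRM: Xid (PCI:0000:01:00): 38, 0008 0000902d ..."
XID_RE = re.compile(
    r"NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<code>\d+),\s*(?P<detail>.*)"
)

# Only the description quoted in this post is listed; anything else is
# reported as "unknown" rather than guessed.
XID_NAMES = {38: "Driver firmware error"}


def parse_xid_events(journal_text):
    """Return a list of (pci_address, xid_code, detail) tuples."""
    events = []
    for line in journal_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append(
                (match.group("pci"), int(match.group("code")), match.group("detail").strip())
            )
    return events


def summarize(events):
    """Count how many times each Xid code occurs."""
    return Counter(code for _pci, code, _detail in events)


# Three Xid lines copied from the journal excerpt in this post.
SAMPLE = """\
Oct 22 00:09:33 RuRo-Desktop kernel: NVRM: Xid (PCI:0000:01:00): 8, Channel 0000003c
Oct 22 00:09:34 RuRo-Desktop kernel: NVRM: Xid (PCI:0000:01:00): 38, 0008 0000902d 00000000 00000000 00000000 00000000
Oct 22 00:09:34 RuRo-Desktop kernel: NVRM: Xid (PCI:0000:01:00): 38, 0008 0000902d 00000000 00000000 00000000 00000000
"""

if __name__ == "__main__":
    counts = summarize(parse_xid_events(SAMPLE))
    for code, count in sorted(counts.items()):
        name = XID_NAMES.get(code, "unknown")
        print(f"Xid {code} ({name}): {count} occurrence(s)")
```

To run it against a real log, SAMPLE could be replaced with saved kernel messages, for example the output of journalctl -k for the boot in which the freeze happened.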
nvidia-bug-report.log.gz (45 KB)