410.66 crash and system freeze under heavy load (Xid 8, Xid 38)

$ inxi -b
System:    Host: RuRo-Desktop Kernel: 4.14.78-1-MANJARO x86_64 bits: 64 Desktop: N/A Distro: Manjaro Linux 
Machine:   Type: Desktop System: ASUS product: All Series v: N/A serial: N/A 
           Mobo: ASUSTeK model: MAXIMUS VI FORMULA v: Rev 1.xx serial: 130915507100123 
           BIOS: American Megatrends v: 0714 date: 07/09/2013 
CPU:       Quad Core: Intel Core i7-4770K type: MT MCP speed: 850 MHz min/max: 800/3900 MHz 
Graphics:  Device-1: NVIDIA GP102 [GeForce GTX 1080 Ti] driver: nvidia v: 410.66 
           Display: server: X.Org 1.20.2 driver: nvidia resolution: 1920x1080~60Hz 
           OpenGL: renderer: GeForce GTX 1080 Ti/PCIe/SSE2 v: 4.6.0 NVIDIA 410.66 
Network:   Device-1: Intel Ethernet I217-V driver: e1000e 
           Device-2: Broadcom Limited BCM4352 802.11ac Wireless Network Adapter driver: wl 
Drives:    Local Storage: total: 2.04 TiB used: 103.15 GiB (4.9%) 
Info:      Processes: 231 Uptime: 10m Memory: 15.60 GiB used: 2.29 GiB (14.7%) Shell: zsh inxi: 3.0.26

Sometimes, when the GPU is under heavy load, the system freezes and errors like this can be found in the journal:

Oct 22 00:09:33 RuRo-Desktop kernel: NVRM: GPU at PCI:0000:01:00: GPU-be978d5d-1916-4dde-78ab-6bbd52c29779
Oct 22 00:09:33 RuRo-Desktop kernel: NVRM: GPU Board Serial Number: 
Oct 22 00:09:33 RuRo-Desktop kernel: NVRM: Xid (PCI:0000:01:00): 8, Channel 0000003c
Oct 22 00:09:34 RuRo-Desktop kernel: NVRM: Xid (PCI:0000:01:00): 38, 0008 0000902d 00000000 00000000 00000000 00000000
Oct 22 00:09:34 RuRo-Desktop kernel: NVRM: Xid (PCI:0000:01:00): 38, 0008 0000902d 00000000 00000000 00000000 00000000
Oct 22 00:09:35 RuRo-Desktop kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Oct 22 00:09:57 RuRo-Desktop kernel: watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [swapper/1:0]
Oct 22 00:09:57 RuRo-Desktop kernel: Modules linked in: rfcomm fuse input_leds bnep nct6775 hwmon_vid btusb btrtl btbcm btintel bluetooth intel_rapl razerkbd(O) ecdh_generic x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_realtek kvm_intel snd_hda_codec_generic joydev snd_hda_codec_hdmi mousedev kvm wl(PO) irqbypass crct10dif_pclmul crc32_pclmul eeepc_wmi ghash_clmulni_intel asus_wmi pcbc iTCO_wdt sparse_keymap aesni_intel iTCO_vendor_support led_class evdev aes_x86_64 wmi_bmof mxm_wmi crypto_simd mac_hid glue_helper cryptd snd_hda_intel snd_hda_codec intel_cstate cfg80211 intel_rapl_perf snd_hda_core snd_hwdep pcspkr snd_pcm i2c_i801 snd_timer eeprom rfkill e1000e snd soundcore mei_me lpc_ich mei ptp shpchp pps_core thermal fan video wmi intel_smartconnect pcc_cpufreq button sch_fq_codel uinput coretemp msr pci_stub
Oct 22 00:09:57 RuRo-Desktop kernel:  vboxpci(O) vboxnetflt(O) vboxnetadp(O) vboxdrv(O) sg crypto_user ip_tables x_tables hid_generic usbhid hid ext4 crc32c_generic crc16 mbcache jbd2 fscrypto sd_mod ahci libahci xhci_pci libata ehci_pci xhci_hcd ehci_hcd crc32c_intel scsi_mod usbcore usb_common nvidia_drm(PO) drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm agpgart nvidia_uvm(PO) nvidia_modeset(PO) nvidia(PO) ipmi_devintf ipmi_msghandler
Oct 22 00:09:57 RuRo-Desktop kernel: CPU: 1 PID: 0 Comm: swapper/1 Tainted: P           O    4.14.77-1-MANJARO #1
Oct 22 00:09:57 RuRo-Desktop kernel: Hardware name: ASUS All Series/MAXIMUS VI FORMULA, BIOS 0714 07/09/2013
Oct 22 00:09:57 RuRo-Desktop kernel: task: ffff8dbe4c65e580 task.stack: ffffacd001918000
Oct 22 00:09:57 RuRo-Desktop kernel: RIP: 0010:_nv030757rm+0x13/0x30 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel: RSP: 0018:ffff8dbe5ec43a70 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
Oct 22 00:09:57 RuRo-Desktop kernel: RAX: 0000000000000000 RBX: 00000000132000a1 RCX: 0000000000000000
Oct 22 00:09:57 RuRo-Desktop kernel: RDX: ffffacd009000000 RSI: ffff8dbe48648008 RDI: ffff8dbe4c35c808
Oct 22 00:09:57 RuRo-Desktop kernel: RBP: ffff8dbe47702a18 R08: ffff8dbe47a1cb48 R09: ffff8dbe47702a24
Oct 22 00:09:57 RuRo-Desktop kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffc04caf5e
Oct 22 00:09:57 RuRo-Desktop kernel: R13: ffff8dbe48648f60 R14: 0000000000000000 R15: 0000000000000000
Oct 22 00:09:57 RuRo-Desktop kernel: FS:  0000000000000000(0000) GS:ffff8dbe5ec40000(0000) knlGS:0000000000000000
Oct 22 00:09:57 RuRo-Desktop kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 22 00:09:57 RuRo-Desktop kernel: CR2: 000016e8ea7d8000 CR3: 000000005400a002 CR4: 00000000001606e0
Oct 22 00:09:57 RuRo-Desktop kernel: Call Trace:
Oct 22 00:09:57 RuRo-Desktop kernel:  <IRQ>
Oct 22 00:09:57 RuRo-Desktop kernel:  ? _nv021513rm+0xf8/0x130 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? _nv026844rm+0x54/0x340 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? _nv026842rm+0xfb/0x100 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? _nv019479rm+0x57/0xb0 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? _nv007069rm+0x1bc/0x220 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? _nv019290rm+0x91/0xb0 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? _nv018856rm+0xba/0x100 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? _nv017137rm+0x1c6/0x230 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? _nv018083rm+0xdc/0x120 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? _nv017880rm+0xe4/0x150 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? _nv017882rm+0x2a6/0x4a0 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? _nv022862rm+0xc66/0x10d0 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? _nv022669rm+0x1b7/0x310 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? _nv033641rm+0x22a/0x2f0 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? _nv033594rm+0x267/0x470 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? _nv033594rm+0x238/0x470 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? _nv033595rm+0x6de/0x880 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? _nv033672rm+0x11d/0x150 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? _nv033674rm+0x49c/0x650 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? _nv033673rm+0x51/0x1c0 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? _nv030987rm+0x1c0/0x1d0 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? rm_run_rc_callback+0x8b/0xe0 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? nv_pci_register_driver+0x20/0x20 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? nvidia_rc_timer_callback+0x6f/0x90 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? call_timer_fn+0x30/0x130
Oct 22 00:09:57 RuRo-Desktop kernel:  ? nv_pci_register_driver+0x20/0x20 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? run_timer_softirq+0x40b/0x440
Oct 22 00:09:57 RuRo-Desktop kernel:  ? tick_sched_handle+0x23/0x60
Oct 22 00:09:57 RuRo-Desktop kernel:  ? tick_sched_timer+0x34/0x70
Oct 22 00:09:57 RuRo-Desktop kernel:  ? recalibrate_cpu_khz+0x10/0x10
Oct 22 00:09:57 RuRo-Desktop kernel:  ? __do_softirq+0xdf/0x2f7
Oct 22 00:09:57 RuRo-Desktop kernel:  ? irq_exit+0xb1/0xc0
Oct 22 00:09:57 RuRo-Desktop kernel:  ? smp_apic_timer_interrupt+0x78/0x160
Oct 22 00:09:57 RuRo-Desktop kernel:  ? apic_timer_interrupt+0x7d/0x90
Oct 22 00:09:57 RuRo-Desktop kernel:  </IRQ>
Oct 22 00:09:57 RuRo-Desktop kernel:  ? cpuidle_enter_state+0xb9/0x300
Oct 22 00:09:57 RuRo-Desktop kernel:  ? cpuidle_enter_state+0x94/0x300
Oct 22 00:09:57 RuRo-Desktop kernel:  ? do_idle+0x1a6/0x1d0
Oct 22 00:09:57 RuRo-Desktop kernel:  ? cpu_startup_entry+0x6f/0x80
Oct 22 00:09:57 RuRo-Desktop kernel:  ? start_secondary+0x1b5/0x210
Oct 22 00:09:57 RuRo-Desktop kernel:  ? secondary_startup_64+0xa5/0xb0
Oct 22 00:09:57 RuRo-Desktop kernel: Code: 31 ff e8 d1 14 00 00 48 89 c7 e8 e9 01 f9 ff 0f b7 c3 5b c3 0f 1f 40 00 53 31 db 39 4a 10 76 0f 48 8b 12 c1 e9 02 89 c8 8b 1c 82 <89> d8 5b c3 31 ff e8 a2 14 00 00 48 89 c7 e8 ba 01 f9 ff 89 d8

The same stack trace was then repeated about 20 more times. I SSHed into my machine and tried running nvidia-bug-report.sh, but it also froze, so I had to kill it.

At first I thought this was a hardware bug, but “Xid 38” is documented as “Driver firmware error” here: https://docs.nvidia.com/deploy/xid-errors/index.html
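
For anyone comparing their own logs against that table, here is a minimal sketch of how the Xid codes could be tallied from kernel log lines. It assumes the lines look like the NVRM messages quoted above; the names XID_RE and count_xids are made up for illustration and are not part of any NVIDIA tool.

```python
import re
from collections import Counter

# Matches kernel log lines such as:
#   kernel: NVRM: Xid (PCI:0000:01:00): 38, 0008 0000902d ...
# The pattern is an assumption based on the log excerpt in this thread.
XID_RE = re.compile(r"NVRM: Xid \((?P<bus>PCI:[0-9a-fA-F:.]+)\): (?P<code>\d+),")


def count_xids(lines):
    """Return a Counter mapping (PCI bus id, Xid code) to occurrence count."""
    counts = Counter()
    for line in lines:
        match = XID_RE.search(line)
        if match:
            counts[(match.group("bus"), int(match.group("code")))] += 1
    return counts
```

Feeding it the lines of a saved kernel log (for example the output of journalctl -k written to a file) gives a quick overview of which Xid codes show up and how often, which makes it easier to tell whether Xid 38 always accompanies Xid 8.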

I get this crash when running a machine learning application with tensorflow-gpu. I tried reproducing it with gputest stress tests, but even with higher GPU usage, power draw and temperature, no crash happened. The tensorflow-gpu application crashes consistently within 10-20 minutes. So maybe the type of load is important?
nvidia-bug-report.log.gz (45 KB)

Did you check whether this applies?
https://devtalk.nvidia.com/default/topic/1043126/linux/xid-8-in-various-cuda-deep-learning-applications-for-nvidia-gtx-1080-ti/

Hi, I tried adding Option “Interactive” “0” to my xorg.conf, but the system still freezes.
I don’t remember seeing any mentions of “RuntimeError: cuda runtime error (6)” anywhere.
Also, I think Xid 8 is supposed to be a very generic error. They don’t mention having Xid 38 errors, so this might be an unrelated issue.

I also updated the drivers to 410.73, which didn’t help. The GPU BIOS and Intel microcode are up to date as far as I can tell.

Edit: Actually, maybe I was wrong. It turns out my xorg.conf Device settings were being overwritten. I’ll try again with Interactive 0 and accept your answer if nothing breaks for a couple of days.

Edit Edit: Nope. The problem definitely still exists even with Interactive 0.

I installed Ubuntu 18.04 with nvidia 390.77 and the problem didn’t occur.
Could this be a regression in the newer nvidia 410+ drivers?

Which CUDA version are you using?

cuda 9.0

Update: After figuring out that the problem doesn’t occur on Ubuntu with nvidia 390.77, I downgraded the nvidia drivers to 390.87 on my Manjaro installation and the problem didn’t occur.

It seems to me that this is indeed a regression in the 410+ drivers.
If I have free time, I might try installing the 410 drivers on Ubuntu to replicate the crash in a “supported” Linux distro. Apart from nvidia-bug-report.sh, is there any more helpful information I should collect?
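
Since several driver versions are in play here (410.66, 410.73, 390.77, 390.87), it may help to record which kernel module was actually loaded at the time of each crash. The following is only a sketch: it assumes the version string has the same shape as the “NVRM version” line that the driver reports, and that it can be read from /proc/driver/nvidia/version. The function names are hypothetical.

```python
import re
from pathlib import Path

# Assumed location of the loaded kernel module's version string.
VERSION_FILE = Path("/proc/driver/nvidia/version")

# Expects text like:
#   NVRM version: NVIDIA UNIX x86_64 Kernel Module  390.77  Tue Jul 10 ...
NVRM_RE = re.compile(r"NVRM version:.*?Kernel Module\s+(\d+(?:\.\d+)+)")


def parse_driver_version(text):
    """Return the driver version string found in text, or None."""
    match = NVRM_RE.search(text)
    return match.group(1) if match else None


def loaded_driver_version():
    """Return the version of the currently loaded nvidia module, or None."""
    try:
        return parse_driver_version(VERSION_FILE.read_text())
    except OSError:
        # File is missing or unreadable, e.g. the module is not loaded.
        return None
```

Logging the result of loaded_driver_version() at the start of each training run would make it clear afterwards which driver a given crash belongs to, instead of relying on what the package manager claims is installed.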

Yes, looks like some bug that makes CUDA 9.0 incompatible with the CUDA 10 drivers. nvidia-bug-report.log.gz should contain all the necessary log files. The output of deviceQuery might be a useful addition. Maybe also mail it with a description of the bug to linux-bugs@nvidia.com

Damn, it just crashed on my Manjaro installation, even with older drivers, so it must be something else.

Ok, then maybe also consider a hardware fault: run cudamemcheck to test for memory faults and monitor temperatures using nvidia-smi.

I’ll try cudamemcheck, but a hardware fault was actually the first thing that came to mind, so the first thing I did was run various stress tests and benchmarks.

Also, I captured various stats with

nvidia-smi daemon

including power usage, temperature, processor and memory clocks, etc. The power usage, GPU-Util and temperature were all within reasonable bounds just before the crash. At least, I’ve seen much higher temperatures and power spikes during the stress tests.
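
For reference, a similar capture can be sketched in Python. Note that this is not the nvidia-smi daemon mode I used above; it swaps in the nvidia-smi --query-gpu interface instead, and the choice of fields, the function names and the log path are my own assumptions. Each line is synced to disk right away so that the last samples before a hard freeze are less likely to be lost.

```python
import os
import subprocess

# Fields requested from nvidia-smi; an assumed selection, adjust as needed.
FIELDS = ["timestamp", "temperature.gpu", "power.draw",
          "utilization.gpu", "clocks.sm", "clocks.mem"]


def parse_sample(line):
    """Split one CSV line from nvidia-smi into a dict keyed by field name."""
    values = [v.strip() for v in line.split(",")]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


def log_samples(path, interval_s=1):
    """Append one sample per interval to path, syncing after every line."""
    cmd = ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS),
           "--format=csv,noheader,nounits", "-l", str(interval_s)]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc, \
            open(path, "a") as out:
        for line in proc.stdout:
            out.write(line)
            out.flush()
            os.fsync(out.fileno())  # force the sample onto disk immediately
```

Calling log_samples("gpu_stats.csv") in a separate terminal before starting the training job, and then running parse_sample over the last few lines after a reboot, shows what the card reported right before the freeze.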

I was also able to reproduce the crash under Ubuntu, but it killed the sshd server as well, so I wasn’t able to run nvidia-bug-report.sh. And instead of containing stack traces, the Ubuntu logs are just cut off at the time of the crash for some reason.

Actually, isn’t cuda-memcheck for validating how your program handles memory? Like valgrind? I see there are some hardware debugging capabilities, but it is not clear to me how to use them.

My ML application runs on Python/TensorFlow. Do I need to compile it in some specific way? I am currently just running

cuda-memcheck python blah.py

But it’s complaining about CUDA_ERROR_NO_BINARY_FOR_GPU.

Also, the program runs extremely slowly (duh) with the memory checker, so the GPU isn’t under heavy load. If this is indeed a hardware fault, I doubt it will crash this way.

If this is a hardware fault, do you know if I can get some kind of official confirmation of it from Nvidia? I would like to exchange my GPU under warranty, but I’ve heard extremely negative things about Aorus rejecting warranty claims. I would imagine it will be easier to convince them if I have some kind of official statement from Nvidia that the GPU is indeed faulty.

I’m sorry, I meant cudamemtest.

I ran cudamemtest from here https://github.com/ComputationalRadiationPhysics/cuda_memtest for the last 15 hours. I don’t think any errors were found, and there was no crash either. Here is a slice of the output:

[11/15/2018 01:04:09][ruro-UbuntuHome][0]:Running cuda memtest, version 1.2.3
[11/15/2018 01:04:09][ruro-UbuntuHome][0]:NVRM version: NVIDIA UNIX x86_64 Kernel Module  390.77  Tue Jul 10 18:28:52 PDT 2018
[11/15/2018 01:04:09][ruro-UbuntuHome][0]:num_gpus=1
[11/15/2018 01:04:09][ruro-UbuntuHome][0]:Device name=GeForce GTX 1080 Ti, global memory size=11718230016, serial=unknown (no NVML found)
[11/15/2018 01:04:09][ruro-UbuntuHome][0]:major=6, minor=1
[11/15/2018 01:04:10][ruro-UbuntuHome][0]:Attached to device 0 successfully.
[11/15/2018 01:04:10][ruro-UbuntuHome][0]:Allocated 10657 MB
[11/15/2018 01:04:10][ruro-UbuntuHome][0]:Test0 [Walking 1 bit]
[11/15/2018 01:04:14][ruro-UbuntuHome][0]:Test0 finished in 4.5 seconds
[11/15/2018 01:04:14][ruro-UbuntuHome][0]:Test1 [Own address test]
[11/15/2018 01:04:16][ruro-UbuntuHome][0]:Test1 finished in 2.0 seconds
[11/15/2018 01:04:16][ruro-UbuntuHome][0]:Test2 [Moving inversions, ones&zeros]
[11/15/2018 01:04:30][ruro-UbuntuHome][0]:Test2 finished in 14.2 seconds
[11/15/2018 01:04:30][ruro-UbuntuHome][0]:Test3 [Moving inversions, 8 bit pat]
[11/15/2018 01:04:44][ruro-UbuntuHome][0]:Test3 finished in 14.2 seconds
[11/15/2018 01:04:44][ruro-UbuntuHome][0]:Test4 [Moving inversions, random pattern]
[11/15/2018 01:04:52][ruro-UbuntuHome][0]:Test4 finished in 7.1 seconds
[11/15/2018 01:04:52][ruro-UbuntuHome][0]:Test5 [Block move, 64 moves]
[11/15/2018 01:04:54][ruro-UbuntuHome][0]:Test5 finished in 3.0 seconds
[11/15/2018 01:04:54][ruro-UbuntuHome][0]:Test6 [Moving inversions, 32 bit pat]
[11/15/2018 01:12:40][ruro-UbuntuHome][0]:Test6 finished in 465.9 seconds
[11/15/2018 01:12:40][ruro-UbuntuHome][0]:Test7 [Random number sequence]
[11/15/2018 01:12:51][ruro-UbuntuHome][0]:Test7 finished in 10.2 seconds
[11/15/2018 01:12:51][ruro-UbuntuHome][0]:Test8 [Modulo 20, random pattern]
[11/15/2018 01:12:51][ruro-UbuntuHome][0]:test8[mod test]: p1=0x46ca2519, p2=0xb935dae6
[11/15/2018 01:13:04][ruro-UbuntuHome][0]:Test8 finished in 13.9 seconds
[11/15/2018 01:13:04][ruro-UbuntuHome][0]:Test10 [Memory stress test]
[11/15/2018 01:13:04][ruro-UbuntuHome][0]:Test10 with pattern=0x37e32f7c71559f31
[11/15/2018 01:13:26][ruro-UbuntuHome][0]:Test10 finished in 21.5 seconds

The rest of the output is just more repetitions of Test0-10 messages, no errors or anything.
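
Because 15 hours of output is too much to read by eye, here is a rough sketch of how a log like this could be summarized. It only relies on the two line shapes visible in the slice above (“TestN [name]” and “TestN finished in X seconds”). I have not seen what an error line looks like, so the sketch does not try to detect errors; it only reports tests that started without finishing. The name summarize is hypothetical.

```python
import re

# Line shapes taken from the output slice above.
START_RE = re.compile(r":Test(\d+) \[(.+)\]\s*$")
DONE_RE = re.compile(r":Test(\d+) finished in ([\d.]+) seconds")


def summarize(lines):
    """Return (runs, unfinished).

    runs maps a test number to the list of its durations in seconds;
    unfinished is the set of test numbers started more often than finished.
    """
    started = {}
    runs = {}
    for line in lines:
        done = DONE_RE.search(line)
        if done:
            runs.setdefault(int(done.group(1)), []).append(float(done.group(2)))
            continue
        start = START_RE.search(line)
        if start:
            number = int(start.group(1))
            started[number] = started.get(number, 0) + 1
    unfinished = {t for t, n in started.items() if n > len(runs.get(t, []))}
    return runs, unfinished
```

If unfinished comes back empty and the durations for each test stay roughly constant across repetitions, that at least confirms every pass ran to completion. A test that is still running when the log is captured will show up as unfinished, so that case needs to be read with care.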