410.66 crash and system freeze under heavy load (Xid 8, Xid 38)

ruro.ruro · October 28, 2018, 12:57pm

$ inxi -b
System:    Host: RuRo-Desktop Kernel: 4.14.78-1-MANJARO x86_64 bits: 64 Desktop: N/A Distro: Manjaro Linux 
Machine:   Type: Desktop System: ASUS product: All Series v: N/A serial: N/A 
           Mobo: ASUSTeK model: MAXIMUS VI FORMULA v: Rev 1.xx serial: 130915507100123 
           BIOS: American Megatrends v: 0714 date: 07/09/2013 
CPU:       Quad Core: Intel Core i7-4770K type: MT MCP speed: 850 MHz min/max: 800/3900 MHz 
Graphics:  Device-1: NVIDIA GP102 [GeForce GTX 1080 Ti] driver: nvidia v: 410.66 
           Display: server: X.Org 1.20.2 driver: nvidia resolution: 1920x1080~60Hz 
           OpenGL: renderer: GeForce GTX 1080 Ti/PCIe/SSE2 v: 4.6.0 NVIDIA 410.66 
Network:   Device-1: Intel Ethernet I217-V driver: e1000e 
           Device-2: Broadcom Limited BCM4352 802.11ac Wireless Network Adapter driver: wl 
Drives:    Local Storage: total: 2.04 TiB used: 103.15 GiB (4.9%) 
Info:      Processes: 231 Uptime: 10m Memory: 15.60 GiB used: 2.29 GiB (14.7%) Shell: zsh inxi: 3.0.26

Sometimes, when the GPU is under heavy load, the system freezes and errors like this can be found in the journal:

Oct 22 00:09:33 RuRo-Desktop kernel: NVRM: GPU at PCI:0000:01:00: GPU-be978d5d-1916-4dde-78ab-6bbd52c29779
Oct 22 00:09:33 RuRo-Desktop kernel: NVRM: GPU Board Serial Number: 
Oct 22 00:09:33 RuRo-Desktop kernel: NVRM: Xid (PCI:0000:01:00): 8, Channel 0000003c
Oct 22 00:09:34 RuRo-Desktop kernel: NVRM: Xid (PCI:0000:01:00): 38, 0008 0000902d 00000000 00000000 00000000 00000000
Oct 22 00:09:34 RuRo-Desktop kernel: NVRM: Xid (PCI:0000:01:00): 38, 0008 0000902d 00000000 00000000 00000000 00000000
Oct 22 00:09:35 RuRo-Desktop kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Oct 22 00:09:57 RuRo-Desktop kernel: watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [swapper/1:0]
Oct 22 00:09:57 RuRo-Desktop kernel: Modules linked in: rfcomm fuse input_leds bnep nct6775 hwmon_vid btusb btrtl btbcm btintel bluetooth intel_rapl razerkbd(O) ecdh_generic x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_realtek kvm_intel snd_hda_codec_generic joydev snd_hda_codec_hdmi mousedev kvm wl(PO) irqbypass crct10dif_pclmul crc32_pclmul eeepc_wmi ghash_clmulni_intel asus_wmi pcbc iTCO_wdt sparse_keymap aesni_intel iTCO_vendor_support led_class evdev aes_x86_64 wmi_bmof mxm_wmi crypto_simd mac_hid glue_helper cryptd snd_hda_intel snd_hda_codec intel_cstate cfg80211 intel_rapl_perf snd_hda_core snd_hwdep pcspkr snd_pcm i2c_i801 snd_timer eeprom rfkill e1000e snd soundcore mei_me lpc_ich mei ptp shpchp pps_core thermal fan video wmi intel_smartconnect pcc_cpufreq button sch_fq_codel uinput coretemp msr pci_stub
Oct 22 00:09:57 RuRo-Desktop kernel:  vboxpci(O) vboxnetflt(O) vboxnetadp(O) vboxdrv(O) sg crypto_user ip_tables x_tables hid_generic usbhid hid ext4 crc32c_generic crc16 mbcache jbd2 fscrypto sd_mod ahci libahci xhci_pci libata ehci_pci xhci_hcd ehci_hcd crc32c_intel scsi_mod usbcore usb_common nvidia_drm(PO) drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm agpgart nvidia_uvm(PO) nvidia_modeset(PO) nvidia(PO) ipmi_devintf ipmi_msghandler
Oct 22 00:09:57 RuRo-Desktop kernel: CPU: 1 PID: 0 Comm: swapper/1 Tainted: P           O    4.14.77-1-MANJARO #1
Oct 22 00:09:57 RuRo-Desktop kernel: Hardware name: ASUS All Series/MAXIMUS VI FORMULA, BIOS 0714 07/09/2013
Oct 22 00:09:57 RuRo-Desktop kernel: task: ffff8dbe4c65e580 task.stack: ffffacd001918000
Oct 22 00:09:57 RuRo-Desktop kernel: RIP: 0010:_nv030757rm+0x13/0x30 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel: RSP: 0018:ffff8dbe5ec43a70 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
Oct 22 00:09:57 RuRo-Desktop kernel: RAX: 0000000000000000 RBX: 00000000132000a1 RCX: 0000000000000000
Oct 22 00:09:57 RuRo-Desktop kernel: RDX: ffffacd009000000 RSI: ffff8dbe48648008 RDI: ffff8dbe4c35c808
Oct 22 00:09:57 RuRo-Desktop kernel: RBP: ffff8dbe47702a18 R08: ffff8dbe47a1cb48 R09: ffff8dbe47702a24
Oct 22 00:09:57 RuRo-Desktop kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffc04caf5e
Oct 22 00:09:57 RuRo-Desktop kernel: R13: ffff8dbe48648f60 R14: 0000000000000000 R15: 0000000000000000
Oct 22 00:09:57 RuRo-Desktop kernel: FS:  0000000000000000(0000) GS:ffff8dbe5ec40000(0000) knlGS:0000000000000000
Oct 22 00:09:57 RuRo-Desktop kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 22 00:09:57 RuRo-Desktop kernel: CR2: 000016e8ea7d8000 CR3: 000000005400a002 CR4: 00000000001606e0
Oct 22 00:09:57 RuRo-Desktop kernel: Call Trace:
Oct 22 00:09:57 RuRo-Desktop kernel:  <IRQ>
Oct 22 00:09:57 RuRo-Desktop kernel:  ? _nv021513rm+0xf8/0x130 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? _nv026844rm+0x54/0x340 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? _nv026842rm+0xfb/0x100 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? _nv019479rm+0x57/0xb0 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? _nv007069rm+0x1bc/0x220 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? _nv019290rm+0x91/0xb0 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? _nv018856rm+0xba/0x100 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? _nv017137rm+0x1c6/0x230 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? _nv018083rm+0xdc/0x120 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? _nv017880rm+0xe4/0x150 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? _nv017882rm+0x2a6/0x4a0 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? _nv022862rm+0xc66/0x10d0 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? _nv022669rm+0x1b7/0x310 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? _nv033641rm+0x22a/0x2f0 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? _nv033594rm+0x267/0x470 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? _nv033594rm+0x238/0x470 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? _nv033595rm+0x6de/0x880 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? _nv033672rm+0x11d/0x150 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? _nv033674rm+0x49c/0x650 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? _nv033673rm+0x51/0x1c0 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? _nv030987rm+0x1c0/0x1d0 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? rm_run_rc_callback+0x8b/0xe0 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? nv_pci_register_driver+0x20/0x20 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? nvidia_rc_timer_callback+0x6f/0x90 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? call_timer_fn+0x30/0x130
Oct 22 00:09:57 RuRo-Desktop kernel:  ? nv_pci_register_driver+0x20/0x20 [nvidia]
Oct 22 00:09:57 RuRo-Desktop kernel:  ? run_timer_softirq+0x40b/0x440
Oct 22 00:09:57 RuRo-Desktop kernel:  ? tick_sched_handle+0x23/0x60
Oct 22 00:09:57 RuRo-Desktop kernel:  ? tick_sched_timer+0x34/0x70
Oct 22 00:09:57 RuRo-Desktop kernel:  ? recalibrate_cpu_khz+0x10/0x10
Oct 22 00:09:57 RuRo-Desktop kernel:  ? __do_softirq+0xdf/0x2f7
Oct 22 00:09:57 RuRo-Desktop kernel:  ? irq_exit+0xb1/0xc0
Oct 22 00:09:57 RuRo-Desktop kernel:  ? smp_apic_timer_interrupt+0x78/0x160
Oct 22 00:09:57 RuRo-Desktop kernel:  ? apic_timer_interrupt+0x7d/0x90
Oct 22 00:09:57 RuRo-Desktop kernel:  </IRQ>
Oct 22 00:09:57 RuRo-Desktop kernel:  ? cpuidle_enter_state+0xb9/0x300
Oct 22 00:09:57 RuRo-Desktop kernel:  ? cpuidle_enter_state+0x94/0x300
Oct 22 00:09:57 RuRo-Desktop kernel:  ? do_idle+0x1a6/0x1d0
Oct 22 00:09:57 RuRo-Desktop kernel:  ? cpu_startup_entry+0x6f/0x80
Oct 22 00:09:57 RuRo-Desktop kernel:  ? start_secondary+0x1b5/0x210
Oct 22 00:09:57 RuRo-Desktop kernel:  ? secondary_startup_64+0xa5/0xb0
Oct 22 00:09:57 RuRo-Desktop kernel: Code: 31 ff e8 d1 14 00 00 48 89 c7 e8 e9 01 f9 ff 0f b7 c3 5b c3 0f 1f 40 00 53 31 db 39 4a 10 76 0f 48 8b 12 c1 e9 02 89 c8 8b 1c 82 <89> d8 5b c3 31 ff e8 a2 14 00 00 48 89 c7 e8 ba 01 f9 ff 89 d8

And then the same stack trace repeated about 20 more times. I sshed into my machine and tried running nvidia-bug-report.sh, but it also froze, so I had to kill it.

At first I thought that this was a hardware bug, but “Xid 38” is documented as “Driver firmware error” here: https://docs.nvidia.com/deploy/xid-errors/index.html
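To see at a glance which Xid codes show up in a journal like the one quoted in this post, a small script can pull them out of the kernel messages. This is only a sketch: the regular expression is inferred from the NVRM lines shown here and may not match other driver versions, and the names `parse_xid_lines` and `count_xids` are made up for illustration.

```python
import re
from collections import Counter

# Pattern inferred from lines such as:
#   "... kernel: NVRM: Xid (PCI:0000:01:00): 38, 0008 0000902d ..."
# It is an assumption that other driver versions log in the same shape.
XID_RE = re.compile(
    r"NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<xid>\d+), (?P<detail>.*)"
)


def parse_xid_lines(lines):
    """Return (pci_address, xid_code, detail) tuples for every Xid line."""
    events = []
    for line in lines:
        match = XID_RE.search(line)
        if match:
            events.append(
                (match.group("pci"), int(match.group("xid")), match.group("detail").strip())
            )
    return events


def count_xids(events):
    """Count how often each Xid code occurs."""
    return Counter(xid for _pci, xid, _detail in events)
```

The input can be any iterable of log lines, for example a saved journal file opened with `open(...)` or `sys.stdin` when the kernel log is piped in. The resulting codes can then be looked up in the Xid documentation linked in this post.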

I get this crash when running a machine learning application with tensorflow-gpu. I tried reproducing it with gputest stress tests, but no crash happened, even with higher GPU usage, power draw and temperature. The tensorflow-gpu application crashes consistently within 10-20 minutes. So maybe the type of load is important?
nvidia-bug-report.log.gz (45 KB)

generix · October 30, 2018, 11:51am

Did you check whether this applies?
https://devtalk.nvidia.com/default/topic/1043126/linux/xid-8-in-various-cuda-deep-learning-applications-for-nvidia-gtx-1080-ti/

ruro.ruro · October 30, 2018, 6:34pm

Hi, I tried adding Option “Interactive” “0” to my xorg.conf, but the system still freezes.
I don’t remember seeing any mentions of “RuntimeError: cuda runtime error (6)” anywhere.
Also, I think Xid 8 is supposed to be a very generic error. That thread doesn’t mention any Xid 38 errors, so this might be an unrelated issue.

Also I updated the drivers to 410.73, which didn’t help. The GPU BIOS and Intel Microcode are up to date as far as I can tell.

Edit: Actually, maybe I was wrong. It turns out my xorg.conf Device settings were being overwritten. I’ll try again with Interactive 0 and accept your answer if nothing breaks for a couple of days.

Edit Edit: Nope. The problem definitely still exists even with Interactive 0.

ruro.ruro · November 11, 2018, 8:59am

I installed Ubuntu 18.04 with nvidia 390.77 and the problem didn’t occur.
Could this be a regression in the newer nvidia 410+ drivers?

generix · November 11, 2018, 2:51pm

Which cuda version are you using?

ruro.ruro · November 11, 2018, 2:55pm

cuda 9.0

ruro.ruro · November 12, 2018, 4:04pm

Update: After figuring out that the problem doesn’t occur on Ubuntu with nvidia 390.77, I downgraded the nvidia drivers to 390.87 on my Manjaro installation and the problem didn’t occur.

It seems to me that this is indeed a regression in the 410+ drivers.
If I have free time I might try installing the 410 drivers on Ubuntu to replicate the crash in a “supported” Linux distro. Apart from the nvidia-bug-report.sh output, is there any other helpful information I should collect?

generix · November 12, 2018, 4:52pm

Yes, looks like some bug that makes cuda 9.0 incompatible with the cuda 10 drivers. nvidia-bug-report.log.gz should contain all the necessary logfiles. The output of deviceQuery might be a useful addition. Maybe also mail it with a description of the bug to linux-bugs@nvidia.com

ruro.ruro · November 14, 2018, 4:58am

Damn, it just crashed on my Manjaro installation, even with older drivers, so it must be something else.

generix · November 14, 2018, 11:53am

Ok, then maybe also consider a hardware fault: run cudamemcheck to test for memory faults and monitor temperatures using nvidia-smi.

ruro.ruro · November 14, 2018, 3:28pm

I’ll try cudamemcheck, but a hardware fault was actually the first thing that came to mind, so the first thing I did was run various stress tests and benchmarks.

Also, I captured various stats with

nvidia-smi daemon

including power usage, temperature, processor and memory clocks, etc. The power usage, GPU-Util and temperature were all within reasonable bounds just before the crash. At least, I’ve seen much higher temperatures and power spikes during the stress tests.
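For anyone who wants to capture the same kind of stats in a form that survives a hard freeze, here is a rough sketch. It is not what was used in this post: it assumes `nvidia-smi --query-gpu` in CSV mode instead of the daemon, the field list is just one possible selection, and the function names are hypothetical. Each sample is synced to disk right away so the last readings before a lockup are not lost in a buffer.

```python
import os
import subprocess

# Fields requested from nvidia-smi; this selection is an assumption,
# chosen to cover temperature, power, utilization and clocks.
FIELDS = [
    "timestamp",
    "temperature.gpu",
    "power.draw",
    "utilization.gpu",
    "clocks.sm",
    "clocks.mem",
]


def build_command(interval_s=1):
    """Build an nvidia-smi command that prints one CSV row every interval_s seconds."""
    return [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
        "-l",
        str(interval_s),
    ]


def parse_sample(line):
    """Turn one CSV row into a dict; values that are not numbers stay strings."""
    sample = {}
    for name, value in zip(FIELDS, (v.strip() for v in line.split(","))):
        try:
            sample[name] = float(value)
        except ValueError:
            sample[name] = value
    return sample


def log_samples(path, interval_s=1):
    """Append raw rows to `path`, forcing each one to disk before reading the next."""
    cmd = build_command(interval_s)
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc, open(path, "a") as out:
        for line in proc.stdout:
            out.write(line)
            out.flush()
            os.fsync(out.fileno())
```

Calling `log_samples("gpu_stats.csv")` in a second terminal while the workload runs would leave a file whose last rows show the state just before the crash; `parse_sample` can then be applied to each row afterwards.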

Also, I was able to reproduce the crash under Ubuntu, but it also killed the sshd server for some reason, so I wasn’t able to run nvidia-bug-report.sh. Also, instead of containing stack traces, the logs are just cut off at the time of the crash in Ubuntu for some reason.

ruro.ruro · November 14, 2018, 8:12pm

Actually, isn’t cuda-memcheck for validating how your program handles memory? Like valgrind? I see there are some hardware debugging capabilities, but it is not clear to me how to use them.

My ML application runs on python/tensorflow. Do I need to compile it in some specific way? I am currently just running

cuda-memcheck python blah.py

But it’s complaining about CUDA_ERROR_NO_BINARY_FOR_GPU.

Also, the program runs extremely slowly (duh) with the memory checker, so the GPU isn’t under a heavy load. Even if this is indeed a hardware fault, I doubt it will crash this way.

If this is a hardware fault, do you know if I can get some kind of official confirmation of it from Nvidia? I would like to exchange my GPU under warranty, but I’ve heard extremely negative things about Aorus rejecting warranty claims. I would imagine it will be easier to convince them if I have some kind of official statement from Nvidia that the GPU is indeed faulty.

generix · November 14, 2018, 9:26pm

I’m sorry, I meant cudamemtest.

ruro.ruro · November 15, 2018, 2:31pm

I ran cudamemtest from here https://github.com/ComputationalRadiationPhysics/cuda_memtest for the last 15 hours and I don’t think any errors were found, no crash either. Here is a slice of the output:

[11/15/2018 01:04:09][ruro-UbuntuHome][0]:Running cuda memtest, version 1.2.3
[11/15/2018 01:04:09][ruro-UbuntuHome][0]:NVRM version: NVIDIA UNIX x86_64 Kernel Module  390.77  Tue Jul 10 18:28:52 PDT 2018
[11/15/2018 01:04:09][ruro-UbuntuHome][0]:num_gpus=1
[11/15/2018 01:04:09][ruro-UbuntuHome][0]:Device name=GeForce GTX 1080 Ti, global memory size=11718230016, serial=unknown (no NVML found)
[11/15/2018 01:04:09][ruro-UbuntuHome][0]:major=6, minor=1
[11/15/2018 01:04:10][ruro-UbuntuHome][0]:Attached to device 0 successfully.
[11/15/2018 01:04:10][ruro-UbuntuHome][0]:Allocated 10657 MB
[11/15/2018 01:04:10][ruro-UbuntuHome][0]:Test0 [Walking 1 bit]
[11/15/2018 01:04:14][ruro-UbuntuHome][0]:Test0 finished in 4.5 seconds
[11/15/2018 01:04:14][ruro-UbuntuHome][0]:Test1 [Own address test]
[11/15/2018 01:04:16][ruro-UbuntuHome][0]:Test1 finished in 2.0 seconds
[11/15/2018 01:04:16][ruro-UbuntuHome][0]:Test2 [Moving inversions, ones&zeros]
[11/15/2018 01:04:30][ruro-UbuntuHome][0]:Test2 finished in 14.2 seconds
[11/15/2018 01:04:30][ruro-UbuntuHome][0]:Test3 [Moving inversions, 8 bit pat]
[11/15/2018 01:04:44][ruro-UbuntuHome][0]:Test3 finished in 14.2 seconds
[11/15/2018 01:04:44][ruro-UbuntuHome][0]:Test4 [Moving inversions, random pattern]
[11/15/2018 01:04:52][ruro-UbuntuHome][0]:Test4 finished in 7.1 seconds
[11/15/2018 01:04:52][ruro-UbuntuHome][0]:Test5 [Block move, 64 moves]
[11/15/2018 01:04:54][ruro-UbuntuHome][0]:Test5 finished in 3.0 seconds
[11/15/2018 01:04:54][ruro-UbuntuHome][0]:Test6 [Moving inversions, 32 bit pat]
[11/15/2018 01:12:40][ruro-UbuntuHome][0]:Test6 finished in 465.9 seconds
[11/15/2018 01:12:40][ruro-UbuntuHome][0]:Test7 [Random number sequence]
[11/15/2018 01:12:51][ruro-UbuntuHome][0]:Test7 finished in 10.2 seconds
[11/15/2018 01:12:51][ruro-UbuntuHome][0]:Test8 [Modulo 20, random pattern]
[11/15/2018 01:12:51][ruro-UbuntuHome][0]:test8[mod test]: p1=0x46ca2519, p2=0xb935dae6
[11/15/2018 01:13:04][ruro-UbuntuHome][0]:Test8 finished in 13.9 seconds
[11/15/2018 01:13:04][ruro-UbuntuHome][0]:Test10 [Memory stress test]
[11/15/2018 01:13:04][ruro-UbuntuHome][0]:Test10 with pattern=0x37e32f7c71559f31
[11/15/2018 01:13:26][ruro-UbuntuHome][0]:Test10 finished in 21.5 seconds

The rest of the output is just more repetitions of Test0-10 messages, no errors or anything.
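Checking 15 hours of output like this by eye is tedious, so here is a rough sketch of how it could be summarized. The "finished" pattern is taken from the lines quoted in this post; treating any line that contains the word "error" as suspicious is only a guess at how cuda_memtest reports failures, and the function name is made up.

```python
import re

# Pattern taken from lines such as:
#   "[11/15/2018 01:04:14][ruro-UbuntuHome][0]:Test0 finished in 4.5 seconds"
FINISHED_RE = re.compile(r"\]:(Test\d+) finished in ([0-9.]+) seconds")

# Assumption: failures mention the word "error" somewhere in the line.
ERROR_RE = re.compile(r"error", re.IGNORECASE)


def summarize_memtest(lines):
    """Return (durations per test name, list of suspicious lines)."""
    durations = {}
    suspicious = []
    for line in lines:
        match = FINISHED_RE.search(line)
        if match:
            durations.setdefault(match.group(1), []).append(float(match.group(2)))
        elif ERROR_RE.search(line):
            suspicious.append(line.rstrip())
    return durations, suspicious
```

Given the saved output as a file, `summarize_memtest(open("memtest.log"))` would show how many passes of each test completed and list any lines worth a closer look; an empty second list is consistent with "no errors", but it does not prove it if the tool words its failures differently.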
