fike
December 17, 2015, 5:41pm
1
A CentOS 6.7 system with several K40c cards, which had been working consistently for quite some time, is now failing. Any attempt to use NVIDIA/CUDA apps crashes the system and causes a reboot; simply running nvidia-smi or nvidia-bug-report.sh is enough to trigger it. The last and only thing written to messages is:
kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
I have tried several standard kernels with several recent driver versions and the issue remains.
I have seen this error in searches with a variety of explanations, one being faulty hardware.
Would anyone have any suggestions on finding the cause of this issue and perhaps a way to resolve it?
Thanks,
fike
December 17, 2015, 9:11pm
2
Isolated and removed bad GPU, issue resolved.
Hi fike (NVIDIA),
Could you please elaborate on how you isolated the faulty hardware? I have 2x GTX 980 using driver 352.55 on Linux 4.1.1. If I am to file a support request to replace hardware, I will need NVIDIA to confirm that there is a hardware fault, so that I can tell the vendor.
NVIDIA, can you please provide a debugging version of nvidia_uvm? nvidia_uvm threw the fault, on CPU 0 (of 31).
Regards.
PAH.
Here is the dmesg of the crash (with 32 cpus it can take a while to crash…).
[610034.645033] NVRM: GPU at PCI:0000:03:00: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
[610034.645066] NVRM: GPU Board Serial Number: xxxxxxxxxxxxx
[610034.645082] NVRM: Xid (PCI:0000:03:00): 8, Channel 00000033
[610036.654963] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[610046.847653] INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 8, t=5253 jiffies, g=36605876, c=36605875, q=6891)
[610046.859330] Task dump for CPU 0:
[610046.859333] swapper/0 R running task 0 0 0 0x00000008
[610046.859337] 0000000000000010 0000000000000246 ffffffff81803e98 0000000000000018
[610046.859340] ffffffff8142c118 0000000000000000 0000000000000000 0000000000000092
[610046.859342] ffffffff818f1e80 ffff88101fd77200 ffffffffa08c9090 ffffffff81800000
[610046.859345] Call Trace:
[610046.859355] [<ffffffff8142c118>] ? cpuidle_enter_state+0x78/0x1f0
[610046.859376] [<ffffffff810ad04d>] ? cpu_startup_entry+0x33d/0x3d0
[610046.859381] [<ffffffff8191708c>] ? start_kernel+0x488/0x493
[610046.859383] [<ffffffff81916a0f>] ? set_init_arg+0x4e/0x4e
[610046.859385] [<ffffffff81916120>] ? early_idt_handler_array+0x120/0x120
[610046.859387] [<ffffffff81916120>] ? early_idt_handler_array+0x120/0x120
[610046.859390] [<ffffffff8191671d>] ? x86_64_start_kernel+0x149/0x158
[610051.751706] ------------[ cut here ]------------
[610051.751724] WARNING: CPU: 0 PID: 0 at kernel/watchdog.c:304 watchdog_overflow_callback+0x9a/0xd0()
[610051.751728] Watchdog detected hard LOCKUP on cpu 0
[610051.751731] Modules linked in:
[610051.751735] nvidia_uvm(PO) rfcomm hid_magicmouse hidp nvidia(PO) nfsv3 binfmt_misc cpufreq_stats cpufreq_powersave cpufreq_userspace cpufreq_conservative bnep tun nfc 8021q garp mrp stp llc nfsd nfs_acl rpcsec_gss_krb5 auth_rpcgss oid_registry nfsv4 dns_resolver nfs lockd grace sunrpc fscache snd_hda_codec_hdmi snd_hda_intel snd_hda_controller snd_hda_codec snd_hda_core snd_hwdep sp5100_tco btusb btbcm btintel kvm_amd bluetooth evdev joydev kvm psmouse snd_pcm mgag200 snd_timer snd serio_raw amd64_edac_mod rfkill soundcore edac_mce_amd ttm pcspkr drm_kms_helper edac_core fam15h_power k10temp i2c_piix4 sg 8250_fintek acpi_cpufreq tpm_tis tpm shpchp button processor thermal_sys md_mod igb ptp pps_core dca i2c_algo_bit loop fuse parport_pc ppdev lp parport ext4 crc16 mbcache jbd2 btrfs xor raid6_pq
[610051.751832] usb_storage hid_generic usbhid hid ata_generic sr_mod cdrom sd_mod ohci_pci crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 ahci lrw gf128mul glue_helper ablk_helper libahci pata_atiixp mpt3sas cryptd raid_class ohci_hcd ehci_pci scsi_transport_sas libata ehci_hcd drm usbcore scsi_mod usb_common i2c_core zfs(PO) zunicode(PO) zcommon(PO) znvpair(PO) spl(O) zavl(PO) ipmi_watchdog dm_mirror dm_region_hash dm_log dm_mod ipmi_si ipmi_poweroff ipmi_devintf ipmi_msghandler autofs4 [last unloaded: nvidia]
[610051.751893] CPU: 0 PID: 0 Comm: swapper/0 Tainted: P O 4.1.1 #1
[610051.751897] Hardware name: Supermicro H8DG6/H8DGi/H8DG6/H8DGi, BIOS 3.5 11/25/2013
[610051.751900] 0000000000000000 ffffffff81722646 ffffffff815589ac ffff881027c05b90
[610051.751906] ffffffff8106f241 ffff881021743000 0000000000000000 ffff881027c05c80
[610051.751912] 0000000000000000 ffff881027c05ef8 ffffffff8106f2ba ffffffff81719470
[610051.751917] Call Trace:
[610051.751920] <NMI> [<ffffffff815589ac>] ? dump_stack+0x40/0x50
[610051.751934] [<ffffffff8106f241>] ? warn_slowpath_common+0x81/0xb0
[610051.751939] [<ffffffff8106f2ba>] ? warn_slowpath_fmt+0x4a/0x50
[610051.751946] [<ffffffff8110ea4a>] ? watchdog_overflow_callback+0x9a/0xd0
[610051.751953] [<ffffffff8114b2f6>] ? __perf_event_overflow+0x86/0x220
[610051.751959] [<ffffffff8102a747>] ? x86_perf_event_set_period+0xd7/0x160
[610051.751964] [<ffffffff8102ad15>] ? x86_pmu_handle_irq+0x125/0x170
[610051.751971] [<ffffffff812cfa61>] ? ioremap_page_range+0x281/0x3f0
[610051.751979] [<ffffffff8118ea63>] ? vunmap_page_range+0x1c3/0x2d0
[610051.751987] [<ffffffff81379031>] ? ghes_copy_tofrom_phys+0x121/0x200
[610051.751993] [<ffffffff81379181>] ? ghes_read_estatus+0x71/0x150
[610051.751999] [<ffffffff8102931a>] ? perf_event_nmi_handler+0x2a/0x50
[610051.752003] [<ffffffff81017cce>] ? nmi_handle+0x8e/0x120
[610051.752008] [<ffffffff81018200>] ? default_do_nmi+0x40/0x110
[610051.752012] [<ffffffff81018350>] ? do_nmi+0x80/0xc0
[610051.752017] [<ffffffff81560aef>] ? end_repeat_nmi+0x1e/0x2e
[610051.752037] [<ffffffffa04ff392>] ? io_watchdog_func+0x102/0x3a0 [ohci_hcd]
[610051.752053] [<ffffffffa04ff392>] ? io_watchdog_func+0x102/0x3a0 [ohci_hcd]
[610051.752068] [<ffffffffa04ff392>] ? io_watchdog_func+0x102/0x3a0 [ohci_hcd]
[610051.752070] <<EOE>> <IRQ> [<ffffffffa04ff290>] ? ohci_dump+0x80/0x80 [ohci_hcd]
[610051.752092] [<ffffffff810cf9b0>] ? call_timer_fn+0x30/0x100
[610051.752107] [<ffffffffa04ff290>] ? ohci_dump+0x80/0x80 [ohci_hcd]
[610051.752111] [<ffffffff810d1349>] ? run_timer_softirq+0x209/0x2f0
[610051.752117] [<ffffffff81073450>] ? __do_softirq+0xe0/0x260
[610051.752122] [<ffffffff81073825>] ? irq_exit+0x95/0xa0
[610051.752128] [<ffffffff8156156e>] ? smp_apic_timer_interrupt+0x3e/0x50
[610051.752133] [<ffffffff8155f67e>] ? apic_timer_interrupt+0x6e/0x80
[610051.752136] <EOI> [<ffffffff8142c145>] ? cpuidle_enter_state+0xa5/0x1f0
[610051.752150] [<ffffffff8142c118>] ? cpuidle_enter_state+0x78/0x1f0
[610051.752159] [<ffffffff810ad04d>] ? cpu_startup_entry+0x33d/0x3d0
[610051.752165] [<ffffffff8191708c>] ? start_kernel+0x488/0x493
[610051.752170] [<ffffffff81916a0f>] ? set_init_arg+0x4e/0x4e
[610051.752175] [<ffffffff81916120>] ? early_idt_handler_array+0x120/0x120
[610051.752179] [<ffffffff81916120>] ? early_idt_handler_array+0x120/0x120
[610051.752184] [<ffffffff8191671d>] ? x86_64_start_kernel+0x149/0x158
[610051.752188] ---[ end trace ed7ebbe900e075aa ]---
I also get the following error:
Oct 25 00:10:58 dmack kernel: [ 1407.852012] NVRM: Xid (PCI:0000:01:00): 8, Channel 00000000
Oct 25 00:10:58 dmack kernel: [ 1409.853850] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
I’ve tried drivers 352, 355, and 358, and they all produce the same error.
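When several GPUs are installed, it can help to group the Xid lines by PCI address to see which device the driver is reporting on. The following is only a rough sketch: the helper name `xids_by_device` is made up for illustration, the regular expression covers just the line shape seen in this thread, and a reported Xid does not by itself prove a hardware fault.

```python
import re
from collections import defaultdict

# Matches kernel log lines such as:
#   NVRM: Xid (PCI:0000:03:00): 8, Channel 00000033
# Only this line shape (as pasted in this thread) is assumed.
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+)")


def xids_by_device(lines):
    """Map each PCI address to the list of Xid codes logged for it."""
    found = defaultdict(list)
    for line in lines:
        match = XID_RE.search(line)
        if match:
            found[match.group(1)].append(int(match.group(2)))
    return dict(found)


# Sample lines copied from the logs posted above.
sample = [
    "[610034.645082] NVRM: Xid (PCI:0000:03:00): 8, Channel 00000033",
    "[610036.654963] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context",
    "Oct 25 00:10:58 dmack kernel: [ 1407.852012] NVRM: Xid (PCI:0000:01:00): 8, Channel 00000000",
]

for address, codes in xids_by_device(sample).items():
    print(address, codes)
```

To use it on a real machine, the saved output of dmesg or the syslog file could be read line by line and passed in instead of `sample`.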
The following is the output of glxgears:
#> optirun glxgears
1212 frames in 12.6 seconds = 96.262 FPS
11008 frames in 5.0 seconds = 2201.547 FPS
6564 frames in 7.0 seconds = 938.225 FPS
2167 frames in 5.0 seconds = 433.397 FPS
11132 frames in 5.0 seconds = 2226.254 FPS
6638 frames in 7.0 seconds = 950.558 FPS
11031 frames in 5.0 seconds = 2206.043 FPS
This has always happened for me on this laptop, running various kernels and versions of the nvidia proprietary drivers.
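glxgears normally prints a report about every five seconds, so the 12.6 and 7.0 second intervals above suggest the render loop was blocked for a while. A small sketch to pick those intervals out of a longer run; the function names and the `nominal` and `slack` thresholds are assumptions chosen for illustration, not anything glxgears or the driver defines.

```python
import re

# Matches glxgears report lines such as:
#   1212 frames in 12.6 seconds = 96.262 FPS
LINE_RE = re.compile(r"(\d+) frames in ([\d.]+) seconds = ([\d.]+) FPS")


def parse_glxgears(lines):
    """Return (frames, seconds, fps) tuples for every report line."""
    samples = []
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            samples.append(
                (int(match.group(1)), float(match.group(2)), float(match.group(3)))
            )
    return samples


def stalled(samples, nominal=5.0, slack=0.5):
    """Return the reports whose interval ran longer than nominal + slack seconds."""
    return [s for s in samples if s[1] > nominal + slack]


# Output copied from the glxgears run posted above.
output = """\
1212 frames in 12.6 seconds = 96.262 FPS
11008 frames in 5.0 seconds = 2201.547 FPS
6564 frames in 7.0 seconds = 938.225 FPS
2167 frames in 5.0 seconds = 433.397 FPS
11132 frames in 5.0 seconds = 2226.254 FPS
6638 frames in 7.0 seconds = 950.558 FPS
11031 frames in 5.0 seconds = 2206.043 FPS
""".splitlines()

for frames, seconds, fps in stalled(parse_glxgears(output)):
    print(f"stall: {frames} frames took {seconds} s ({fps} FPS)")
```

Comparing the timestamps of such stalls with the Xid lines in the kernel log might show whether the two coincide.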
My current setup is:
Ubuntu: 15.10
Kernel: 4.2.0-16
Nvidia drivers: nvidia-355
Bumblebee: 3.2.1-9
virtualgl: 2.4.1-1
nvidia: 01:00.0 3D controller: NVIDIA Corporation GM107M [GeForce GTX 850M] (rev a2)
intel: 00:02.0 VGA compatible controller: Intel Corporation 4th Gen Core Processor Integrated Graphics Controller (rev 06)
cpu model name: Intel(R) Core(TM) i7-4800MQ CPU @ 2.70GHz
primusrun yields the same error, but with a black glxgears window.
Any help would be much appreciated! This has bugged me for nearly a year now, despite many attempts to solve it.
Thanks.
Hi NVIDIA folks,
I have another crash caused by the nvidia module. Is there a debugging version you can deploy?
[] ? _nv008585rm+0x90/0x3e0 [nvidia]
PAH
WARNING: CPU: 0 PID: 3 at kernel/watchdog.c:304 watchdog_overflow_callback+0x9a/0xd0()
Jan 7 21:46:07 vesuvius kernel: [705149.651482] Watchdog detected hard LOCKUP on cpu 0
Jan 7 21:46:07 vesuvius kernel: [705149.651483] Modules linked in: snd_seq snd_seq_device nls_utf8 nls_cp437 vfat fat hid_magicmouse hidp rfcomm nvidia(PO) nfsv3 binfmt_misc cpufreq_stats cpufreq_powersave cpufreq_userspace cpufreq_conservative bnep nfc tun 8021q garp mrp stp llc nfsd nfs_acl rpcsec_gss_krb5 auth_rpcgss oid_registry nfsv4 dns_resolver nfs lockd grace sunrpc fscache snd_hda_codec_hdmi snd_hda_intel snd_hda_controller snd_hda_codec snd_hda_core snd_hwdep sp5100_tco btusb btbcm btintel bluetooth evdev joydev kvm_amd snd_pcm rfkill snd_timer mgag200 kvm snd ttm soundcore psmouse amd64_edac_mod edac_mce_amd serio_raw pcspkr drm_kms_helper edac_core sg fam15h_power k10temp i2c_piix4 8250_fintek tpm_tis acpi_cpufreq tpm shpchp button processor thermal_sys md_mod igb ptp pps_core dca i2c_algo_bit loop fuse parport_pc ppdev lp parport ext4 crc16 mbcache jbd2 btrfs xor raid6_pq usb_storage hid_generic usbhid hid sr_mod cdrom sd_mod ohci_pci crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 ata_generic lrw gf128mul glue_helper ablk_helper cryptd ahci mpt3sas ohci_hcd ehci_pci libahci pata_atiixp raid_class ehci_hcd scsi_transport_sas libata drm usbcore scsi_mod usb_common i2c_core zfs(PO) zunicode(PO) zcommon(PO) znvpair(PO) spl(O) zavl(PO) ipmi_watchdog dm_mirror dm_region_hash dm_log dm_mod ipmi_si ipmi_poweroff ipmi_devintf ipmi_msghandler autofs4 [last unloaded: nvidia]
Jan 7 21:46:07 vesuvius kernel: [705149.651590] CPU: 0 PID: 3 Comm: ksoftirqd/0 Tainted: P O 4.1.1 #1
Jan 7 21:46:07 vesuvius kernel: [705149.651592] Hardware name: Supermicro H8DG6/H8DGi/H8DG6/H8DGi, BIOS 3.5 11/25/2013
Jan 7 21:46:07 vesuvius kernel: [705149.651593] 0000000000000000 ffffffff81722646 ffffffff815589ac ffff881027c05b90
Jan 7 21:46:07 vesuvius kernel: [705149.651597] ffffffff8106f241 ffff881021743000 0000000000000000 ffff881027c05c80
Jan 7 21:46:07 vesuvius kernel: [705149.651599] 0000000000000000 ffff881027c05ef8 ffffffff8106f2ba ffffffff81719470
Jan 7 21:46:07 vesuvius kernel: [705149.651602] Call Trace:
Jan 7 21:46:07 vesuvius kernel: [705149.651605] [] ? dump_stack+0x40/0x50
Jan 7 21:46:07 vesuvius kernel: [705149.651614] [] ? warn_slowpath_common+0x81/0xb0
Jan 7 21:46:07 vesuvius kernel: [705149.651616] [] ? warn_slowpath_fmt+0x4a/0x50
Jan 7 21:46:07 vesuvius kernel: [705149.651620] [] ? watchdog_overflow_callback+0x9a/0xd0
Jan 7 21:46:07 vesuvius kernel: [705149.651624] [] ? __perf_event_overflow+0x86/0x220
Jan 7 21:46:07 vesuvius kernel: [705149.651627] [] ? x86_perf_event_set_period+0xd7/0x160
Jan 7 21:46:07 vesuvius kernel: [705149.651630] [] ? x86_pmu_handle_irq+0x125/0x170
Jan 7 21:46:07 vesuvius kernel: [705149.651634] [] ? ioremap_page_range+0x281/0x3f0
Jan 7 21:46:07 vesuvius kernel: [705149.651638] [] ? vunmap_page_range+0x1c3/0x2d0
Jan 7 21:46:07 vesuvius kernel: [705149.651643] [] ? ghes_copy_tofrom_phys+0x121/0x200
Jan 7 21:46:07 vesuvius kernel: [705149.651646] [] ? ghes_read_estatus+0x71/0x150
Jan 7 21:46:07 vesuvius kernel: [705149.651649] [] ? perf_event_nmi_handler+0x2a/0x50
Jan 7 21:46:07 vesuvius kernel: [705149.651652] [] ? nmi_handle+0x8e/0x120
Jan 7 21:46:07 vesuvius kernel: [705149.651654] [] ? default_do_nmi+0x40/0x110
Jan 7 21:46:07 vesuvius kernel: [705149.651656] [] ? do_nmi+0x80/0xc0
Jan 7 21:46:07 vesuvius kernel: [705149.651659] [] ? end_repeat_nmi+0x1e/0x2e
Jan 7 21:46:07 vesuvius kernel: [705149.651798] [] ? _nv008585rm+0x90/0x3e0 [nvidia]
Jan 7 21:46:07 vesuvius kernel: [705149.651810] [] ? io_watchdog_func+0x167/0x3a0 [ohci_hcd]
Jan 7 21:46:07 vesuvius kernel: [705149.651818] [] ? io_watchdog_func+0x167/0x3a0 [ohci_hcd]
Jan 7 21:46:07 vesuvius kernel: [705149.651826] [] ? io_watchdog_func+0x167/0x3a0 [ohci_hcd]
Jan 7 21:46:07 vesuvius kernel: [705149.651828] <> [] ? ohci_dump+0x80/0x80 [ohci_hcd]
Jan 7 21:46:07 vesuvius kernel: [705149.651839] [] ? call_timer_fn+0x30/0x100
Jan 7 21:46:07 vesuvius kernel: [705149.651847] [] ? ohci_dump+0x80/0x80 [ohci_hcd]
Jan 7 21:46:07 vesuvius kernel: [705149.651849] [] ? run_timer_softirq+0x209/0x2f0
Jan 7 21:46:07 vesuvius kernel: [705149.651853] [] ? __do_softirq+0xe0/0x260
Jan 7 21:46:07 vesuvius kernel: [705149.651855] [] ? run_ksoftirqd+0x29/0x70
Jan 7 21:46:07 vesuvius kernel: [705149.651858] [] ? smpboot_thread_fn+0x11f/0x170
Jan 7 21:46:07 vesuvius kernel: [705149.651860] [] ? sort_range+0x30/0x30
Jan 7 21:46:07 vesuvius kernel: [705149.651863] [] ? kthread+0xc1/0xe0
Jan 7 21:46:07 vesuvius kernel: [705149.651866] [] ? kthread_create_on_node+0x180/0x180
Jan 7 21:46:07 vesuvius kernel: [705149.651870] [] ? ret_from_fork+0x42/0x70
Jan 7 21:46:07 vesuvius kernel: [705149.651872] [] ? kthread_create_on_node+0x180/0x180
Jan 7 21:46:07 vesuvius kernel: [705149.651875] ---[ end trace 64766150e53a4982 ]---
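For long traces like the one above, a quick way to see which loadable modules show up is to count the frames that carry a `[module]` tag. This is a rough sketch with a made-up helper name (`module_frames`); note that frames marked with `?` are addresses found on the stack that the kernel could not verify, so a module appearing here is a hint rather than proof that it caused the lockup.

```python
import re
from collections import Counter

# A stack frame looks like "? symbol+0xOFF/0xLEN", optionally followed by
# " [module]" when the symbol belongs to a loadable module.
FRAME_RE = re.compile(r"\? (\S+)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(\w+)\])?")


def module_frames(lines):
    """Count call-trace frames per loadable module; core-kernel frames are skipped."""
    counts = Counter()
    for line in lines:
        match = FRAME_RE.search(line)
        if match and match.group(2):
            counts[match.group(2)] += 1
    return dict(counts)


# A few frames copied from the trace above (addresses were already missing there).
sample = [
    "[705149.651659] [] ? end_repeat_nmi+0x1e/0x2e",
    "[705149.651798] [] ? _nv008585rm+0x90/0x3e0 [nvidia]",
    "[705149.651810] [] ? io_watchdog_func+0x167/0x3a0 [ohci_hcd]",
    "[705149.651828] <> [] ? ohci_dump+0x80/0x80 [ohci_hcd]",
    "[705149.651839] [] ? call_timer_fn+0x30/0x100",
]

for module, count in module_frames(sample).items():
    print(module, count)
```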