Spinlock/Scheduling issues inside the nvidia kernel driver on Linux 4.10

I’ve been experiencing scheduling issues that show up when running complex Vulkan (and, less often, OpenGL) applications.

First of all, I’d like to apologize for reporting these issues against a custom kernel, but I think the report may still have some merit.

Here’s the relevant kernel log output from the time of the incident:

[ 8879.088447] BUG: scheduling while atomic: irq/39-s-nvidia/2636/0x00000002
[ 8879.088450] Modules linked in: snd_seq_dummy snd_hrtimer snd_seq iptable_filter rpcsec_gss_krb5 auth_rpcgss oid_registry nfsv4 dns_resolver nfs lockd grace fscache tun pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_core videodev media snd_usb_audio snd_usbmidi_lib snd_rawmidi snd_seq_device intel_rapl x86_pkg_temp_thermal intel_powerclamp kvm_intel snd_hda_codec_hdmi uinput kvm irqbypass nvidia_uvm(PO) crct10dif_pclmul mxm_wmi nvidia_drm(PO) crc32_pclmul ghash_clmulni_intel pcbc snd_hda_codec_realtek snd_hda_codec_generic aesni_intel aes_x86_64 crypto_simd glue_helper evdev cryptd serio_raw snd_pcsp drm_kms_helper snd_hda_intel syscopyarea sysfillrect snd_hda_codec sysimgblt snd_hwdep fb_sys_fops snd_hda_core snd_pcm
[ 8879.088460]  tpm_tis drm mei_me snd_timer tpm_tis_core tpm mei wmi snd button video acpi_pad soundcore nvidia_modeset(PO) nvidia(PO) sunrpc ip_tables x_tables hid_generic usbhid uas hid usb_storage crc32c_intel ahci xhci_pci libahci ehci_pci xhci_hcd ehci_hcd usbcore fan
[ 8879.088467] CPU: 7 PID: 2636 Comm: irq/39-s-nvidia Tainted: P           O    4.10.0-pf1turbokoopa #28
[ 8879.088468] Hardware name: ASUS All Series/Z97-A-USB31, BIOS 2501 06/24/2015
[ 8879.088468] Call Trace:
[ 8879.088471]  ? dump_stack+0x46/0x5a
[ 8879.088472]  ? __schedule_bug+0x3d/0x50
[ 8879.088474]  ? __schedule+0x865/0xa80
[ 8879.088475]  ? ttwu_do_wakeup+0x2e/0x50
[ 8879.088476]  ? schedule+0x34/0xc0
[ 8879.088477]  ? schedule_timeout+0x14f/0x1a0
[ 8879.088574]  ? _nv011712rm+0xfd/0x170 [nvidia]
[ 8879.088605]  ? os_acquire_spinlock+0x9/0x20 [nvidia]
[ 8879.088606]  ? __down+0x61/0xa0
[ 8879.088608]  ? down+0x36/0x50
[ 8879.088651]  ? nv_get_adapter_state+0x24/0xb0 [nvidia]
[ 8879.088720]  ? _nv017649rm+0xdc/0x120 [nvidia]
[ 8879.088797]  ? _nv008529rm+0x63/0x290 [nvidia]
[ 8879.088884]  ? _nv010451rm+0xa7/0xb0 [nvidia]
[ 8879.088977]  ? _nv014947rm+0x5b2/0x5d0 [nvidia]
[ 8879.089038]  ? _nv000809rm+0x10c/0x120 [nvidia]
[ 8879.089039]  ? irq_forced_thread_fn+0x60/0x60
[ 8879.089095]  ? rm_isr_bh+0x23/0x70 [nvidia]
[ 8879.089123]  ? nvidia_isr_common_bh+0x33/0x60 [nvidia]
[ 8879.089124]  ? irq_thread_fn+0x16/0x40
[ 8879.089125]  ? irq_thread+0x109/0x190
[ 8879.089126]  ? wake_threads_waitq+0x30/0x30
[ 8879.089127]  ? kthread+0xea/0x120
[ 8879.089127]  ? irq_thread_dtor+0xb0/0xb0
[ 8879.089128]  ? kthread_create_on_node+0x40/0x40
[ 8879.089129]  ? do_group_exit+0x2e/0xa0
[ 8879.089130]  ? ret_from_fork+0x23/0x30
[ 8879.089150] BUG: scheduling while atomic: irq/39-s-nvidia/2636/0x00000000
[ 8879.089152] Modules linked in: snd_seq_dummy snd_hrtimer snd_seq iptable_filter rpcsec_gss_krb5 auth_rpcgss oid_registry nfsv4 dns_resolver nfs lockd grace fscache tun pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_core videodev media snd_usb_audio snd_usbmidi_lib snd_rawmidi snd_seq_device intel_rapl x86_pkg_temp_thermal intel_powerclamp kvm_intel snd_hda_codec_hdmi uinput kvm irqbypass nvidia_uvm(PO) crct10dif_pclmul mxm_wmi nvidia_drm(PO) crc32_pclmul ghash_clmulni_intel pcbc snd_hda_codec_realtek snd_hda_codec_generic aesni_intel aes_x86_64 crypto_simd glue_helper evdev cryptd serio_raw snd_pcsp drm_kms_helper snd_hda_intel syscopyarea sysfillrect snd_hda_codec sysimgblt snd_hwdep fb_sys_fops snd_hda_core snd_pcm
[ 8879.089171]  tpm_tis drm mei_me snd_timer tpm_tis_core tpm mei wmi snd button video acpi_pad soundcore nvidia_modeset(PO) nvidia(PO) sunrpc ip_tables x_tables hid_generic usbhid uas hid usb_storage crc32c_intel ahci xhci_pci libahci ehci_pci xhci_hcd ehci_hcd usbcore fan
[ 8879.089181] CPU: 7 PID: 2636 Comm: irq/39-s-nvidia Tainted: P        W  O    4.10.0-pf1turbokoopa #28
[ 8879.089181] Hardware name: ASUS All Series/Z97-A-USB31, BIOS 2501 06/24/2015
[ 8879.089182] Call Trace:
[ 8879.089183]  ? dump_stack+0x46/0x5a
[ 8879.089184]  ? __schedule_bug+0x3d/0x50
[ 8879.089185]  ? __schedule+0x865/0xa80
[ 8879.089186]  ? irq_forced_thread_fn+0x60/0x60
[ 8879.089186]  ? schedule+0x34/0xc0
[ 8879.089188]  ? irq_thread+0x9a/0x190
[ 8879.089189]  ? wake_threads_waitq+0x30/0x30
[ 8879.089190]  ? kthread+0xea/0x120
[ 8879.089191]  ? irq_thread_dtor+0xb0/0xb0
[ 8879.089192]  ? kthread_create_on_node+0x40/0x40
[ 8879.089193]  ? do_group_exit+0x2e/0xa0
[ 8879.089194]  ? ret_from_fork+0x23/0x30

And here is a debug log/dump generated during normal operation (I’ll try to capture one the next time this happens, so that there’s more relevant info): http://paste.qc.to/5tlLYUkquL?mime=application/octet-stream (nvidia-bug-report.log.gz)