Kernel freeze, unkillable X (Ubuntu 16.04.2)

On kernel 4.4 with the NVIDIA 381.22 driver.

The freeze happens when you attach a DFP to a GRID card.

[ 89.320648] nvidia-modeset: Allocated GPU:0 (GPU-a555a536-6ff1-f876-a430-3eabc71e5bbf) @ PCI:0000:83:00.0
[ 94.077780] nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.

[ 436.176874] NMI watchdog: BUG: soft lockup - CPU#15 stuck for 23s! [Xorg:2227]
[ 436.176878] Modules linked in: ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter ip_tables xt_conntrack x_tables nf_nat nf_conntrack br_netfilter bridge stp llc aufs nvidia_uvm(POE) joydev input_leds intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ipmi_ssif binfmt_misc ipmi_si ipmi_msghandler irqbypass serio_raw lpc_ich ioatdma wmi mei_me sb_edac 8250_fintek shpchp dca edac_core mei mac_hid ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 multipath linear hid_generic nvidia_drm(POE)
[ 436.176915] usbhid nvidia_modeset(POE) hid crct10dif_pclmul crc32_pclmul ast ghash_clmulni_intel i2c_algo_bit aesni_intel ttm aes_x86_64 lrw drm_kms_helper bnx2x syscopyarea e1000e sysfillrect vxlan ip6_udp_tunnel sysimgblt udp_tunnel gf128mul fb_sys_fops isci glue_helper ptp ablk_helper ahci libsas cryptd psmouse nvidia(POE) drm libahci pps_core scsi_transport_sas mdio libcrc32c fjes
[ 436.176942] CPU: 15 PID: 2227 Comm: Xorg Tainted: P OEL 4.4.0-78-generic #99-Ubuntu
[ 436.176944] Hardware name: OOMMM
[ 436.176945] task: ffff881054f18e00 ti: ffff88085a2b8000 task.ti: ffff88085a2b8000
[ 436.176947] RIP: 0010:[] [] _raw_spin_unlock_irqrestore+0x15/0x20
[ 436.176951] RSP: 0018:ffff88085a2bb9f8 EFLAGS: 00000246
[ 436.176952] RAX: 000000000000381f RBX: 0000000000000000 RCX: 0000000000000000
[ 436.176953] RDX: 0000000000000cfc RSI: 0000000000000246 RDI: 0000000000000246
[ 436.176955] RBP: ffff88085a2bb9f8 R08: 0000000000000004 R09: ffff88085a2bba14
[ 436.176956] R10: 0000000000000000 R11: ffffffffc06880e0 R12: ffff88085a2bba60
[ 436.176957] R13: 0000000000000246 R14: ffff88085b352000 R15: 000000000001000c
[ 436.176959] FS: 00007fe9f4604a00(0000) GS:ffff88105f3c0000(0000) knlGS:0000000000000000
[ 436.176960] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 436.176961] CR2: 00007f4c01ffae78 CR3: 0000000853876000 CR4: 00000000000406e0
[ 436.176962] Stack:
[ 436.176963] ffff88085a2bba38 ffffffff8143f52b 0000000000000020 0000381f8143f47c
[ 436.176965] 000000008296d3fd ffff88105b8e6120 ffff88105b8e6000 0000000000000002
[ 436.176967] ffff88085a2bba48 ffffffffc011691e ffff88085a2bba90 ffffffffc0109dcf
[ 436.176969] Call Trace:
[ 436.176975] [] pci_bus_read_config_dword+0x9b/0xb0
[ 436.177082] [] os_pci_read_dword+0x2e/0x40 [nvidia]
[ 436.177151] [] nv_check_pci_config_space+0x27f/0x300 [nvidia]
[ 436.177220] [] nv_verify_pci_config+0x81/0x90 [nvidia]
[ 436.177331] [] _nv017843rm+0x58/0x70 [nvidia]
[ 436.177451] [] ? _nv000219rm+0x4b/0x70 [nvidia]
[ 436.177564] [] ? _nv008370rm+0x1bf/0x2b0 [nvidia]
[ 436.177688] [] ? _nv002664rm+0x9/0x30 [nvidia]
[ 436.177810] [] ? _nv002839rm+0x15/0x80 [nvidia]
[ 436.177933] [] ? _nv005166rm+0x1e4/0x220 [nvidia]
[ 436.178056] [] ? _nv005165rm+0xc1/0xe0 [nvidia]
[ 436.178178] [] ? _nv020844rm+0x3e/0x1a0 [nvidia]
[ 436.178296] [] ? _nv000802rm+0x22b/0x3b0 [nvidia]
[ 436.178414] [] ? _nv006562rm+0x3c5/0x450 [nvidia]
[ 436.178531] [] ? _nv000802rm+0x36/0x3b0 [nvidia]
[ 436.178649] [] ? _nv003762rm+0x602/0x26c0 [nvidia]
[ 436.178759] [] ? rm_kernel_rmapi_op+0xb1/0x1f0 [nvidia]
[ 436.178777] [] ? nvkms_call_rm+0x59/0x70 [nvidia_modeset]
[ 436.178790] [] ? _nv002011kms+0x4e/0x70 [nvidia_modeset]
[ 436.178802] [] ? _nv000200kms+0x147/0x360 [nvidia_modeset]
[ 436.178810] [] ? _nv000198kms+0x40/0x40 [nvidia_modeset]
[ 436.178822] [] ? _nv001992kms+0x10/0x20 [nvidia_modeset]
[ 436.178832] [] ? _nv000197kms+0x5d/0xe0 [nvidia_modeset]
[ 436.178840] [] ? _nv000333kms+0x6a/0x80 [nvidia_modeset]
[ 436.178848] [] ? nvKmsIoctl+0x163/0x1e0 [nvidia_modeset]
[ 436.178857] [] ? nvkms_ioctl_common+0x45/0x80 [nvidia_modeset]
[ 436.178865] [] ? nvkms_ioctl+0x71/0xa0 [nvidia_modeset]
[ 436.178933] [] ? nvidia_frontend_compat_ioctl+0x40/0x50 [nvidia]
[ 436.179002] [] ? nvidia_frontend_unlocked_ioctl+0xe/0x10 [nvidia]
[ 436.179005] [] ? do_vfs_ioctl+0x29f/0x490
[ 436.179007] [] ? __do_page_fault+0x1b4/0x400
[ 436.179009] [] ? SyS_ioctl+0x79/0x90
[ 436.179012] [] ? entry_SYSCALL_64_fastpath+0x16/0x71
[ 436.179013] Code: 90 66 66 90 eb c6 31 c0 eb ca e8 97 0c 84 ff 90 90 90 90 90 90 90 66 66 66 66 90 55 48 89 e5 c6 07 00 66 66 66 90 48 89 f7 57 9d <66> 66 90 66 90 5d c3 0f 1f 40 00 66 66 66 66 90 55 48 89 e5 c6
[ 440.424792] INFO: rcu_sched self-detected stall on CPU
[ 440.424796] 15-…: (14999 ticks this GP) idle=46f/140000000000001/0 softirq=3402/3402 fqs=14996
[ 440.424797] (t=15000 jiffies g=4291 c=4290 q=3781)
[ 440.424801] Task dump for CPU 15:
[ 440.424803] Xorg R running task 0 2227 2194 0x0040000a
[ 440.424805] ffff881054f18e00 000000008296d3fd ffff88105f3c3da8 ffffffff810aecd9
[ 440.424807] 000000000000000f ffffffff81e52ec0 ffff88105f3c3dc0 ffffffff810b1527
[ 440.424809] 0000000000000010 ffff88105f3c3df0 ffffffff810e501e ffff88105f3d7b40
[ 440.424811] Call Trace:
[ 440.424812] [] sched_show_task+0xa9/0x110
[ 440.424820] [] dump_cpu_task+0x37/0x40
[ 440.424824] [] rcu_dump_cpu_stacks+0x8e/0xe0
[ 440.424826] [] rcu_check_callbacks+0x4fa/0x7f0
[ 440.424832] [] ? acct_account_cputime+0x1c/0x20
[ 440.424835] [] ? account_system_time+0x7f/0x110
[ 440.424849] [] ? tick_sched_handle.isra.14+0x60/0x60
[ 440.424851] [] update_process_times+0x39/0x60
[ 440.424853] [] tick_sched_handle.isra.14+0x25/0x60
[ 440.424855] [] tick_sched_timer+0x3d/0x70
[ 440.424857] [] __hrtimer_run_queues+0x102/0x290
[ 440.424859] [] hrtimer_interrupt+0xa8/0x1a0
[ 440.424863] [] local_apic_timer_interrupt+0x38/0x60
[ 440.424867] [] smp_apic_timer_interrupt+0x3d/0x50
[ 440.424869] [] apic_timer_interrupt+0x82/0x90
[ 440.424870] [] ? _nv012410rm+0x60/0x60 [nvidia]
[ 440.424987] [] ? _raw_spin_unlock_irqrestore+0x15/0x20
[ 440.424989] [] pci_bus_read_config_dword+0x9b/0xb0
[ 440.425060] [] os_pci_read_dword+0x2e/0x40 [nvidia]
[ 440.425129] [] nv_check_pci_config_space+0x1c9/0x300 [nvidia]
[ 440.425198] [] nv_verify_pci_config+0x81/0x90 [nvidia]
[ 440.425310] [] _nv017843rm+0x58/0x70 [nvidia]
[ 440.425429] [] ? _nv000219rm+0x4b/0x70 [nvidia]
[ 440.425542] [] ? _nv008370rm+0x1bf/0x2b0 [nvidia]
[ 440.425664] [] ? _nv002664rm+0x9/0x30 [nvidia]
[ 440.425786] [] ? _nv002839rm+0x15/0x80 [nvidia]
[ 440.425909] [] ? _nv005166rm+0x1e4/0x220 [nvidia]
[ 440.426031] [] ? _nv005165rm+0xc1/0xe0 [nvidia]
[ 440.426154] [] ? _nv020844rm+0x3e/0x1a0 [nvidia]
[ 440.426272] [] ? _nv000802rm+0x22b/0x3b0 [nvidia]
[ 440.426390] [] ? _nv006562rm+0x3c5/0x450 [nvidia]
[ 440.426507] [] ? _nv000802rm+0x36/0x3b0 [nvidia]
[ 440.426625] [] ? _nv003762rm+0x602/0x26c0 [nvidia]
[ 440.426735] [] ? rm_kernel_rmapi_op+0xb1/0x1f0 [nvidia]
[ 440.426744] [] ? nvkms_call_rm+0x59/0x70 [nvidia_modeset]
[ 440.426757] [] ? _nv002011kms+0x4e/0x70 [nvidia_modeset]
[ 440.426769] [] ? _nv000200kms+0x147/0x360 [nvidia_modeset]
[ 440.426778] [] ? _nv000198kms+0x40/0x40 [nvidia_modeset]
[ 440.426790] [] ? _nv001992kms+0x10/0x20 [nvidia_modeset]
[ 440.426799] [] ? _nv000197kms+0x5d/0xe0 [nvidia_modeset]
[ 440.426807] [] ? _nv000333kms+0x6a/0x80 [nvidia_modeset]
[ 440.426816] [] ? nvKmsIoctl+0x163/0x1e0 [nvidia_modeset]
[ 440.426824] [] ? nvkms_ioctl_common+0x45/0x80 [nvidia_modeset]
[ 440.426832] [] ? nvkms_ioctl+0x71/0xa0 [nvidia_modeset]
[ 440.426901] [] ? nvidia_frontend_compat_ioctl+0x40/0x50 [nvidia]
[ 440.426969] [] ? nvidia_frontend_unlocked_ioctl+0xe/0x10 [nvidia]
[ 440.426971] [] ? do_vfs_ioctl+0x29f/0x490
[ 440.426973] [] ? __do_page_fault+0x1b4/0x400
[ 440.426975] [] ? SyS_ioctl+0x79/0x90
[ 440.426977] [] ? entry_SYSCALL_64_fastpath+0x16/0x71
[ 468.176270] NMI watchdog: BUG: soft lockup - CPU#15 stuck for 23s! [Xorg:2227]

The kernel is stuck; the only way to soft reboot is:
echo 1 > /proc/sys/kernel/sysrq
echo b > /proc/sysrq-trigger
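
For anyone hitting the same lockup, here is a slightly gentler variant of the two commands above, as a sketch only. The `s` (sync) and `u` (remount read-only) SysRq commands are standard kernel ones and are issued before `b`, so dirty data gets a chance to be flushed. The function name `sysrq_reboot` and its arguments are hypothetical, added so it can be dry-run against ordinary files. Whether the sync actually completes on a machine in this state is not guaranteed.

```shell
#!/bin/sh
# Hypothetical helper: emergency reboot through magic SysRq.
# $1 = SysRq enable file, $2 = SysRq trigger file, $3 = delay in seconds.
# The defaults are the real procfs paths; the arguments exist only so the
# function can be tried against plain files without rebooting anything.
sysrq_reboot() {
    enable_file="${1:-/proc/sys/kernel/sysrq}"
    trigger_file="${2:-/proc/sysrq-trigger}"
    delay="${3:-2}"

    # Enable all SysRq functions.
    echo 1 > "$enable_file" || return 1

    # s = sync filesystems, u = remount filesystems read-only.
    echo s > "$trigger_file" || return 1
    sleep "$delay"
    echo u > "$trigger_file" || return 1
    sleep "$delay"

    # b = reboot immediately, without any further cleanup.
    echo b > "$trigger_file" || return 1
}

# On the stuck machine, as root:
#   sysrq_reboot
```

Calling `sysrq_reboot` with no arguments as root reboots the machine; nothing happens until the function is called.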

Please run nvidia-bug-report.sh as root and attach the nvidia-bug-report.log.gz file it creates to your post.

That report contains confidential information. I can send it privately to an NVIDIA developer if required.

Also, if specific details are required, I can snip them and paste them here.
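
One way such a snippet could be produced, as a rough sketch: keep only the kernel log lines that mention the NVIDIA driver and mask the GPU UUID before pasting. The function name `nv_snip` and the `GPU-REDACTED` placeholder are made up for illustration, and the pattern covers only the UUID format seen in the log above. Anything else that is confidential would still need to be removed by hand.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin, keep the lines that
# mention NVRM or nvidia, and mask GPU UUIDs of the form
# GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.
nv_snip() {
    grep -E 'NVRM|nvidia' |
        sed -E 's/GPU-[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}/GPU-REDACTED/g'
}

# Usage:
#   dmesg | nv_snip > nvidia-snippet.txt
```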

Red messages in dmesg. Full kernel lockup: unable to kill any processes, unable to kill X, unable to reboot. X uses 100% of one core.

[ 1717.962551] NVRM: Xid (PCI:0000:83:00): 16, Head 00000000 Count 00000000
[ 1725.962373] NVRM: Xid (PCI:0000:83:00): 16, Head 00000000 Count 00000001
[ 1733.962260] NVRM: Xid (PCI:0000:83:00): 16, Head 00000000 Count 00000002
[ 1741.962155] NVRM: Xid (PCI:0000:83:00): 16, Head 00000000 Count 00000003
[ 1749.962043] NVRM: Xid (PCI:0000:83:00): 16, Head 00000000 Count 00000004
[ 1757.961929] NVRM: Xid (PCI:0000:83:00): 16, Head 00000000 Count 00000005
[ 1765.961821] NVRM: Xid (PCI:0000:83:00): 16, Head 00000000 Count 00000006

[ 1793.511952] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000947d:0:0
[ 1801.513345] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000927c:0:0
[ 1829.524787] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000947d:0:0
[ 1837.526350] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000927c:0:0
[ 1865.538484] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000947d:0:0
[ 1873.540016] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000927c:0:0
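
To get a quick picture of how often each of these messages shows up, something along these lines may help. It is a sketch that only counts lines matching the three message texts quoted in this thread (Xid reports, display engine idle timeouts, soft lockups). The function name `nv_log_summary` and the output format are my own invention, not anything provided by the driver.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and count the lines
# matching the error messages quoted in this thread.
nv_log_summary() {
    awk '
        /NVRM: Xid/                       { xid++ }
        /Idling display engine timed out/ { idle++ }
        /soft lockup/                     { lockup++ }
        END {
            # Counters that never matched print as 0.
            printf "xid=%d idle_timeout=%d soft_lockup=%d\n", xid, idle, lockup
        }
    '
}

# Usage:
#   dmesg | nv_log_summary
```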