Random hangs on 4x Titan X configuration (driver 375.26)

We are running a deep learning rig with four Nvidia Titan X GPUs. Recently we started experiencing random hangs that force us to power cycle the machine frequently. After each freeze, we can no longer run nvidia-smi and none of the GPUs are accessible. Rebooting from the console does not fix the problem. Each incident is followed by this entry in kern.log:

Nov 29 12:07:06 deep01 kernel: [ 2920.003490] CPU: 0 PID: 18816 Comm: nvidia-smi Tainted: P           OE   4.4.0-66-generic #87-Ubuntu
Nov 29 12:07:06 deep01 kernel: [ 2920.003511] Hardware name: ASUS All Series/X99-E WS, BIOS 1301 08/05/2015
Nov 29 12:07:06 deep01 kernel: [ 2920.003527] task: ffff88104d75c600 ti: ffff880f7eac4000 task.ti: ffff880f7eac4000
Nov 29 12:07:06 deep01 kernel: [ 2920.003544] RIP: 0010:[<ffffffffc0b776cb>]  [<ffffffffc0b776cb>] _nv006648rm+0x13b/0x300 [nvidia]
Nov 29 12:07:06 deep01 kernel: [ 2920.003668] RSP: 0018:ffff880f7eac7970  EFLAGS: 00010246
Nov 29 12:07:06 deep01 kernel: [ 2920.003683] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
Nov 29 12:07:06 deep01 kernel: [ 2920.003702] RDX: ffff88102244d408 RSI: ffff881015244008 RDI: ffff88102244d408
Nov 29 12:07:06 deep01 kernel: [ 2920.003721] RBP: ffff881021912ce8 R08: 0000000000000000 R09: 0000000000000200
Nov 29 12:07:06 deep01 kernel: [ 2920.003740] R10: ffff8810544e6008 R11: ffff88105ec03200 R12: ffff8810544e6008
Nov 29 12:07:06 deep01 kernel: [ 2920.003759] R13: ffff8810544e6010 R14: ffff881015244008 R15: ffff88102244d408
Nov 29 12:07:06 deep01 kernel: [ 2920.003779] FS:  00007fc51d4ef700(0000) GS:ffff88105f200000(0000) knlGS:0000000000000000
Nov 29 12:07:06 deep01 kernel: [ 2920.004713] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 29 12:07:06 deep01 kernel: [ 2920.005182] CR2: 00007ffe1abd6fc8 CR3: 0000001022178000 CR4: 00000000001406f0
Nov 29 12:07:06 deep01 kernel: [ 2920.005654] Stack:
Nov 29 12:07:06 deep01 kernel: [ 2920.006109]  ffff8810544e6010 0000000000000007 0000000000000001 ffff881021912e20
Nov 29 12:07:06 deep01 kernel: [ 2920.007027]  0000000000000008 ffffffffc0b64f8f 0000000000000000 ffff881055930008
Nov 29 12:07:06 deep01 kernel: [ 2920.007845]  ffff880f9b702008 ffff881050294008 ffff88104d3bc508 ffffffffc0b67a61
Nov 29 12:07:06 deep01 kernel: [ 2920.008572] Call Trace:
Nov 29 12:07:06 deep01 kernel: [ 2920.008986]  [<ffffffffc0b64f8f>] ? _nv021955rm+0x1e0f/0x7100 [nvidia]
Nov 29 12:07:06 deep01 kernel: [ 2920.009404]  [<ffffffffc0b67a61>] ? _nv021955rm+0x48e1/0x7100 [nvidia]
Nov 29 12:07:06 deep01 kernel: [ 2920.009811]  [<ffffffffc0b67a28>] ? _nv021955rm+0x48a8/0x7100 [nvidia]
Nov 29 12:07:06 deep01 kernel: [ 2920.010204]  [<ffffffffc0b3930e>] ? _nv021804rm+0x17e/0x2a0 [nvidia]
Nov 29 12:07:06 deep01 kernel: [ 2920.010589]  [<ffffffffc0b3921f>] ? _nv021804rm+0x8f/0x2a0 [nvidia]
Nov 29 12:07:06 deep01 kernel: [ 2920.010984]  [<ffffffffc0fd7c5c>] ? _nv017596rm+0x24c/0xad0 [nvidia]
Nov 29 12:07:06 deep01 kernel: [ 2920.011388]  [<ffffffffc0fd9219>] ? _nv000800rm+0x279/0x6e0 [nvidia]
Nov 29 12:07:06 deep01 kernel: [ 2920.011786]  [<ffffffffc0fcd1d8>] ? rm_init_adapter+0x128/0x130 [nvidia]
Nov 29 12:07:06 deep01 kernel: [ 2920.012091]  [<ffffffff810ac4e5>] ? wake_up_process+0x15/0x20
Nov 29 12:07:06 deep01 kernel: [ 2920.012425]  [<ffffffffc0a2347d>] ? nv_open_device+0x12d/0x6d0 [nvidia]
Nov 29 12:07:06 deep01 kernel: [ 2920.012756]  [<ffffffffc0a23cfd>] ? nvidia_open+0x14d/0x2f0 [nvidia]
Nov 29 12:07:06 deep01 kernel: [ 2920.013079]  [<ffffffffc0a22328>] ? nvidia_frontend_open+0x58/0xa0 [nvidia]
Nov 29 12:07:06 deep01 kernel: [ 2920.013379]  [<ffffffff8121371f>] ? chrdev_open+0xbf/0x1b0
Nov 29 12:07:06 deep01 kernel: [ 2920.013658]  [<ffffffff8120c84f>] ? do_dentry_open+0x1ff/0x310
Nov 29 12:07:06 deep01 kernel: [ 2920.013932]  [<ffffffff81213660>] ? cdev_put+0x30/0x30
Nov 29 12:07:06 deep01 kernel: [ 2920.014202]  [<ffffffff8120d9e4>] ? vfs_open+0x54/0x80
Nov 29 12:07:06 deep01 kernel: [ 2920.014469]  [<ffffffff8121971b>] ? may_open+0x5b/0xf0
Nov 29 12:07:06 deep01 kernel: [ 2920.014734]  [<ffffffff8121d5a7>] ? path_openat+0x1b7/0x1330
Nov 29 12:07:06 deep01 kernel: [ 2920.014994]  [<ffffffff8121e894>] ? putname+0x54/0x60
Nov 29 12:07:06 deep01 kernel: [ 2920.015245]  [<ffffffff8121f911>] ? do_filp_open+0x91/0x100
Nov 29 12:07:06 deep01 kernel: [ 2920.015510]  [<ffffffff8122d216>] ? __alloc_fd+0x46/0x190
Nov 29 12:07:06 deep01 kernel: [ 2920.015765]  [<ffffffff8120ddb8>] ? do_sys_open+0x138/0x2a0
Nov 29 12:07:06 deep01 kernel: [ 2920.016001]  [<ffffffff8120df3e>] ? SyS_open+0x1e/0x20
Nov 29 12:07:06 deep01 kernel: [ 2920.016231]  [<ffffffff8183c5f2>] ? entry_SYSCALL_64_fastpath+0x16/0x71
Nov 29 12:07:06 deep01 kernel: [ 2920.016456] Code: 03 00 31 c0 e8 27 a3 3d 00 0f 1f 80 00 00 00 00 4c 89 f6 44 8b 45 04 8b 4d 00 4c 89 fa 48 8b 7d 08 41 ff 97 18 01 00 00 4c 89 ff <41> ff 97 d0 00 00 00 41 89 84 24 7c 10 00 00 8b 75 04 85 f6 0f
Nov 29 12:07:06 deep01 kernel: [ 2920.017172] RIP  [<ffffffffc0b776cb>] _nv006648rm+0x13b/0x300 [nvidia]
Nov 29 12:07:06 deep01 kernel: [ 2920.017458]  RSP <ffff880f7eac7970>
Nov 29 12:07:06 deep01 kernel: [ 2920.018111] ---[ end trace 2ad4840a7f498189 ]---

nvidia-bug-report.log.gz (564 KB)