sudden increase in load 4 -> 258 in few minutes with NVidia DNN training

Hello

I have a system with two NVidia gtx 1080 graphics cards for deep
neural net training. Every once in a while (1x per hour–day) the
load of the system goes up from 4 to 258 in a matter of minutes,
without CPU utilization. The systems becomes unusable, only hard
reboots suffice.

It seems to happen only when both graphics cards are used for
computation (training of a deep neural net), otherwise the system has
been up for 100+ days.

My suspicion is that there is bug in the driver from NVidia (version
375.66), I have observed this behaviour at (debian stock) linux
kernels 4.9.30 and 3.16.39. I also have an almost identical machine with
exactly the same problem.

I don’t know where to look for a clue to the problem, or for a
workable solution, I have updated systems, drivers, but the problems
stay. (A while ago, when I had only one GPU card per system, the problems
didn’t seem to be there). Therefore I have included some logs,
perhaps someone can help me finding what is going on.

I am further debugging with a load monitoring script what happens just
before the load goes up, it seems a whole lot of new, blocking,
processes are started up when things go hay-wiry:

(datetime INFO /proc/loadavg)

2017-06-18 15:13:13,296 INFO 3.05 3.32 2.89 5/296 10257
2017-06-18 15:13:23,304 INFO 3.27 3.36 2.91 8/289 10292
2017-06-18 15:13:33,314 INFO 3.30 3.36 2.92 1/257 10340
2017-06-18 15:13:43,325 INFO 3.34 3.37 2.92 12/292 10547
2017-06-18 15:13:53,332 INFO 3.67 3.44 2.95 7/281 10556
2017-06-18 15:14:03,343 INFO 5.23 3.77 3.07 1/282 10613
2017-06-18 15:14:13,353 INFO 10.00 4.83 3.42 1/302 10633
2017-06-18 15:14:23,363 INFO 17.10 6.52 3.98 1/322 10653
2017-06-18 15:14:33,374 INFO 26.19 8.81 4.76 1/343 10683
2017-06-18 15:14:43,384 INFO 37.03 11.70 5.74 1/360 10703
2017-06-18 15:14:53,394 INFO 49.36 15.17 6.93 1/380 10723
2017-06-18 15:15:03,405 INFO 62.87 19.18 8.33 1/402 10745
2017-06-18 15:15:13,415 INFO 77.37 23.72 9.92 1/421 10765
2017-06-18 15:15:23,425 INFO 92.72 28.77 11.71 1/441 10785
2017-06-18 15:15:33,436 INFO 108.79 34.31 13.69 1/454 10816
2017-06-18 15:15:43,446 INFO 125.46 40.33 15.87 1/474 10836
2017-06-18 15:15:53,456 INFO 142.64 46.81 18.24 1/489 10856
2017-06-18 15:16:03,467 INFO 159.93 53.66 20.78 1/502 10870
2017-06-18 15:16:13,477 INFO 175.01 60.39 23.32 1/500 10870
2017-06-18 15:16:23,487 INFO 187.77 66.90 25.84 1/498 10870
2017-06-18 15:16:33,498 INFO 198.57 73.19 28.32 1/498 10871
2017-06-18 15:16:43,508 INFO 207.71 79.28 30.79 1/498 10879
2017-06-18 15:16:53,519 INFO 215.44 85.16 33.22 1/493 10879

Kernel messages suggest there is something going on with the driver indeed.

(pardon the bad formatting)

Jun 18 15:13:36 deep-novo-2 kernel: [ 2120.959968] nvidia
0000:04:00.0: irq 93 for MSI/MSI-X
Jun 18 15:13:57 deep-novo-2 kernel: [ 2141.101859] nvidia
0000:03:00.0: irq 92 for MSI/MSI-X
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.008023] INFO: task
kworker/9:5:6899 blocked for more than 120 seconds.
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.008043] Tainted: P
O 3.16.0-4-amd64 #1
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.008055] “echo 0 >
/proc/sys/kernel/hung_task_timeout_secs” disables this message.
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.008072] kworker/9:5 D
ffff88088fa26e78 0 6899 2 0x00000000
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.008216] Workqueue: events
os_execute_work_item [nvidia]
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.008219] ffff88088fa26a20
0000000000000046 0000000000012f40 ffff88088fa37fd8
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.008224] 0000000000012f40
ffff88088fa26a20 ffff882036378608 ffff88088fa37c68
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.008227] 7fffffffffffffff
0000000000000002 0000000000000000 ffff88088fa26a20
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.008231] Call Trace:
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.008241]
[] ? schedule_timeout+0x259/0x2d0
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.008248]
[] ? __switch_to+0xde/0x5a0
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.008252]
[] ? __schedule+0x2b1/0x6f0
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.008256]
[] ? __down_common+0xa0/0xf3
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.008262]
[] ? down+0x3b/0x50
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.008351]
[] ? os_acquire_mutex+0x43/0x50 [nvidia]
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.008522]
[] ? _nv017509rm+0x18/0x30 [nvidia]
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.008688]
[] ? _nv019728rm+0x3d/0x120 [nvidia]
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.008850]
[] ? _nv006687rm+0x889/0xa60 [nvidia]
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.009022]
[] ? _nv000785rm+0x225/0xcd0 [nvidia]
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.009193]
[] ? rm_execute_work_item+0x49/0xc0 [nvidia]
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.009282]
[] ? os_execute_work_item+0x40/0x60 [nvidia]
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.009287]
[] ? process_one_work+0x143/0x430
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.009290]
[] ? worker_thread+0x113/0x4f0
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.009293]
[] ? rescuer_thread+0x2d0/0x2d0
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.009298]
[] ? kthread+0xbd/0xe0
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.009302]
[] ? kthread_create_on_node+0x180/0x180
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.009307]
[] ? ret_from_fork+0x58/0x90
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.009311]
[] ? kthread_create_on_node+0x180/0x180
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.009315] INFO: task
kworker/9:7:6901 blocked for more than 120 seconds.
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.009329] Tainted: P
O 3.16.0-4-amd64 #1
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.009341] “echo 0 >
/proc/sys/kernel/hung_task_timeout_secs” disables this message.
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.009357] kworker/9:7 D
ffff88088fa33808 0 6901 2 0x00000000
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.009448] Workqueue: events
os_execute_work_item [nvidia]
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.009450] ffff88088fa333b0
0000000000000046 0000000000012f40 ffff88088fa47fd8
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.009453] 0000000000012f40
ffff88088fa333b0 ffff882036378608 ffff88088fa47c58
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.009457] 7fffffffffffffff
0000000000000002 0000000000000000 ffff88088fa333b0
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.009460] Call Trace:
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.009465]
[] ? schedule_timeout+0x259/0x2d0
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.009709]
[] ? _nv012002rm+0x37b/0x400 [nvidia]
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.009954]
[] ? _nv011729rm+0xfd/0x170 [nvidia]
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.010043]
[] ? os_acquire_spinlock+0xe/0x20 [nvidia]
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.010287]
[] ? _nv012002rm+0x37b/0x400 [nvidia]
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.010292]
[] ? __down_common+0xa0/0xf3
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.010297]
[] ? hw_breakpoint_exceptions_notify+0xd0/0x150
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.010302]
[] ? down+0x3b/0x50
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.010390]
[] ? os_acquire_mutex+0x43/0x50 [nvidia]
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.010561]
[] ? _nv017509rm+0x18/0x30 [nvidia]
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.010726]
[] ? _nv019728rm+0x3d/0x120 [nvidia]
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.011004]
[] ? _nv018215rm+0x118/0x200 [nvidia]
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.011174]
[] ? _nv000785rm+0x225/0xcd0 [nvidia]
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.011344]
[] ? rm_execute_work_item+0x49/0xc0 [nvidia]
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.011434]
[] ? os_execute_work_item+0x40/0x60 [nvidia]
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.011438]
[] ? process_one_work+0x143/0x430
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.011441]
[] ? worker_thread+0x113/0x4f0
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.011444]
[] ? rescuer_thread+0x2d0/0x2d0
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.011449]
[] ? kthread+0xbd/0xe0
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.011453]
[] ? kthread_create_on_node+0x180/0x180
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.011458]
[] ? ret_from_fork+0x58/0x90
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.011462]
[] ? kthread_create_on_node+0x180/0x180
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.011466] INFO: task
kworker/9:0:8943 blocked for more than 120 seconds.
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.011480] Tainted: P
O 3.16.0-4-amd64 #1
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.011491] “echo 0 >
/proc/sys/kernel/hung_task_timeout_secs” disables this message.
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.011507] kworker/9:0 D
ffff882036dd9a88 0 8943 2 0x00000000
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.011597] Workqueue: events
os_execute_work_item [nvidia]
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.011599] ffff882036dd9630
0000000000000046 0000000000012f40 ffff881ffb8effd8
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.011603] 0000000000012f40
ffff882036dd9630 ffff882036378608 ffff881ffb8efc68
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.011606] 7fffffffffffffff
0000000000000002 0000000000000000 ffff882036dd9630
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.011609] Call Trace:
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.011614]
[] ? schedule_timeout+0x259/0x2d0
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.011618]
[] ? __switch_to+0xde/0x5a0
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.011622]
[] ? __schedule+0x2b1/0x6f0
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.011626]
[] ? __down_common+0xa0/0xf3
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.011630]
[] ? down+0x3b/0x50
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.011718]
[] ? os_acquire_mutex+0x43/0x50 [nvidia]
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.011888]
[] ? _nv017509rm+0x18/0x30 [nvidia]
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.012053]
[] ? _nv019728rm+0x3d/0x120 [nvidia]
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.012311]
[] ? _nv006687rm+0x889/0xa60 [nvidia]
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.012518]
[] ? _nv000785rm+0x225/0xcd0 [nvidia]
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.012720]
[] ? rm_execute_work_item+0x49/0xc0 [nvidia]
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.012862]
[] ? os_execute_work_item+0x40/0x60 [nvidia]
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.012881]
[] ? process_one_work+0x143/0x430
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.012898]
[] ? worker_thread+0x113/0x4f0
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.012915]
[] ? rescuer_thread+0x2d0/0x2d0
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.012934]
[] ? kthread+0xbd/0xe0
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.012952]
[] ? kthread_create_on_node+0x180/0x180
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.012971]
[] ? ret_from_fork+0x58/0x90
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.012983]
[] ? kthread_create_on_node+0x180/0x180
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.012988] INFO: task
kworker/9:1:8967 blocked for more than 120 seconds.
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.013004] Tainted: P
O 3.16.0-4-amd64 #1
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.013019] “echo 0 >
/proc/sys/kernel/hung_task_timeout_secs” disables this message.
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.013036] kworker/9:1 D
ffff88088fa32568 0 8967 2 0x00000000
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.013161] Workqueue: events
os_execute_work_item [nvidia]
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.013164] ffff88088fa32110
0000000000000046 0000000000012f40 ffff880690623fd8
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.013186] 0000000000012f40
ffff88088fa32110 ffff882036378608 ffff880690623c58
Jun 18 15:16:15 deep-novo-2 kernel: [ 2279.013191] 7fffffffffffffff
0000000000000002 0000000000000000 ffff88088fa32110

… and a couple of more repeats of this…

Cheers,

—david