Latest driver for GTX 1080 Ti blocks TensorFlow processes?

I’m working on a TensorFlow application (using an NVIDIA GPU) in the following environment:

  • OS: Ubuntu 16.04.02 LTS
  • GPU: GeForce GTX 1080 Ti
  • NVIDIA driver: 384.59
  • CUDA ver.: 8.0.61_375.26
  • cuDNN ver.: 5.1
  • TensorFlow ver.: 1.2.1
  • Application language: Python

This application is run by cron, and it occasionally hangs (about 5 times a month so far).
When this happened, a process named “[irq/125-nvidia]” was fully using one CPU core, and I found the following messages in /var/log/kern.log.
Does anyone know how to deal with this problem? Or should I ask the TensorFlow team instead?

Thanks.

Aug  6 11:44:07 hostname kernel: [688298.871282] NVRM: GPU at PCI:0000:01:00: GPU-628e8113-4b1e-ddf8-259c-b9e2e7653f7b
Aug  6 11:44:07 hostname kernel: [688298.871289] NVRM: GPU Board Serial Number:
Aug  6 11:44:07 hostname kernel: [688298.871295] NVRM: Xid (PCI:0000:01:00): 44, Ch 00000001, engmask 00000101, intr 10000000
Aug  6 11:47:33 hostname kernel: [688504.868263] INFO: task kworker/4:2:11603 blocked for more than 120 seconds.
Aug  6 11:47:33 hostname kernel: [688504.868270]       Tainted: P           OE   4.4.0-45-generic #66-Ubuntu
Aug  6 11:47:33 hostname kernel: [688504.868273] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug  6 11:47:33 hostname kernel: [688504.868276] kworker/4:2     D ffff880403c27b78     0 11603      2 0x00000000
Aug  6 11:47:33 hostname kernel: [688504.868616] Workqueue: events os_execute_work_item [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.868621]  ffff880403c27b78 ffff8803b60a4008 ffff8800ad79bb00 ffff8803ff648000
Aug  6 11:47:33 hostname kernel: [688504.868627]  ffff880403c28000 ffff8800bffe0e28 ffff8803ff648000 ffff880403c27e18
Aug  6 11:47:33 hostname kernel: [688504.868632]  ffff8803fe8cf788 ffff880403c27b90 ffffffff8182d7c5 7fffffffffffffff
Aug  6 11:47:33 hostname kernel: [688504.868637] Call Trace:
Aug  6 11:47:33 hostname kernel: [688504.868650]  [<ffffffff8182d7c5>] schedule+0x35/0x80
Aug  6 11:47:33 hostname kernel: [688504.868656]  [<ffffffff818308e5>] schedule_timeout+0x1b5/0x270
Aug  6 11:47:33 hostname kernel: [688504.868662]  [<ffffffff810b4ec3>] ? update_curr+0xe3/0x160
Aug  6 11:47:33 hostname kernel: [688504.868670]  [<ffffffff810b27bc>] ? __enqueue_entity+0x6c/0x70
Aug  6 11:47:33 hostname kernel: [688504.868675]  [<ffffffff810b9597>] ? put_prev_entity+0x97/0x7d0
Aug  6 11:47:33 hostname kernel: [688504.868680]  [<ffffffff8182f87f>] __down+0x7f/0xd0
Aug  6 11:47:33 hostname kernel: [688504.868685]  [<ffffffff810ca131>] down+0x41/0x50
Aug  6 11:47:33 hostname kernel: [688504.868992]  [<ffffffffc0665d87>] os_acquire_semaphore+0x37/0x40 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.869272]  [<ffffffffc0665d9e>] os_acquire_mutex+0xe/0x10 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.869760]  [<ffffffffc0c320cc>] _nv031494rm+0x5c/0x120 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.870472]  [<ffffffffc0a95428>] ? _nv006986rm+0x38/0x120 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.870911]  [<ffffffffc0cad6db>] ? _nv001136rm+0x6b/0xd0 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.871347]  [<ffffffffc0cb1409>] ? rm_execute_work_item+0x49/0xc0 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.871632]  [<ffffffffc0666101>] ? os_free_mutex+0x1/0x20 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.871912]  [<ffffffffc0666166>] ? os_execute_work_item+0x46/0x70 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.871921]  [<ffffffff8109a3e5>] ? process_one_work+0x165/0x480
Aug  6 11:47:33 hostname kernel: [688504.871927]  [<ffffffff8109a74b>] ? worker_thread+0x4b/0x4c0
Aug  6 11:47:33 hostname kernel: [688504.871933]  [<ffffffff8109a700>] ? process_one_work+0x480/0x480
Aug  6 11:47:33 hostname kernel: [688504.871938]  [<ffffffff810a0928>] ? kthread+0xd8/0xf0
Aug  6 11:47:33 hostname kernel: [688504.871943]  [<ffffffff810a0850>] ? kthread_create_on_node+0x1e0/0x1e0
Aug  6 11:47:33 hostname kernel: [688504.871949]  [<ffffffff81831c4f>] ? ret_from_fork+0x3f/0x70
Aug  6 11:47:33 hostname kernel: [688504.871953]  [<ffffffff810a0850>] ? kthread_create_on_node+0x1e0/0x1e0
Aug  6 11:47:33 hostname kernel: [688504.871959] INFO: task kworker/4:3:11619 blocked for more than 120 seconds.
Aug  6 11:47:33 hostname kernel: [688504.871963]       Tainted: P           OE   4.4.0-45-generic #66-Ubuntu
Aug  6 11:47:33 hostname kernel: [688504.871966] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug  6 11:47:33 hostname kernel: [688504.871969] kworker/4:3     D ffff88009b61fb78     0 11619      2 0x00000000
Aug  6 11:47:33 hostname kernel: [688504.872252] Workqueue: events os_execute_work_item [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.872300]  ffff88009b61fb78 0000000000000000 ffff8803ff648000 ffff880392b10ec0
Aug  6 11:47:33 hostname kernel: [688504.872309]  ffff88009b620000 ffff8800bffe0e28 ffff880392b10ec0 ffff88009b61fe18
Aug  6 11:47:33 hostname kernel: [688504.872317]  ffff8803fe8cf748 ffff88009b61fb90 ffffffff8182d7c5 7fffffffffffffff
Aug  6 11:47:33 hostname kernel: [688504.872331] Call Trace:
Aug  6 11:47:33 hostname kernel: [688504.872349]  [<ffffffff8182d7c5>] schedule+0x35/0x80
Aug  6 11:47:33 hostname kernel: [688504.872366]  [<ffffffff818308e5>] schedule_timeout+0x1b5/0x270
Aug  6 11:47:33 hostname kernel: [688504.872377]  [<ffffffff8182f87f>] __down+0x7f/0xd0
Aug  6 11:47:33 hostname kernel: [688504.872388]  [<ffffffff810ca131>] down+0x41/0x50
Aug  6 11:47:33 hostname kernel: [688504.872640]  [<ffffffffc0665d87>] os_acquire_semaphore+0x37/0x40 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.872898]  [<ffffffffc0665d9e>] os_acquire_mutex+0xe/0x10 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.873368]  [<ffffffffc0c320cc>] _nv031494rm+0x5c/0x120 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.874053]  [<ffffffffc0a95428>] ? _nv006986rm+0x38/0x120 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.874485]  [<ffffffffc0cad6db>] ? _nv001136rm+0x6b/0xd0 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.874917]  [<ffffffffc0cb1409>] ? rm_execute_work_item+0x49/0xc0 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.875178]  [<ffffffffc0666100>] ? os_free_mem+0x30/0x30 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.875414]  [<ffffffffc0666166>] ? os_execute_work_item+0x46/0x70 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.875422]  [<ffffffff8109a3e5>] ? process_one_work+0x165/0x480
Aug  6 11:47:33 hostname kernel: [688504.875428]  [<ffffffff8109a74b>] ? worker_thread+0x4b/0x4c0
Aug  6 11:47:33 hostname kernel: [688504.875433]  [<ffffffff8109a700>] ? process_one_work+0x480/0x480
Aug  6 11:47:33 hostname kernel: [688504.875439]  [<ffffffff8109a700>] ? process_one_work+0x480/0x480
Aug  6 11:47:33 hostname kernel: [688504.875445]  [<ffffffff810a0928>] ? kthread+0xd8/0xf0
Aug  6 11:47:33 hostname kernel: [688504.875450]  [<ffffffff810a0850>] ? kthread_create_on_node+0x1e0/0x1e0
Aug  6 11:47:33 hostname kernel: [688504.875456]  [<ffffffff81831c4f>] ? ret_from_fork+0x3f/0x70
Aug  6 11:47:33 hostname kernel: [688504.875460]  [<ffffffff810a0850>] ? kthread_create_on_node+0x1e0/0x1e0
Aug  6 11:47:33 hostname kernel: [688504.875469] INFO: task python:12954 blocked for more than 120 seconds.
Aug  6 11:47:33 hostname kernel: [688504.875473]       Tainted: P           OE   4.4.0-45-generic #66-Ubuntu
Aug  6 11:47:33 hostname kernel: [688504.875475] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug  6 11:47:33 hostname kernel: [688504.875477] python          D ffff8803fe7b3928     0 12954  12953 0x00000000
Aug  6 11:47:33 hostname kernel: [688504.875483]  ffff8803fe7b3928 ffffffffc0bd530c ffff88042470bb00 ffff880421a4ac40
Aug  6 11:47:33 hostname kernel: [688504.875491]  ffff8803fe7b4000 ffff8800bffe0e28 ffff880421a4ac40 ffff8803fe7b3bd0
Aug  6 11:47:33 hostname kernel: [688504.875497]  ffff880403044008 ffff8803fe7b3940 ffffffff8182d7c5 7fffffffffffffff
Aug  6 11:47:33 hostname kernel: [688504.875502] Call Trace:
Aug  6 11:47:33 hostname kernel: [688504.875994]  [<ffffffffc0bd530c>] ? _nv030836rm+0xc/0x20 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.876004]  [<ffffffff8182d7c5>] schedule+0x35/0x80
Aug  6 11:47:33 hostname kernel: [688504.876009]  [<ffffffff818308e5>] schedule_timeout+0x1b5/0x270
Aug  6 11:47:33 hostname kernel: [688504.876496]  [<ffffffffc0c06998>] ? _nv020294rm+0x8/0x40 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.876976]  [<ffffffffc0c0737d>] ? _nv020343rm+0xd/0x40 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.877458]  [<ffffffffc0c074f4>] ? _nv020376rm+0x34/0xd0 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.877928]  [<ffffffffc0c32f59>] ? _nv006261rm+0x109/0x240 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.877945]  [<ffffffff8182f87f>] __down+0x7f/0xd0
Aug  6 11:47:33 hostname kernel: [688504.877956]  [<ffffffff810ca131>] down+0x41/0x50
Aug  6 11:47:33 hostname kernel: [688504.878198]  [<ffffffffc0665d87>] os_acquire_semaphore+0x37/0x40 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.878456]  [<ffffffffc0665d9e>] os_acquire_mutex+0xe/0x10 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.878919]  [<ffffffffc0c320cc>] _nv031494rm+0x5c/0x120 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.879340]  [<ffffffffc0cb3a30>] ? rm_get_gpu_uuid_raw+0x70/0x120 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.879349]  [<ffffffff810ac101>] ? try_to_wake_up+0x361/0x3b0
Aug  6 11:47:33 hostname kernel: [688504.879593]  [<ffffffffc065b999>] ? nv_open_device+0x579/0x700 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.879839]  [<ffffffffc065be8d>] ? nvidia_open+0x14d/0x2f0 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.880084]  [<ffffffffc065a328>] ? nvidia_frontend_open+0x58/0xa0 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.880093]  [<ffffffff8121218f>] ? chrdev_open+0xbf/0x1b0
Aug  6 11:47:33 hostname kernel: [688504.880109]  [<ffffffff8120b2ef>] ? do_dentry_open+0x1ff/0x310
Aug  6 11:47:33 hostname kernel: [688504.880114]  [<ffffffff812120d0>] ? cdev_put+0x30/0x30
Aug  6 11:47:33 hostname kernel: [688504.880120]  [<ffffffff8120c484>] ? vfs_open+0x54/0x80
Aug  6 11:47:33 hostname kernel: [688504.880127]  [<ffffffff812180eb>] ? may_open+0x5b/0xf0
Aug  6 11:47:33 hostname kernel: [688504.880135]  [<ffffffff8121bc97>] ? path_openat+0x1b7/0x1330
Aug  6 11:47:33 hostname kernel: [688504.880141]  [<ffffffff8121cf84>] ? putname+0x54/0x60
Aug  6 11:47:33 hostname kernel: [688504.880149]  [<ffffffff8121e001>] ? do_filp_open+0x91/0x100
Aug  6 11:47:33 hostname kernel: [688504.880157]  [<ffffffff8122b8c6>] ? __alloc_fd+0x46/0x190
Aug  6 11:47:33 hostname kernel: [688504.880162]  [<ffffffff8120c858>] ? do_sys_open+0x138/0x2a0
Aug  6 11:47:33 hostname kernel: [688504.880168]  [<ffffffff8120c9de>] ? SyS_open+0x1e/0x20
Aug  6 11:47:33 hostname kernel: [688504.880174]  [<ffffffff818318b2>] ? entry_SYSCALL_64_fastpath+0x16/0x71
Aug  6 11:47:33 hostname kernel: [688504.880182] INFO: task python:12956 blocked for more than 120 seconds.
Aug  6 11:47:33 hostname kernel: [688504.880209]       Tainted: P           OE   4.4.0-45-generic #66-Ubuntu
Aug  6 11:47:33 hostname kernel: [688504.880220] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug  6 11:47:33 hostname kernel: [688504.880228] python          D ffff8803c793bad8     0 12956  12955 0x00000000
Aug  6 11:47:33 hostname kernel: [688504.880249]  ffff8803c793bad8 00000000022152c0 ffffffff81e11500 ffff880421f249c0
Aug  6 11:47:33 hostname kernel: [688504.880266]  ffff8803c793c000 ffffffffc11740c0 ffff880421f249c0 ffff880423755b00
Aug  6 11:47:33 hostname kernel: [688504.880271]  ffffffff821cd240 ffff8803c793baf0 ffffffff8182d7c5 7fffffffffffffff
Aug  6 11:47:33 hostname kernel: [688504.880275] Call Trace:
Aug  6 11:47:33 hostname kernel: [688504.880281]  [<ffffffff8182d7c5>] schedule+0x35/0x80
Aug  6 11:47:33 hostname kernel: [688504.880286]  [<ffffffff818308e5>] schedule_timeout+0x1b5/0x270
Aug  6 11:47:33 hostname kernel: [688504.880291]  [<ffffffff81225557>] ? __d_instantiate+0x97/0xf0
Aug  6 11:47:33 hostname kernel: [688504.880295]  [<ffffffff8182f87f>] __down+0x7f/0xd0
Aug  6 11:47:33 hostname kernel: [688504.880299]  [<ffffffff810ca131>] down+0x41/0x50
Aug  6 11:47:33 hostname kernel: [688504.880495]  [<ffffffffc065a2f5>] nvidia_frontend_open+0x25/0xa0 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.880511]  [<ffffffff8121218f>] chrdev_open+0xbf/0x1b0
Aug  6 11:47:33 hostname kernel: [688504.880520]  [<ffffffff8120b2ef>] do_dentry_open+0x1ff/0x310
Aug  6 11:47:33 hostname kernel: [688504.880533]  [<ffffffff812120d0>] ? cdev_put+0x30/0x30
Aug  6 11:47:33 hostname kernel: [688504.880537]  [<ffffffff8120c484>] vfs_open+0x54/0x80
Aug  6 11:47:33 hostname kernel: [688504.880542]  [<ffffffff812180eb>] ? may_open+0x5b/0xf0
Aug  6 11:47:33 hostname kernel: [688504.880548]  [<ffffffff8121bc97>] path_openat+0x1b7/0x1330
Aug  6 11:47:33 hostname kernel: [688504.880554]  [<ffffffff8121cf84>] ? putname+0x54/0x60
Aug  6 11:47:33 hostname kernel: [688504.880560]  [<ffffffff8121e001>] do_filp_open+0x91/0x100
Aug  6 11:47:33 hostname kernel: [688504.880564]  [<ffffffff8122b8c6>] ? __alloc_fd+0x46/0x190
Aug  6 11:47:33 hostname kernel: [688504.880569]  [<ffffffff8120c858>] do_sys_open+0x138/0x2a0
Aug  6 11:47:33 hostname kernel: [688504.880575]  [<ffffffff8120c9de>] SyS_open+0x1e/0x20
Aug  6 11:47:33 hostname kernel: [688504.880584]  [<ffffffff818318b2>] entry_SYSCALL_64_fastpath+0x16/0x71
Aug  6 11:47:33 hostname kernel: [688504.880596] INFO: task kworker/4:0:13016 blocked for more than 120 seconds.
Aug  6 11:47:33 hostname kernel: [688504.880604]       Tainted: P           OE   4.4.0-45-generic #66-Ubuntu
Aug  6 11:47:33 hostname kernel: [688504.880610] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug  6 11:47:33 hostname kernel: [688504.880612] kworker/4:0     D ffff8803b7c5bb88     0 13016      2 0x00000000
Aug  6 11:47:33 hostname kernel: [688504.880817] Workqueue: events os_execute_work_item [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.880826]  ffff8803b7c5bb88 ffff8803b7c5bb58 ffff8800ad79c9c0 ffff8800ad79bb00
Aug  6 11:47:33 hostname kernel: [688504.880830]  ffff8803b7c5c000 ffff8800bffe0e28 ffff8800ad79bb00 ffff8803b7c5be18
Aug  6 11:47:33 hostname kernel: [688504.880844]  ffff8803fe8cf7c8 ffff8803b7c5bba0 ffffffff8182d7c5 7fffffffffffffff
Aug  6 11:47:33 hostname kernel: [688504.880848] Call Trace:
Aug  6 11:47:33 hostname kernel: [688504.880858]  [<ffffffff8182d7c5>] schedule+0x35/0x80
Aug  6 11:47:33 hostname kernel: [688504.880862]  [<ffffffff818308e5>] schedule_timeout+0x1b5/0x270
Aug  6 11:47:33 hostname kernel: [688504.880869]  [<ffffffff8182d116>] ? __schedule+0x3b6/0xa30
Aug  6 11:47:33 hostname kernel: [688504.880873]  [<ffffffff8182f87f>] __down+0x7f/0xd0
Aug  6 11:47:33 hostname kernel: [688504.880878]  [<ffffffff810f5d00>] ? __getnstimeofday64+0x60/0xd0
Aug  6 11:47:33 hostname kernel: [688504.880882]  [<ffffffff810ca131>] down+0x41/0x50
Aug  6 11:47:33 hostname kernel: [688504.881083]  [<ffffffffc0665d87>] os_acquire_semaphore+0x37/0x40 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.881269]  [<ffffffffc0665d9e>] os_acquire_mutex+0xe/0x10 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.881637]  [<ffffffffc0c320cc>] _nv031494rm+0x5c/0x120 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.881959]  [<ffffffffc0803799>] ? _nv012470rm+0x29/0x120 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.882298]  [<ffffffffc0cad6db>] ? _nv001136rm+0x6b/0xd0 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.882637]  [<ffffffffc0cb1409>] ? rm_execute_work_item+0x49/0xc0 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.882825]  [<ffffffffc0666100>] ? os_free_mem+0x30/0x30 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.883010]  [<ffffffffc0666166>] ? os_execute_work_item+0x46/0x70 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.883018]  [<ffffffff8109a3e5>] ? process_one_work+0x165/0x480
Aug  6 11:47:33 hostname kernel: [688504.883023]  [<ffffffff8109a74b>] ? worker_thread+0x4b/0x4c0
Aug  6 11:47:33 hostname kernel: [688504.883027]  [<ffffffff8109a700>] ? process_one_work+0x480/0x480
Aug  6 11:47:33 hostname kernel: [688504.883031]  [<ffffffff810a0928>] ? kthread+0xd8/0xf0
Aug  6 11:47:33 hostname kernel: [688504.883036]  [<ffffffff810a0850>] ? kthread_create_on_node+0x1e0/0x1e0
Aug  6 11:47:33 hostname kernel: [688504.883041]  [<ffffffff81831c4f>] ? ret_from_fork+0x3f/0x70
Aug  6 11:47:33 hostname kernel: [688504.883044]  [<ffffffff810a0850>] ? kthread_create_on_node+0x1e0/0x1e0
Aug  6 11:47:33 hostname kernel: [688504.883049] INFO: task kworker/4:1:13017 blocked for more than 120 seconds.
Aug  6 11:47:33 hostname kernel: [688504.883052]       Tainted: P           OE   4.4.0-45-generic #66-Ubuntu
Aug  6 11:47:33 hostname kernel: [688504.883056] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug  6 11:47:33 hostname kernel: [688504.883058] kworker/4:1     D ffff88040443fb88     0 13017      2 0x00000000
Aug  6 11:47:33 hostname kernel: [688504.883258] Workqueue: events os_execute_work_item [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.883263]  ffff88040443fb88 0000000000000001 ffff8800ad79d880 ffff8800ad79c9c0
Aug  6 11:47:33 hostname kernel: [688504.883267]  ffff880404440000 ffff8800bffe0e28 ffff8800ad79c9c0 ffff88040443fe18
Aug  6 11:47:33 hostname kernel: [688504.883271]  ffff8803fe8cf108 ffff88040443fba0 ffffffff8182d7c5 7fffffffffffffff
Aug  6 11:47:33 hostname kernel: [688504.883275] Call Trace:
Aug  6 11:47:33 hostname kernel: [688504.883281]  [<ffffffff8182d7c5>] schedule+0x35/0x80
Aug  6 11:47:33 hostname kernel: [688504.883287]  [<ffffffff818308e5>] schedule_timeout+0x1b5/0x270
Aug  6 11:47:33 hostname kernel: [688504.883293]  [<ffffffff810b27bc>] ? __enqueue_entity+0x6c/0x70
Aug  6 11:47:33 hostname kernel: [688504.883296]  [<ffffffff810b9597>] ? put_prev_entity+0x97/0x7d0
Aug  6 11:47:33 hostname kernel: [688504.883302]  [<ffffffff8102d66c>] ? __switch_to+0x1dc/0x5c0
Aug  6 11:47:33 hostname kernel: [688504.883307]  [<ffffffff8182f87f>] __down+0x7f/0xd0
Aug  6 11:47:33 hostname kernel: [688504.883312]  [<ffffffff810f5d00>] ? __getnstimeofday64+0x60/0xd0
Aug  6 11:47:33 hostname kernel: [688504.883315]  [<ffffffff810ca131>] down+0x41/0x50
Aug  6 11:47:33 hostname kernel: [688504.883514]  [<ffffffffc0665d87>] os_acquire_semaphore+0x37/0x40 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.883713]  [<ffffffffc0665d9e>] os_acquire_mutex+0xe/0x10 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.884084]  [<ffffffffc0c320cc>] _nv031494rm+0x5c/0x120 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.884405]  [<ffffffffc0803799>] ? _nv012470rm+0x29/0x120 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.884749]  [<ffffffffc0cad6db>] ? _nv001136rm+0x6b/0xd0 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.885099]  [<ffffffffc0cb1409>] ? rm_execute_work_item+0x49/0xc0 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.885295]  [<ffffffffc0666100>] ? os_free_mem+0x30/0x30 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.885489]  [<ffffffffc0666166>] ? os_execute_work_item+0x46/0x70 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.885504]  [<ffffffff8109a3e5>] ? process_one_work+0x165/0x480
Aug  6 11:47:33 hostname kernel: [688504.885514]  [<ffffffff8109a74b>] ? worker_thread+0x4b/0x4c0
Aug  6 11:47:33 hostname kernel: [688504.885529]  [<ffffffff8109a700>] ? process_one_work+0x480/0x480
Aug  6 11:47:33 hostname kernel: [688504.885543]  [<ffffffff810a0928>] ? kthread+0xd8/0xf0
Aug  6 11:47:33 hostname kernel: [688504.885551]  [<ffffffff810a0850>] ? kthread_create_on_node+0x1e0/0x1e0
Aug  6 11:47:33 hostname kernel: [688504.885565]  [<ffffffff81831c4f>] ? ret_from_fork+0x3f/0x70
Aug  6 11:47:33 hostname kernel: [688504.885573]  [<ffffffff810a0850>] ? kthread_create_on_node+0x1e0/0x1e0
Aug  6 11:47:33 hostname kernel: [688504.885584] INFO: task kworker/4:4:13018 blocked for more than 120 seconds.
Aug  6 11:47:33 hostname kernel: [688504.885592]       Tainted: P           OE   4.4.0-45-generic #66-Ubuntu
Aug  6 11:47:33 hostname kernel: [688504.885599] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug  6 11:47:33 hostname kernel: [688504.885609] kworker/4:4     D ffff880404433b88     0 13018      2 0x00000000
Aug  6 11:47:33 hostname kernel: [688504.885809] Workqueue: events os_execute_work_item [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.885816]  ffff880404433b88 0000000000000001 ffff8800ad79e740 ffff8800ad79d880
Aug  6 11:47:33 hostname kernel: [688504.885824]  ffff880404434000 ffff8800bffe0e28 ffff8800ad79d880 ffff880404433e18
Aug  6 11:47:33 hostname kernel: [688504.885830]  ffff8803fe8cf0c8 ffff880404433ba0 ffffffff8182d7c5 7fffffffffffffff
Aug  6 11:47:33 hostname kernel: [688504.885840] Call Trace:
Aug  6 11:47:33 hostname kernel: [688504.885846]  [<ffffffff8182d7c5>] schedule+0x35/0x80
Aug  6 11:47:33 hostname kernel: [688504.885850]  [<ffffffff818308e5>] schedule_timeout+0x1b5/0x270
Aug  6 11:47:33 hostname kernel: [688504.885856]  [<ffffffff810b27bc>] ? __enqueue_entity+0x6c/0x70
Aug  6 11:47:33 hostname kernel: [688504.885861]  [<ffffffff810b9597>] ? put_prev_entity+0x97/0x7d0
Aug  6 11:47:33 hostname kernel: [688504.885865]  [<ffffffff8182f87f>] __down+0x7f/0xd0
Aug  6 11:47:33 hostname kernel: [688504.885870]  [<ffffffff810f5d00>] ? __getnstimeofday64+0x60/0xd0
Aug  6 11:47:33 hostname kernel: [688504.885873]  [<ffffffff810ca131>] down+0x41/0x50
Aug  6 11:47:33 hostname kernel: [688504.886059]  [<ffffffffc0665d87>] os_acquire_semaphore+0x37/0x40 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.886244]  [<ffffffffc0665d9e>] os_acquire_mutex+0xe/0x10 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.886613]  [<ffffffffc0c320cc>] _nv031494rm+0x5c/0x120 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.886936]  [<ffffffffc0803799>] ? _nv012470rm+0x29/0x120 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.887274]  [<ffffffffc0cad6db>] ? _nv001136rm+0x6b/0xd0 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.887612]  [<ffffffffc0cb1409>] ? rm_execute_work_item+0x49/0xc0 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.887815]  [<ffffffffc0666100>] ? os_free_mem+0x30/0x30 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.888002]  [<ffffffffc0666166>] ? os_execute_work_item+0x46/0x70 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.888010]  [<ffffffff8109a3e5>] ? process_one_work+0x165/0x480
Aug  6 11:47:33 hostname kernel: [688504.888014]  [<ffffffff8109a74b>] ? worker_thread+0x4b/0x4c0
Aug  6 11:47:33 hostname kernel: [688504.888019]  [<ffffffff8109a700>] ? process_one_work+0x480/0x480
Aug  6 11:47:33 hostname kernel: [688504.888023]  [<ffffffff810a0928>] ? kthread+0xd8/0xf0
Aug  6 11:47:33 hostname kernel: [688504.888028]  [<ffffffff810a0850>] ? kthread_create_on_node+0x1e0/0x1e0
Aug  6 11:47:33 hostname kernel: [688504.888033]  [<ffffffff81831c4f>] ? ret_from_fork+0x3f/0x70
Aug  6 11:47:33 hostname kernel: [688504.888036]  [<ffffffff810a0850>] ? kthread_create_on_node+0x1e0/0x1e0
Aug  6 11:47:33 hostname kernel: [688504.888040] INFO: task kworker/4:5:13019 blocked for more than 120 seconds.
Aug  6 11:47:33 hostname kernel: [688504.888042]       Tainted: P           OE   4.4.0-45-generic #66-Ubuntu
Aug  6 11:47:33 hostname kernel: [688504.888046] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug  6 11:47:33 hostname kernel: [688504.888048] kworker/4:5     D ffff8804237f7b88     0 13019      2 0x00000000
Aug  6 11:47:33 hostname kernel: [688504.888237] Workqueue: events os_execute_work_item [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.888242]  ffff8804237f7b88 ffffffff810b9535 ffff8800ad798000 ffff8800ad79e740
Aug  6 11:47:33 hostname kernel: [688504.888244]  ffff8804237f8000 ffff8800bffe0e28 ffff8800ad79e740 ffff8804237f7e18
Aug  6 11:47:33 hostname kernel: [688504.888247]  ffff8803fe8cf088 ffff8804237f7ba0 ffffffff8182d7c5 7fffffffffffffff
Aug  6 11:47:33 hostname kernel: [688504.888250] Call Trace:
Aug  6 11:47:33 hostname kernel: [688504.888252]  [<ffffffff810b9535>] ? put_prev_entity+0x35/0x7d0
Aug  6 11:47:33 hostname kernel: [688504.888257]  [<ffffffff8182d7c5>] schedule+0x35/0x80
Aug  6 11:47:33 hostname kernel: [688504.888259]  [<ffffffff818308e5>] schedule_timeout+0x1b5/0x270
Aug  6 11:47:33 hostname kernel: [688504.888262]  [<ffffffff8182d116>] ? __schedule+0x3b6/0xa30
Aug  6 11:47:33 hostname kernel: [688504.888265]  [<ffffffff8182f87f>] __down+0x7f/0xd0
Aug  6 11:47:33 hostname kernel: [688504.888268]  [<ffffffff810f5d00>] ? __getnstimeofday64+0x60/0xd0
Aug  6 11:47:33 hostname kernel: [688504.888271]  [<ffffffff810ca131>] down+0x41/0x50
Aug  6 11:47:33 hostname kernel: [688504.888365]  [<ffffffffc0665d87>] os_acquire_semaphore+0x37/0x40 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.888431]  [<ffffffffc0665d9e>] os_acquire_mutex+0xe/0x10 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.888557]  [<ffffffffc0c320cc>] _nv031494rm+0x5c/0x120 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.888666]  [<ffffffffc0803799>] ? _nv012470rm+0x29/0x120 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.888781]  [<ffffffffc0cad6db>] ? _nv001136rm+0x6b/0xd0 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.888896]  [<ffffffffc0cb1409>] ? rm_execute_work_item+0x49/0xc0 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.888966]  [<ffffffffc0666100>] ? os_free_mem+0x30/0x30 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.889030]  [<ffffffffc0666166>] ? os_execute_work_item+0x46/0x70 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.889035]  [<ffffffff8109a3e5>] ? process_one_work+0x165/0x480
Aug  6 11:47:33 hostname kernel: [688504.889038]  [<ffffffff8109a74b>] ? worker_thread+0x4b/0x4c0
Aug  6 11:47:33 hostname kernel: [688504.889042]  [<ffffffff8109a700>] ? process_one_work+0x480/0x480
Aug  6 11:47:33 hostname kernel: [688504.889045]  [<ffffffff810a0928>] ? kthread+0xd8/0xf0
Aug  6 11:47:33 hostname kernel: [688504.889048]  [<ffffffff810a0850>] ? kthread_create_on_node+0x1e0/0x1e0
Aug  6 11:47:33 hostname kernel: [688504.889049]  [<ffffffff81831c4f>] ? ret_from_fork+0x3f/0x70
Aug  6 11:47:33 hostname kernel: [688504.889051]  [<ffffffff810a0850>] ? kthread_create_on_node+0x1e0/0x1e0
Aug  6 11:47:33 hostname kernel: [688504.889052] INFO: task kworker/4:6:13020 blocked for more than 120 seconds.
Aug  6 11:47:33 hostname kernel: [688504.889053]       Tainted: P           OE   4.4.0-45-generic #66-Ubuntu
Aug  6 11:47:33 hostname kernel: [688504.889054] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug  6 11:47:33 hostname kernel: [688504.889055] kworker/4:6     D ffff88041f4a7b88     0 13020      2 0x00000000
Aug  6 11:47:33 hostname kernel: [688504.889116] Workqueue: events os_execute_work_item [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.889121]  ffff88041f4a7b88 ffffffff810b9535 ffff8800ad798ec0 ffff8800ad798000
Aug  6 11:47:33 hostname kernel: [688504.889125]  ffff88041f4a8000 ffff8800bffe0e28 ffff8800ad798000 ffff88041f4a7e18
Aug  6 11:47:33 hostname kernel: [688504.889126]  ffff8803fe8cf048 ffff88041f4a7ba0 ffffffff8182d7c5 7fffffffffffffff
Aug  6 11:47:33 hostname kernel: [688504.889127] Call Trace:
Aug  6 11:47:33 hostname kernel: [688504.889129]  [<ffffffff810b9535>] ? put_prev_entity+0x35/0x7d0
Aug  6 11:47:33 hostname kernel: [688504.889132]  [<ffffffff8182d7c5>] schedule+0x35/0x80
Aug  6 11:47:33 hostname kernel: [688504.889136]  [<ffffffff818308e5>] schedule_timeout+0x1b5/0x270
Aug  6 11:47:33 hostname kernel: [688504.889137]  [<ffffffff8182f87f>] __down+0x7f/0xd0
Aug  6 11:47:33 hostname kernel: [688504.889139]  [<ffffffff810f5d00>] ? __getnstimeofday64+0x60/0xd0
Aug  6 11:47:33 hostname kernel: [688504.889143]  [<ffffffff810ca131>] down+0x41/0x50
Aug  6 11:47:33 hostname kernel: [688504.889204]  [<ffffffffc0665d87>] os_acquire_semaphore+0x37/0x40 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.889267]  [<ffffffffc0665d9e>] os_acquire_mutex+0xe/0x10 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.889390]  [<ffffffffc0c320cc>] _nv031494rm+0x5c/0x120 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.889496]  [<ffffffffc0803799>] ? _nv012470rm+0x29/0x120 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.889609]  [<ffffffffc0cad6db>] ? _nv001136rm+0x6b/0xd0 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.889722]  [<ffffffffc0cb1409>] ? rm_execute_work_item+0x49/0xc0 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.889784]  [<ffffffffc0666100>] ? os_free_mem+0x30/0x30 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.889846]  [<ffffffffc0666166>] ? os_execute_work_item+0x46/0x70 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.889848]  [<ffffffff8109a3e5>] ? process_one_work+0x165/0x480
Aug  6 11:47:33 hostname kernel: [688504.889850]  [<ffffffff8109a74b>] ? worker_thread+0x4b/0x4c0
Aug  6 11:47:33 hostname kernel: [688504.889851]  [<ffffffff8109a700>] ? process_one_work+0x480/0x480
Aug  6 11:47:33 hostname kernel: [688504.889852]  [<ffffffff810a0928>] ? kthread+0xd8/0xf0
Aug  6 11:47:33 hostname kernel: [688504.889854]  [<ffffffff810a0850>] ? kthread_create_on_node+0x1e0/0x1e0
Aug  6 11:47:33 hostname kernel: [688504.889856]  [<ffffffff81831c4f>] ? ret_from_fork+0x3f/0x70
Aug  6 11:47:33 hostname kernel: [688504.889857]  [<ffffffff810a0850>] ? kthread_create_on_node+0x1e0/0x1e0
Aug  6 11:47:33 hostname kernel: [688504.889858] INFO: task kworker/4:7:13021 blocked for more than 120 seconds.
Aug  6 11:47:33 hostname kernel: [688504.889859]       Tainted: P           OE   4.4.0-45-generic #66-Ubuntu
Aug  6 11:47:33 hostname kernel: [688504.889860] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug  6 11:47:33 hostname kernel: [688504.889861] kworker/4:7     D ffff88042476fb88     0 13021      2 0x00000000
Aug  6 11:47:33 hostname kernel: [688504.889922] Workqueue: events os_execute_work_item [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.889923]  ffff88042476fb88 ffff88042476fb58 ffff880421f25880 ffff8800ad798ec0
Aug  6 11:47:33 hostname kernel: [688504.889925]  ffff880424770000 ffff8800bffe0e28 ffff8800ad798ec0 ffff88042476fe18
Aug  6 11:47:33 hostname kernel: [688504.889926]  ffff8803fe8cf008 ffff88042476fba0 ffffffff8182d7c5 7fffffffffffffff
Aug  6 11:47:33 hostname kernel: [688504.889928] Call Trace:
Aug  6 11:47:33 hostname kernel: [688504.889930]  [<ffffffff8182d7c5>] schedule+0x35/0x80
Aug  6 11:47:33 hostname kernel: [688504.889931]  [<ffffffff818308e5>] schedule_timeout+0x1b5/0x270
Aug  6 11:47:33 hostname kernel: [688504.889933]  [<ffffffff8182d116>] ? __schedule+0x3b6/0xa30
Aug  6 11:47:33 hostname kernel: [688504.889934]  [<ffffffff8182f87f>] __down+0x7f/0xd0
Aug  6 11:47:33 hostname kernel: [688504.889936]  [<ffffffff810f5d00>] ? __getnstimeofday64+0x60/0xd0
Aug  6 11:47:33 hostname kernel: [688504.889938]  [<ffffffff810ca131>] down+0x41/0x50
Aug  6 11:47:33 hostname kernel: [688504.889999]  [<ffffffffc0665d87>] os_acquire_semaphore+0x37/0x40 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.890060]  [<ffffffffc0665d9e>] os_acquire_mutex+0xe/0x10 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.890183]  [<ffffffffc0c320cc>] _nv031494rm+0x5c/0x120 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.890289]  [<ffffffffc0803799>] ? _nv012470rm+0x29/0x120 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.890402]  [<ffffffffc0cad6db>] ? _nv001136rm+0x6b/0xd0 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.890514]  [<ffffffffc0cb1409>] ? rm_execute_work_item+0x49/0xc0 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.890581]  [<ffffffffc0666100>] ? os_free_mem+0x30/0x30 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.890643]  [<ffffffffc0666166>] ? os_execute_work_item+0x46/0x70 [nvidia]
Aug  6 11:47:33 hostname kernel: [688504.890646]  [<ffffffff8109a3e5>] ? process_one_work+0x165/0x480
Aug  6 11:47:33 hostname kernel: [688504.890647]  [<ffffffff8109a74b>] ? worker_thread+0x4b/0x4c0
Aug  6 11:47:33 hostname kernel: [688504.890649]  [<ffffffff8109a700>] ? process_one_work+0x480/0x480
Aug  6 11:47:33 hostname kernel: [688504.890650]  [<ffffffff810a0928>] ? kthread+0xd8/0xf0
Aug  6 11:47:33 hostname kernel: [688504.890652]  [<ffffffff810a0850>] ? kthread_create_on_node+0x1e0/0x1e0
Aug  6 11:47:33 hostname kernel: [688504.890653]  [<ffffffff81831c4f>] ? ret_from_fork+0x3f/0x70
Aug  6 11:47:33 hostname kernel: [688504.890654]  [<ffffffff810a0850>] ? kthread_create_on_node+0x1e0/0x1e0

The Xid error reported in the log is significant, and is likely related to the problem. If you search Google for “NVIDIA Xid error”, you can find the documentation on it as well as a listing of error types. However, that will not be enough to indicate exactly what is wrong or what is happening.
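Since the hang is intermittent, it may help to collect every Xid event from the kernel log and compare the timestamps with the cron runs. The following is only a minimal sketch: the regular expression is an assumption based on the single Xid line in the log above, and the helper names (`find_xid_events`, `scan_log`) are made up for illustration.

```python
import re

# Assumed line format, taken from the one Xid line in the log above:
#   NVRM: Xid (PCI:0000:01:00): 44, Ch 00000001, engmask 00000101, intr 10000000
XID_RE = re.compile(r"NVRM: Xid \((?P<pci>[^)]+)\): (?P<code>\d+)")


def find_xid_events(lines):
    """Return (pci_address, xid_code, raw_line) for each line that reports an Xid."""
    events = []
    for line in lines:
        match = XID_RE.search(line)
        if match:
            events.append(
                (match.group("pci"), int(match.group("code")), line.rstrip("\n"))
            )
    return events


def scan_log(path="/var/log/kern.log"):
    """Print every Xid event found in the given kernel log file."""
    with open(path) as log:
        for pci, code, raw in find_xid_events(log):
            print("Xid {} on {}: {}".format(code, pci, raw))
```

Calling `scan_log()` reads /var/log/kern.log, which may require appropriate permissions; rotated logs would have to be passed in explicitly.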

If you can develop a deterministic, reproducible test case, then it’s probably possible to sort it out. But I can’t offer much advice, from just the log, about a problem that happens only occasionally. There is a potentially long list of things that could be mentioned, such as making sure the GPU is not getting too hot, making sure you have a good power supply and power delivery, etc., but these sorts of recommendations are already all over the place. I don’t have anything to suggest based specifically on that log.

Thank you for your reply!

A Google search on “NVIDIA Xid error” helped me a lot!

I looked into the “Xid 44” error and the suggestions you gave, but I haven’t found a clue yet.
The GPU temperature is not very high (about 50 °C) and the power usage is not very high either (about 80 W), even at peak time.
But these values may spike occasionally, so I’ll monitor them periodically.
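For the periodic monitoring, something along these lines might work. It is a rough sketch, not a tested tool: it assumes nvidia-smi is on the PATH and that the GPU reports `temperature.gpu` and `power.draw`; the function names, the 5-second interval, and the log file name are placeholders.

```python
import subprocess
import time

# nvidia-smi query: one CSV row per GPU, without header or unit suffixes.
QUERY = [
    "nvidia-smi",
    "--query-gpu=timestamp,temperature.gpu,power.draw",
    "--format=csv,noheader,nounits",
]


def to_float(text):
    """Convert a field to float; return None for values such as '[Not Supported]'."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_sample(csv_line):
    """Split one CSV row into (timestamp, temperature_c, power_w)."""
    timestamp, temperature, power = [f.strip() for f in csv_line.split(",")]
    return timestamp, to_float(temperature), to_float(power)


def read_samples():
    """Run nvidia-smi once and return one parsed sample per GPU."""
    output = subprocess.check_output(QUERY).decode()
    return [parse_sample(line) for line in output.splitlines() if line.strip()]


def monitor(interval_seconds=5, log_path="gpu_monitor.log"):
    """Append temperature and power readings to a log file until interrupted."""
    with open(log_path, "a") as log:
        while True:
            for timestamp, temperature, power in read_samples():
                log.write("{} temp={}C power={}W\n".format(timestamp, temperature, power))
            log.flush()
            time.sleep(interval_seconds)
```

Note that if the driver itself is wedged, as in the log above, the nvidia-smi call may block as well, so the last line written before a hang is probably the most interesting one.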

According to the official list of Xid codes (XID Errors :: GPU Deployment and Management Documentation), code 44 is:

Graphics Engine fault during context switch

A context switch sounds like something the OS would do with help from the NVIDIA driver, so this could be a software issue. An issue directly related to hardware (e.g. insufficient power supply, high temperature) seems unlikely, but cannot be excluded for sure, given that we have no detailed description of what gives rise to this error.

I will point out (as I have a gazillion times before) that Ubuntu systems account for the vast majority (think 98%) of the Linux distros in public reports of weird errors with CUDA. This may be due to the popularity of this distro, but I would attribute it at least partially to the quality of that particular distro (in the Linux world, the Ubuntu folks are the people who “think differently”).