384.90 on CentOS 7.3 sometimes blocks the CPU

Guys,

I am running driver 384.90 on CentOS 7.3.
Here is my system information:

# uname -a
Linux ennew-gpu-centos-151 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
# nvidia-smi
Wed Nov  1 01:55:50 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90                 Driver Version: 384.90                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 00000000:01:00.0 Off |                  N/A |
| 30%   46C    P0    40W / 180W |      0MiB /  8113MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 00000000:02:00.0 Off |                  N/A |
|  0%   43C    P0    36W / 180W |      0MiB /  8114MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

But sometimes nvidia-smi hangs and becomes a “Dead” process:

Oct 31 10:10:09 ennew-gpu-centos-151 kernel: INFO: task nvidia-smi:9868 blocked for more than 120 seconds.
Oct 31 10:10:09 ennew-gpu-centos-151 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 31 10:10:09 ennew-gpu-centos-151 kernel: nvidia-smi      D ffff880ff919dee0     0  9868   1551 0x00000080
Oct 31 10:10:09 ennew-gpu-centos-151 kernel: ffff880f267db940 0000000000000086 ffff880ff919dee0 ffff880f267dbfd8
Oct 31 10:10:09 ennew-gpu-centos-151 kernel: ffff880f267dbfd8 ffff880f267dbfd8 ffff880ff919dee0 ffff880ffdad17c8
Oct 31 10:10:09 ennew-gpu-centos-151 kernel: 7fffffffffffffff 0000000000000002 0000000000000000 ffff880ff919dee0
Oct 31 10:10:09 ennew-gpu-centos-151 kernel: Call Trace:
Oct 31 10:10:09 ennew-gpu-centos-151 kernel: [<ffffffff816a9589>] schedule+0x29/0x70
Oct 31 10:10:09 ennew-gpu-centos-151 kernel: [<ffffffff816a7099>] schedule_timeout+0x239/0x2c0
Oct 31 10:10:09 ennew-gpu-centos-151 kernel: [<ffffffffc0ae1178>] ? _nv020163rm+0x8/0x40 [nvidia]
Oct 31 10:10:09 ennew-gpu-centos-151 kernel: [<ffffffffc0ae1b3d>] ? _nv020212rm+0xd/0x30 [nvidia]
Oct 31 10:10:09 ennew-gpu-centos-151 kernel: [<ffffffffc0ae1ca4>] ? _nv020246rm+0x34/0xd0 [nvidia]
Oct 31 10:10:09 ennew-gpu-centos-151 kernel: [<ffffffff810ea8da>] ? __getnstimeofday64+0x3a/0xd0
Oct 31 10:10:09 ennew-gpu-centos-151 kernel: [<ffffffff816a8927>] __down_common+0xaa/0x104
Oct 31 10:10:09 ennew-gpu-centos-151 kernel: [<ffffffffc054e300>] ? os_free_semaphore+0x10/0x10 [nvidia]
Oct 31 10:10:09 ennew-gpu-centos-151 kernel: [<ffffffff816a899e>] __down+0x1d/0x1f
Oct 31 10:10:09 ennew-gpu-centos-151 kernel: [<ffffffff810b66a1>] down+0x41/0x50
Oct 31 10:10:09 ennew-gpu-centos-151 kernel: [<ffffffffc054dea2>] os_acquire_mutex+0x42/0x50 [nvidia]
Oct 31 10:10:09 ennew-gpu-centos-151 kernel: [<ffffffffc0b0c1bc>] _nv031396rm+0x5c/0x120 [nvidia]
Oct 31 10:10:09 ennew-gpu-centos-151 kernel: [<ffffffffc0b8c220>] ? rm_get_gpu_uuid_raw+0x70/0x120 [nvidia]
Oct 31 10:10:09 ennew-gpu-centos-151 kernel: [<ffffffff810c4701>] ? try_to_wake_up+0x2e1/0x340
Oct 31 10:10:09 ennew-gpu-centos-151 kernel: [<ffffffffc0543228>] ? nv_open_device+0x5b8/0x770 [nvidia]
Oct 31 10:10:09 ennew-gpu-centos-151 kernel: [<ffffffff811df4c5>] ? kmem_cache_alloc+0x35/0x1e0
Oct 31 10:10:09 ennew-gpu-centos-151 kernel: [<ffffffffc05437fc>] ? nvidia_open+0x14c/0x300 [nvidia]
Oct 31 10:10:09 ennew-gpu-centos-151 kernel: [<ffffffffc0541382>] ? nvidia_frontend_open+0x52/0xb0 [nvidia]
Oct 31 10:10:09 ennew-gpu-centos-151 kernel: [<ffffffff81205fe2>] ? chrdev_open+0xb2/0x1b0
Oct 31 10:10:09 ennew-gpu-centos-151 kernel: [<ffffffff811fe5e7>] ? do_dentry_open+0x1a7/0x2e0
Oct 31 10:10:09 ennew-gpu-centos-151 kernel: [<ffffffff812b1fbc>] ? security_inode_permission+0x1c/0x30
Oct 31 10:10:09 ennew-gpu-centos-151 kernel: [<ffffffff81205f30>] ? cdev_put+0x30/0x30
Oct 31 10:10:09 ennew-gpu-centos-151 kernel: [<ffffffff811fe7ba>] ? vfs_open+0x5a/0xb0
Oct 31 10:10:09 ennew-gpu-centos-151 kernel: [<ffffffff8120c398>] ? may_open+0x68/0x110
Oct 31 10:10:09 ennew-gpu-centos-151 kernel: [<ffffffff8120f80d>] ? do_last+0x1ed/0x12c0
Oct 31 10:10:09 ennew-gpu-centos-151 kernel: [<ffffffff812109a2>] ? path_openat+0xc2/0x490
Oct 31 10:10:09 ennew-gpu-centos-151 kernel: [<ffffffff81212e42>] ? user_path_at_empty+0x72/0xc0
Oct 31 10:10:09 ennew-gpu-centos-151 kernel: [<ffffffff81212f3b>] ? do_filp_open+0x4b/0xb0
Oct 31 10:10:09 ennew-gpu-centos-151 kernel: [<ffffffff8122019a>] ? __alloc_fd+0x8a/0x130
Oct 31 10:10:09 ennew-gpu-centos-151 kernel: [<ffffffff811ffb83>] ? do_sys_open+0xf3/0x1f0
Oct 31 10:10:09 ennew-gpu-centos-151 kernel: [<ffffffff811ffc9e>] ? SyS_open+0x1e/0x20
Oct 31 10:10:09 ennew-gpu-centos-151 kernel: [<ffffffff816b5089>] ? system_call_fastpath+0x16/0x1b
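
For reference, the “D” in the task line above is uninterruptible sleep: the process cannot be killed until the kernel lock it is waiting on is released. A small sketch for spotting such stuck tasks (the list_d_state helper is just my own naming):

```shell
#!/bin/sh
# Keep only processes in uninterruptible sleep (state "D") from
# "PID STAT COMMAND"-style ps output; these are the tasks the
# kernel's hung-task watchdog reports.
list_d_state() {
    awk 'NR > 1 && $2 ~ /^D/'
}

# Scan the live process table for stuck tasks.
ps -eo pid,stat,comm | list_d_state
```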

I tried running nvidia-smi 1000 times in a row and saw no such hang, so I am not sure what triggers this. Could it be a hardware issue, or do I need to upgrade the driver to 387.xx?
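
The 1000-run experiment looked roughly like this (a sketch; run_loop is my own naming and the 30 s bound for declaring a hang is arbitrary):

```shell
#!/bin/sh
# Run a command N times, stopping at the first failure or hang.
# A run that blocks longer than 30 s is cut off by timeout(1),
# which reports exit code 124.
run_loop() {
    cmd="$1"
    n="$2"
    i=1
    while [ "$i" -le "$n" ]; do
        timeout 30 "$cmd" > /dev/null 2>&1
        rc=$?
        if [ "$rc" -ne 0 ]; then
            echo "iteration $i failed with exit code $rc"
            return 1
        fi
        i=$((i + 1))
    done
    echo "all $n iterations succeeded"
}

# Only exercise the real driver when it is present.
if command -v nvidia-smi > /dev/null; then
    run_loop nvidia-smi 1000
fi
```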

Thanks~!
David

I caught another crash, this time with a different call stack.

Nov  1 08:26:38 ennew-gpu-centos-147 kernel: CPU: 0 PID: 1266 Comm: nvidia-smi Tainted: P           OE  ------------   3.10.0-693.5.2.el7.x86_64 #1
Nov  1 08:26:38 ennew-gpu-centos-147 kernel: Hardware name: System manufacturer System Product Name/Z170 PRO GAMING, BIOS 2003 09/19/2016
Nov  1 08:26:38 ennew-gpu-centos-147 kernel: ffff881046c03d88 000000000ff9720b ffff881046c03d38 ffffffff816a3e51
Nov  1 08:26:38 ennew-gpu-centos-147 kernel: ffff881046c03d78 ffffffff810879d8 0000012c46c1a800 0000000000000000
Nov  1 08:26:38 ennew-gpu-centos-147 kernel: ffff8810016f0000 0000000000000001 ffff88100280f680 0000000000000000
Nov  1 08:26:38 ennew-gpu-centos-147 kernel: Call Trace:
Nov  1 08:26:38 ennew-gpu-centos-147 kernel: <IRQ>  [<ffffffff816a3e51>] dump_stack+0x19/0x1b
Nov  1 08:26:38 ennew-gpu-centos-147 kernel: [<ffffffff810879d8>] __warn+0xd8/0x100
Nov  1 08:26:38 ennew-gpu-centos-147 kernel: [<ffffffff81087a5f>] warn_slowpath_fmt+0x5f/0x80
Nov  1 08:26:38 ennew-gpu-centos-147 kernel: [<ffffffff815af402>] dev_watchdog+0x242/0x250
Nov  1 08:26:38 ennew-gpu-centos-147 kernel: [<ffffffff815af1c0>] ? dev_deactivate_queue.constprop.33+0x60/0x60
Nov  1 08:26:38 ennew-gpu-centos-147 kernel: [<ffffffff81097326>] call_timer_fn+0x36/0x110
Nov  1 08:26:38 ennew-gpu-centos-147 kernel: [<ffffffff815af1c0>] ? dev_deactivate_queue.constprop.33+0x60/0x60
Nov  1 08:26:38 ennew-gpu-centos-147 kernel: [<ffffffff8109983d>] run_timer_softirq+0x22d/0x310
Nov  1 08:26:38 ennew-gpu-centos-147 kernel: [<ffffffff81090b4f>] __do_softirq+0xef/0x280
Nov  1 08:26:38 ennew-gpu-centos-147 kernel: [<ffffffff816b6b1c>] call_softirq+0x1c/0x30
Nov  1 08:26:38 ennew-gpu-centos-147 kernel: [<ffffffff8102d3c5>] do_softirq+0x65/0xa0
Nov  1 08:26:38 ennew-gpu-centos-147 kernel: [<ffffffff81090ed5>] irq_exit+0x105/0x110
Nov  1 08:26:38 ennew-gpu-centos-147 kernel: [<ffffffff816b7782>] smp_apic_timer_interrupt+0x42/0x50
Nov  1 08:26:38 ennew-gpu-centos-147 kernel: [<ffffffff816b5cdd>] apic_timer_interrupt+0x6d/0x80
Nov  1 08:26:38 ennew-gpu-centos-147 kernel: <EOI>  [<ffffffffc0bbfc00>] ? _nv000659rm+0x20/0x20 [nvidia]
Nov  1 08:26:38 ennew-gpu-centos-147 kernel: [<ffffffffc0576b2e>] ? os_io_write_dword+0xe/0x10 [nvidia]
Nov  1 08:26:38 ennew-gpu-centos-147 kernel: [<ffffffffc0bc95b0>] _nv035897rm+0x20/0x60 [nvidia]
Nov  1 08:26:38 ennew-gpu-centos-147 kernel: [<ffffffffc0bc02e5>] ? _nv001359rm+0x85/0xb0 [nvidia]
Nov  1 08:26:38 ennew-gpu-centos-147 kernel: [<ffffffffc0bc00c4>] ? _nv027237rm+0x164/0x200 [nvidia]
Nov  1 08:26:38 ennew-gpu-centos-147 kernel: [<ffffffffc0ba8f0d>] ? _nv028424rm+0x4d/0x140 [nvidia]
Nov  1 08:26:38 ennew-gpu-centos-147 kernel: [<ffffffffc0bad724>] ? _nv001232rm+0x344/0x430 [nvidia]
Nov  1 08:26:38 ennew-gpu-centos-147 kernel: [<ffffffffc0bada30>] ? _nv001078rm+0x220/0x3c0 [nvidia]
Nov  1 08:26:38 ennew-gpu-centos-147 kernel: [<ffffffffc0bbaa59>] ? _nv001099rm+0x299/0x330 [nvidia]
Nov  1 08:26:38 ennew-gpu-centos-147 kernel: [<ffffffffc0baf59a>] ? rm_disable_adapter+0x6a/0x130 [nvidia]
Nov  1 08:26:38 ennew-gpu-centos-147 kernel: [<ffffffff810b6801>] ? up+0x31/0x50
Nov  1 08:26:38 ennew-gpu-centos-147 kernel: [<ffffffffc0569e3c>] ? nv_shutdown_adapter+0x1c/0x90 [nvidia]
Nov  1 08:26:38 ennew-gpu-centos-147 kernel: [<ffffffffc0569f2a>] ? nv_close_device+0x7a/0x170 [nvidia]
Nov  1 08:26:38 ennew-gpu-centos-147 kernel: [<ffffffffc056e39a>] ? nvidia_close+0xda/0x3a0 [nvidia]
Nov  1 08:26:38 ennew-gpu-centos-147 kernel: [<ffffffffc056940c>] ? nvidia_frontend_close+0x2c/0x50 [nvidia]
Nov  1 08:26:38 ennew-gpu-centos-147 kernel: [<ffffffff81202f39>] ? __fput+0xe9/0x260
Nov  1 08:26:38 ennew-gpu-centos-147 kernel: [<ffffffff8120319e>] ? ____fput+0xe/0x10
Nov  1 08:26:38 ennew-gpu-centos-147 kernel: [<ffffffff810ad257>] ? task_work_run+0xa7/0xf0
Nov  1 08:26:38 ennew-gpu-centos-147 kernel: [<ffffffff8102ab62>] ? do_notify_resume+0x92/0xb0
Nov  1 08:26:38 ennew-gpu-centos-147 kernel: [<ffffffff816b533d>] ? int_signal+0x12/0x17

Could this be a hardware issue? I have two GeForce GTX 1080 cards installed in each machine; could some other component, e.g. the Ethernet card, interfere with the stability of the GPUs?

Thanks~
David

I’m not really sure about this, but both backtraces look like they originate at the filesystem level. CentOS uses a rather old kernel (though stuffed with backports) and you’re running a rather new Sky/Kaby Lake system.

It happened when calling nvidia-smi, which opens and ioctls the device files under /dev, so it is normal for the call stack to originate in the VFS layer.
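
This is easy to confirm with strace: the only kernel entry points nvidia-smi uses are open()/ioctl() on /dev/nvidiactl and the per-GPU /dev/nvidiaN nodes, which matches the chrdev_open and do_vfs_ioctl frames in the traces. A sketch (guarded so it is a no-op on machines without the driver or strace):

```shell
#!/bin/sh
# Keep only the syscall lines that touch the NVIDIA device nodes.
nv_dev_lines() {
    grep '/dev/nvidia'
}

# Trace only file-open and ioctl syscalls issued by nvidia-smi;
# expect /dev/nvidiactl plus /dev/nvidia0, /dev/nvidia1 to appear.
if command -v strace > /dev/null && command -v nvidia-smi > /dev/null; then
    strace -f -e trace=open,openat,ioctl nvidia-smi 2>&1 | nv_dev_lines
fi
```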

I have also tested with CoreOS 1465.8.0:

$ uname -r
4.12.14-coreos

The same thing happened, with even higher probability.

An update:

I turned off the TSO and GSO flags on my Ethernet card, since I noticed that e1000e error logs sometimes show up along with the nvidia call trace. Strangely, my 5-node system has now run stably for more than 24 h without any “Dead” process.
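
For anyone who wants to try the same thing, the offload flags can be toggled with ethtool (eth0 is a placeholder for your interface name; needs root, and the setting does not survive a reboot):

```shell
# Disable TCP segmentation offload and generic segmentation offload
# on the e1000e interface.
ethtool -K eth0 tso off gso off

# Confirm the new settings; both lines should read "off".
ethtool -k eth0 | grep 'segmentation-offload'
```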

I will keep monitoring my system and update later.

Sadly, the dead process still happened.

Nov  8 10:52:16 ennew-gpu-centos-147 kernel: INFO: task nvidia-smi:28641 blocked for more than 120 seconds.
Nov  8 10:52:16 ennew-gpu-centos-147 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov  8 10:52:16 ennew-gpu-centos-147 kernel: nvidia-smi      D ffff88078f623f40     0 28641  12417 0x00000080
Nov  8 10:52:16 ennew-gpu-centos-147 kernel: ffff8805a2b43b30 0000000000000082 ffff88078f623f40 ffff8805a2b43fd8
Nov  8 10:52:16 ennew-gpu-centos-147 kernel: ffff8805a2b43fd8 ffff8805a2b43fd8 ffff88078f623f40 ffff881003fb4268
Nov  8 10:52:16 ennew-gpu-centos-147 kernel: 7fffffffffffffff 0000000000000002 0000000000000000 ffff88078f623f40
Nov  8 10:52:16 ennew-gpu-centos-147 kernel: Call Trace:
Nov  8 10:52:16 ennew-gpu-centos-147 kernel: [<ffffffff816a9589>] schedule+0x29/0x70
Nov  8 10:52:16 ennew-gpu-centos-147 kernel: [<ffffffff816a7099>] schedule_timeout+0x239/0x2c0
Nov  8 10:52:16 ennew-gpu-centos-147 kernel: [<ffffffff810ce906>] ? check_preempt_wakeup+0x166/0x250
Nov  8 10:52:16 ennew-gpu-centos-147 kernel: [<ffffffff810c12e5>] ? check_preempt_curr+0x85/0xa0
Nov  8 10:52:16 ennew-gpu-centos-147 kernel: [<ffffffff816a8927>] __down_common+0xaa/0x104
Nov  8 10:52:16 ennew-gpu-centos-147 kernel: [<ffffffff810ea97e>] ? getnstimeofday64+0xe/0x30
Nov  8 10:52:16 ennew-gpu-centos-147 kernel: [<ffffffff816a899e>] __down+0x1d/0x1f
Nov  8 10:52:16 ennew-gpu-centos-147 kernel: [<ffffffff810b66a1>] down+0x41/0x50
Nov  8 10:52:16 ennew-gpu-centos-147 kernel: [<ffffffffc06ceea2>] os_acquire_mutex+0x42/0x50 [nvidia]
Nov  8 10:52:16 ennew-gpu-centos-147 kernel: [<ffffffffc0c8d1bc>] _nv031396rm+0x5c/0x120 [nvidia]
Nov  8 10:52:16 ennew-gpu-centos-147 kernel: [<ffffffffc0c8b70e>] ? _nv007599rm+0x20e/0x2a0 [nvidia]
Nov  8 10:52:16 ennew-gpu-centos-147 kernel: [<ffffffffc0c8b7b2>] ? _nv001091rm+0x12/0x20 [nvidia]
Nov  8 10:52:16 ennew-gpu-centos-147 kernel: [<ffffffffc0c77384>] ? _nv006820rm+0x64/0xa0 [nvidia]
Nov  8 10:52:16 ennew-gpu-centos-147 kernel: [<ffffffffc0cffae8>] ? _nv001193rm+0x5e8/0x880 [nvidia]
Nov  8 10:52:16 ennew-gpu-centos-147 kernel: [<ffffffffc0d09f53>] ? rm_ioctl+0x73/0x100 [nvidia]
Nov  8 10:52:16 ennew-gpu-centos-147 kernel: [<ffffffffc06c6d00>] ? nvidia_ioctl+0x60/0x5e0 [nvidia]
Nov  8 10:52:16 ennew-gpu-centos-147 kernel: [<ffffffffc06c6e55>] ? nvidia_ioctl+0x1b5/0x5e0 [nvidia]
Nov  8 10:52:16 ennew-gpu-centos-147 kernel: [<ffffffffc06c2081>] ? nvidia_frontend_unlocked_ioctl+0x41/0x50 [nvidia]
Nov  8 10:52:16 ennew-gpu-centos-147 kernel: [<ffffffff812151bd>] ? do_vfs_ioctl+0x33d/0x540
Nov  8 10:52:16 ennew-gpu-centos-147 kernel: [<ffffffff816b0091>] ? __do_page_fault+0x171/0x450
Nov  8 10:52:16 ennew-gpu-centos-147 kernel: [<ffffffff81215461>] ? SyS_ioctl+0xa1/0xc0
Nov  8 10:52:16 ennew-gpu-centos-147 kernel: [<ffffffff816b5089>] ? system_call_fastpath+0x16/0x1b

Just an update: it seems like a power-supply stability issue.
After enabling persistence mode via ‘nvidia-smi -pm 1’, all nodes have been running stably for a long while.
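
For completeness, this is what I ran (needs root; the setting is lost on reboot, so put it in a boot script — newer drivers also offer the nvidia-persistenced daemon for the same purpose):

```shell
# Keep the driver state initialized even when no client is attached,
# so each nvidia-smi invocation no longer powers the GPUs up and down.
nvidia-smi -pm 1

# Verify: the "Persistence-M" column should now read "On".
nvidia-smi
```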

My guess is that keeping the GPUs “warmed up” makes the critical kernel call paths return much more reliably.

Thanks~!
David