410.78 driver, GPUs will lock up

We have some GPUs which randomly lock up and become non-responsive.

These GPUs run TensorFlow jobs and, after working for some time, stop responding.

The hardware SKUs are Titan Xp and GTX 1080 Ti.

The driver version is 410.78, with CUDA 10, Debian 8 (Jessie), and Linux kernel 4.4.92.

When the GPUs finally respond, we see ERR! against fan speed and current power usage (watts); the other metrics report fine.

I am not sure what can be done to mitigate this problem.

If more information is needed, I will be happy to provide it.
nvidia-bug-report.log.gz (2.21 MB)

Please run nvidia-bug-report.sh as root and attach the resulting .gz file to your post. Hovering the mouse over an existing post of yours will reveal a paperclip icon.
https://devtalk.nvidia.com/default/topic/1043347/announcements/attaching-files-to-forum-topics-posts/

attached to the original post.

At some point the GPU hangs, so the kernel oopses:

[858858.756074] INFO: task python2.7:23708 blocked for more than 120 seconds.
[858858.762961]       Tainted: P        W  OE   4.4.92 #1
[858858.768120] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[858858.776034] python2.7       D ffff881fffc75d40     0 23708  23466 0x00000006
[858858.783201]  ffff8834766427c0 ffff881ff2c89a80 ffff88347ce20000 ffff88347ce1fb98
[858858.790758]  ffffffffc15d9e5c ffff8834766427c0 00000000ffffffff ffffffffc15d9e60
[858858.798315]  ffffffff815a3971 ffffffffc15d9e58 ffffffff815a3bfa ffffffff815a57a4
[858858.805858] Call Trace:
[858858.808397]  [<ffffffff815a3971>] ? schedule+0x31/0x80
[858858.813619]  [<ffffffff815a3bfa>] ? schedule_preempt_disabled+0xa/0x10
[858858.820248]  [<ffffffff815a57a4>] ? __mutex_lock_slowpath+0xb4/0x120
[858858.826686]  [<ffffffff815a582b>] ? mutex_lock+0x1b/0x30
[858858.832145]  [<ffffffffc154b435>] ? uvm_gpu_release+0x15/0x30 [nvidia_uvm]
[858858.839103]  [<ffffffffc154f8b2>] ? uvm_deferred_free_object_list+0x52/0xf0 [nvidia_uvm]
[858858.847284]  [<ffffffffc154fbf6>] ? uvm_va_space_destroy+0x2a6/0x3c0 [nvidia_uvm]
[858858.854854]  [<ffffffffc154265d>] ? uvm_release+0xd/0x20 [nvidia_uvm]
[858858.861384]  [<ffffffff811dff7a>] ? __fput+0xca/0x1d0
[858858.866525]  [<ffffffff81095b25>] ? task_work_run+0x75/0x90
[858858.872184]  [<ffffffff8107c145>] ? do_exit+0x385/0xb10
[858858.877659]  [<ffffffffc0a28476>] ? _nv036631rm+0xa6/0x140 [nvidia]
[858858.884020]  [<ffffffff8107c949>] ? do_group_exit+0x39/0xb0
[858858.889681]  [<ffffffff81087b7e>] ? get_signal+0x2be/0x6b0
[858858.895253]  [<ffffffff810165d6>] ? do_signal+0x36/0x6d0
[858858.900652]  [<ffffffff811c4300>] ? __kmalloc+0x120/0x1a0
[858858.906242]  [<ffffffffc040706c>] ? nvidia_frontend_unlocked_ioctl+0x3c/0x40 [nvidia]
[858858.914173]  [<ffffffff811f1a86>] ? do_vfs_ioctl+0x2d6/0x4b0
[858858.919919]  [<ffffffff810031e5>] ? exit_to_usermode_loop+0x85/0xc0
[858858.926288]  [<ffffffff81003bcf>] ? syscall_return_slowpath+0x8f/0x110
[858858.932917]  [<ffffffff815a7898>] ? int_ret_from_sys_call+0x25/0x8f
[858858.939269] INFO: task python2.7:23712 blocked for more than 120 seconds.
[858858.946152]       Tainted: P        W  OE   4.4.92 #1
[858858.951289] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[858858.959204] python2.7       D ffff881fffcf5d40     0 23712  23471 0x00000006
[858858.966376]  ffff883fecb0ea00 ffff881ff2c8cf80 ffff883f0d178000 ffff883f0d177be0
[858858.973922]  ffffffffc15d9e5c ffff883fecb0ea00 00000000ffffffff ffffffffc15d9e60
[858858.981470]  ffffffff815a3971 ffffffffc15d9e58 ffffffff815a3bfa ffffffff815a57a4
[858858.989011] Call Trace:
[858858.991544]  [<ffffffff815a3971>] ? schedule+0x31/0x80
[858858.996767]  [<ffffffff815a3bfa>] ? schedule_preempt_disabled+0xa/0x10
[858859.003378]  [<ffffffff815a57a4>] ? __mutex_lock_slowpath+0xb4/0x120
[858859.009818]  [<ffffffff815a582b>] ? mutex_lock+0x1b/0x30
[858859.015239]  [<ffffffffc154fc02>] ? uvm_va_space_destroy+0x2b2/0x3c0 [nvidia_uvm]
[858859.022807]  [<ffffffffc154265d>] ? uvm_release+0xd/0x20 [nvidia_uvm]
[858859.029347]  [<ffffffff811dff7a>] ? __fput+0xca/0x1d0
[858859.034484]  [<ffffffff81095b25>] ? task_work_run+0x75/0x90
[858859.040153]  [<ffffffff8107c145>] ? do_exit+0x385/0xb10
[858859.045485]  [<ffffffff8107c949>] ? do_group_exit+0x39/0xb0
[858859.051147]  [<ffffffff81087b7e>] ? get_signal+0x2be/0x6b0
[858859.056715]  [<ffffffff810165d6>] ? do_signal+0x36/0x6d0
[858859.062114]  [<ffffffff811a1585>] ? do_mmap+0x335/0x420
[858859.067428]  [<ffffffff811f1a86>] ? do_vfs_ioctl+0x2d6/0x4b0
[858859.073173]  [<ffffffff810031e5>] ? exit_to_usermode_loop+0x85/0xc0
[858859.079542]  [<ffffffff81003bcf>] ? syscall_return_slowpath+0x8f/0x110
[858859.086155]  [<ffffffff815a7898>] ? int_ret_from_sys_call+0x25/0x8f
[858859.092509] INFO: task python2.7:23718 blocked for more than 120 seconds.
[858859.099382]       Tainted: P        W  OE   4.4.92 #1
[858859.104519] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[858859.112424] python2.7       D ffff881fffc35d40     0 23718  23480 0x00000006
[858859.119587]  ffff883fecb0a7c0 ffff881ff2c88000 ffff883e9243c000 ffff883e9243bb98
[858859.127127]  ffffffffc15d9e5c ffff883fecb0a7c0 00000000ffffffff ffffffffc15d9e60
[858859.134666]  ffffffff815a3971 ffffffffc15d9e58 ffffffff815a3bfa ffffffff815a57a4
[858859.142216] Call Trace:
[858859.144753]  [<ffffffff815a3971>] ? schedule+0x31/0x80
[858859.149978]  [<ffffffff815a3bfa>] ? schedule_preempt_disabled+0xa/0x10
[858859.156590]  [<ffffffff815a57a4>] ? __mutex_lock_slowpath+0xb4/0x120
[858859.163022]  [<ffffffff815a582b>] ? mutex_lock+0x1b/0x30
[858859.168445]  [<ffffffffc154b435>] ? uvm_gpu_release+0x15/0x30 [nvidia_uvm]
[858859.175404]  [<ffffffffc154f8b2>] ? uvm_deferred_free_object_list+0x52/0xf0 [nvidia_uvm]
[858859.183575]  [<ffffffffc154fbf6>] ? uvm_va_space_destroy+0x2a6/0x3c0 [nvidia_uvm]
[858859.191142]  [<ffffffffc154265d>] ? uvm_release+0xd/0x20 [nvidia_uvm]
[858859.197663]  [<ffffffff811dff7a>] ? __fput+0xca/0x1d0
[858859.202804]  [<ffffffff81095b25>] ? task_work_run+0x75/0x90
[858859.208493]  [<ffffffff8107c145>] ? do_exit+0x385/0xb10
[858859.213981]  [<ffffffffc0a28d60>] ? _nv033856rm+0x90/0xd0 [nvidia]
[858859.220382]  [<ffffffffc09bd024>] ? _nv007841rm+0x174/0x1e0 [nvidia]
[858859.226822]  [<ffffffff8107c949>] ? do_group_exit+0x39/0xb0
[858859.232478]  [<ffffffff81087b7e>] ? get_signal+0x2be/0x6b0
[858859.238052]  [<ffffffff810165d6>] ? do_signal+0x36/0x6d0
[858859.243515]  [<ffffffffc040706c>] ? nvidia_frontend_unlocked_ioctl+0x3c/0x40 [nvidia]
[858859.251428]  [<ffffffff811f1a86>] ? do_vfs_ioctl+0x2d6/0x4b0
[858859.257170]  [<ffffffff810031e5>] ? exit_to_usermode_loop+0x85/0xc0
[858859.263524]  [<ffffffff81003bcf>] ? syscall_return_slowpath+0x8f/0x110
[858859.270140]  [<ffffffff815a7898>] ? int_ret_from_sys_call+0x25/0x8f

But there’s no clear message from the driver. Did you try monitoring temperatures using nvidia-smi?
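A continuous temperature/fan/power watch can be set up with nvidia-smi’s query mode; the field list and 5-second interval below are just one possible choice:

```shell
# Log GPU health metrics every 5 seconds; ERR! readings for fan speed
# or power draw would show up in this log as soon as a GPU stops
# responding, along with the timestamp of when it happened.
nvidia-smi \
  --query-gpu=timestamp,index,name,temperature.gpu,fan.speed,power.draw \
  --format=csv -l 5 | tee gpu-health.log
```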

We do monitor the GPU temperatures, and they are around 80 °C.

Also, it seems those oopses are a consequence of the GPU becoming non-responsive, not the cause.

OK, temperatures should be fine then, but the reason for the hangs is still unclear. If the GPUs that fail are always the same ones, use cuda-memtest and gpu-burn to test for general hardware failure. Also send the log to linux-bugs[at]nvidia.com.
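For the hardware stress test suggested above, a possible flow looks roughly like this. This is only a sketch: both tools are built from source per their READMEs, and the exact binary names and flags may differ between releases, so check each tool’s usage output first.

```shell
# Fetch the two stress-test tools (build each per its README).
git clone https://github.com/ComputationalRadiationPhysics/cuda_memtest
git clone https://github.com/wilicc/gpu-burn

# Exercise the memory of the suspect GPU (index 0 as an example).
./cuda_memtest --device 0 --num_passes 1

# Load the GPU at full power for one hour (argument is seconds);
# a hardware fault often surfaces as compute errors or a lockup here.
./gpu_burn 3600
```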

We did see an interesting turn of events here: as soon as we stop all processes accessing the device (/dev/nvidia*) and reset the GPU in question using nvidia-smi --gpu-reset -i <id>, things go back to normal.

Is there a tool or debug flow I can use to see what is causing the events in question?
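For anyone hitting the same state, the recovery flow described above is roughly the following sketch (GPU index 0 is an example; a reset fails while any client still holds the device open):

```shell
# 1. Find every process holding the device nodes open.
fuser -v /dev/nvidia*

# 2. Stop those jobs cleanly (or kill the listed PIDs), then verify
#    that no clients remain attached.
fuser -v /dev/nvidia* || echo "no clients attached"

# 3. Reset the hung GPU.
sudo nvidia-smi --gpu-reset -i 0
```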

I just noticed that you’re running headless and don’t have nvidia-persistenced started. This can lead to those UVM-related hangs due to the driver deinitializing. Please set nvidia-persistenced to start on boot and check whether that resolves the issue.
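On a systemd-based headless box (Jessie qualifies), enabling persistence would look something like this; the nvidia-persistenced unit is the daemon shipped with the driver, and the nvidia-smi fallback is the older per-GPU approach:

```shell
# Start the persistence daemon now and on every boot, so the driver
# stays initialized even when no client holds /dev/nvidia* open.
sudo systemctl enable --now nvidia-persistenced

# Legacy alternative: turn persistence mode on directly via nvidia-smi.
sudo nvidia-smi -pm 1
```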