Hello,
We have recently been seeing random GPU hangs on some of our GPUs.
When the issue occurs, GPU utilization is stuck at 100%, but no graphics or CUDA workload is actually running anymore. In dmesg I see errors like the following:
[369880.646296] INFO: task nv_queue:2351 blocked for more than 120 seconds.
[369880.646299] Tainted: P OE 4.15.0-43-generic #46~16.04.1-Ubuntu
[369880.646300] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[369880.646301] nv_queue D 0 2351 2 0x80000000
[369880.646302] Call Trace:
[369880.646307] __schedule+0x3d6/0x8b0
[369880.646309] ? __enqueue_entity+0x5c/0x60
[369880.646311] schedule+0x36/0x80
[369880.646312] schedule_timeout+0x1db/0x370
[369880.646441] ? os_acquire_spinlock+0x12/0x20 [nvidia]
[369880.646444] ? check_preempt_curr+0x54/0x90
[369880.646445] ? ttwu_do_wakeup+0x1e/0x150
[369880.646447] __down+0x8a/0xe0
[369880.646448] down+0x41/0x50
[369880.646449] ? down+0x41/0x50
[369880.646536] os_acquire_semaphore+0x38/0x40 [nvidia]
[369880.646588] _nv008372rm+0x4f6/0x670 [nvidia]
[369880.646640] ? _nv034059rm+0x45/0xc0 [nvidia]
[369880.646771] ? _nv007815rm+0x53/0x170 [nvidia]
[369880.646853] ? rm_execute_work_item+0x3d/0xc0 [nvidia]
[369880.646902] ? os_execute_work_item+0x4a/0x60 [nvidia]
[369880.646951] ? _main_loop+0x94/0x140 [nvidia]
[369880.646955] ? kthread+0x105/0x140
[369880.647003] ? _raw_q_schedule+0x70/0x70 [nvidia]
[369880.647005] ? kthread_destroy_worker+0x50/0x50
[369880.647006] ? ret_from_fork+0x35/0x40
[369880.647042] INFO: task PhoneFinder:32456 blocked for more than 120 seconds.
[369880.647044] Tainted: P OE 4.15.0-43-generic #46~16.04.1-Ubuntu
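To pull just the hung-task reports out of a long dmesg dump, something like the following Python sketch may help. The function name `find_hung_tasks` and the regular expression are my own assumptions, written against the message format shown in the log above; they are not part of any NVIDIA or kernel tool.

```python
import re
import sys

# Matches the kernel hung-task line as it appears in the log above, e.g.
# "[369880.646296] INFO: task nv_queue:2351 blocked for more than 120 seconds."
# (Assumed format; other kernel versions may word this message differently.)
HUNG_TASK_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+INFO: task (?P<name>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds\."
)


def find_hung_tasks(dmesg_text):
    """Return (timestamp, task name, pid, seconds) for each hung-task report."""
    hits = []
    for line in dmesg_text.splitlines():
        m = HUNG_TASK_RE.search(line)
        if m:
            hits.append(
                (float(m["ts"]), m["name"], int(m["pid"]), int(m["secs"]))
            )
    return hits


if __name__ == "__main__":
    # Reads dmesg output from standard input and prints one line per report.
    for ts, name, pid, secs in find_hung_tasks(sys.stdin.read()):
        print(f"{ts:.6f} {name} (pid {pid}) blocked > {secs}s")
```

Saved under a hypothetical name such as `find_hung.py`, it can be fed the kernel log with `dmesg | python3 find_hung.py` to see which tasks were reported as blocked and when.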
The full bug report is attached.
nvidia-bug-report.log.gz (1.1 MB)
We then ran the GPU-burn test for a long time and no issue occurred, so it does not seem to be a hardware issue.
Could you give us any suggestions? How can we work around or solve this issue?
Thanks