440.84 driver, GPU lock up

Bob_DL · August 20, 2020, 1:33am

Hi,
We have recently seen random GPU hangs on some of our GPUs.
When the issue happens, GPU utilization is stuck at 100%, but neither graphics nor CUDA works any longer. dmesg shows errors like the following:
[369880.646296] INFO: task nv_queue:2351 blocked for more than 120 seconds.

[369880.646299] Tainted: P OE 4.15.0-43-generic #46~16.04.1-Ubuntu

[369880.646300] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

[369880.646301] nv_queue D 0 2351 2 0x80000000

[369880.646302] Call Trace:

[369880.646307] __schedule+0x3d6/0x8b0

[369880.646309] ? __enqueue_entity+0x5c/0x60

[369880.646311] schedule+0x36/0x80

[369880.646312] schedule_timeout+0x1db/0x370

[369880.646441] ? os_acquire_spinlock+0x12/0x20 [nvidia]

[369880.646444] ? check_preempt_curr+0x54/0x90

[369880.646445] ? ttwu_do_wakeup+0x1e/0x150

[369880.646447] __down+0x8a/0xe0

[369880.646448] down+0x41/0x50

[369880.646449] ? down+0x41/0x50

[369880.646536] os_acquire_semaphore+0x38/0x40 [nvidia]

[369880.646588] _nv008372rm+0x4f6/0x670 [nvidia]

[369880.646640] ? _nv034059rm+0x45/0xc0 [nvidia]

[369880.646771] ? _nv007815rm+0x53/0x170 [nvidia]

[369880.646853] ? rm_execute_work_item+0x3d/0xc0 [nvidia]

[369880.646902] ? os_execute_work_item+0x4a/0x60 [nvidia]

[369880.646951] ? _main_loop+0x94/0x140 [nvidia]

[369880.646955] ? kthread+0x105/0x140

[369880.647003] ? _raw_q_schedule+0x70/0x70 [nvidia]

[369880.647005] ? kthread_destroy_worker+0x50/0x50

[369880.647006] ? ret_from_fork+0x35/0x40

[369880.647042] INFO: task PhoneFinder:32456 blocked for more than 120 seconds.

[369880.647044] Tainted: P OE 4.15.0-43-generic #46~16.04.1-Ubuntu

The full bug report is attached.
nvidia-bug-report.log.gz (1.1 MB)

We then ran a gpu-burn test for a long time and no issue occurred, so this does not seem to be a hardware problem.

Do you have any suggestions? How can we work around or solve this issue?

Thanks

Bob_DL · August 21, 2020, 8:07am

Hi,
Could anybody help check this issue?
