440.33.01 driver, process random hang with uvm_va_space_destroy

xinglong940713 · April 10, 2020, 8:17am

We have some GPUs on which processes randomly lock up when they exit.

These GPUs are running PyTorch jobs, and the processes randomly hang on exit.

Driver Version: 440.33.01 CUDA Version: 10.2 HW : TITAN X (Pascal)
System : Ubuntu 16.04 Kernel : 4.15.0-74-generic

The process stack looks like this:
[<0>] _raw_q_flush+0x6f/0x90 [nvidia_uvm]
[<0>] nv_kthread_q_flush+0x19/0x70 [nvidia_uvm]
[<0>] uvm_va_space_destroy+0x3b9/0x440 [nvidia_uvm]
[<0>] uvm_release.isra.7+0x7c/0x90 [nvidia_uvm]
[<0>] uvm_release_entry+0x4d/0xa0 [nvidia_uvm]
[<0>] __fput+0xea/0x220
[<0>] ____fput+0xe/0x10
[<0>] task_work_run+0x8a/0xb0
[<0>] do_exit+0x2e9/0xbd0
[<0>] do_group_exit+0x43/0xb0
[<0>] get_signal+0x169/0x820
[<0>] do_signal+0x37/0x730
[<0>] exit_to_usermode_loop+0x80/0xd0
[<0>] do_syscall_64+0x100/0x130
[<0>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[<0>] 0xffffffffffffffff

If more information is needed, I will be happy to provide it.

nvidia-bug-report.log (3.7 MB)

generix · April 10, 2020, 7:09pm

What does “some GPUs” mean? Not all are doing this? Are those the same models?

xinglong940713 · April 11, 2020, 7:14am

We have lots of machines with 8 GPU cards each, and this problem occurs randomly on different machines and different GPUs.

When we kill the program, its state becomes D: it hangs and keeps the fd of the device open.

The program usually uses 2 or 4 GPUs, but judging from the nvidia-smi output, it seems one GPU is broken.
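To narrow down which processes are in the state described above (state D while still holding a device fd), a minimal Python sketch along these lines may help. It only reads the standard Linux `/proc/<pid>/status` and `/proc/<pid>/fd` entries; the function names are hypothetical, and inspecting other users' processes generally needs root.

```python
import os
import re


def parse_state(status_text):
    """Return the one-letter state code from the text of /proc/<pid>/status."""
    match = re.search(r"^State:\s+(\S)", status_text, re.MULTILINE)
    return match.group(1) if match else None


def nvidia_fds(pid, proc_root="/proc"):
    """List the /dev/nvidia* paths that the given PID still holds open."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    held = set()
    try:
        for fd in os.listdir(fd_dir):
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # fd was closed while we were looking
            if target.startswith("/dev/nvidia"):
                held.add(target)
    except OSError:
        pass  # process is gone, or we are not allowed to inspect it
    return sorted(held)


def find_stuck_gpu_processes(proc_root="/proc"):
    """Return (pid, devices) for processes in state D that hold NVIDIA device fds."""
    results = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return results
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                state = parse_state(f.read())
        except OSError:
            continue
        if state == "D":  # uninterruptible sleep
            devices = nvidia_fds(entry, proc_root)
            if devices:
                results.append((int(entry), devices))
    return results


if __name__ == "__main__":
    for pid, devices in find_stuck_gpu_processes():
        print(pid, devices)
```

For each PID it reports, `/proc/<pid>/stack` (readable as root) gives the kind of kernel stack posted at the top of this thread.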

xinglong940713 · April 11, 2020, 7:18am

Yes. The same program on the same models usually works fine, but sometimes it breaks.

generix · April 11, 2020, 10:40am

Looking at the logs, you’re simply out of memory, processes crashing or being killed by the kernel’s oom killer. Also, you have no swap enabled.
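The two observations above (OOM kills and no swap) can be checked on other machines with a short Python sketch like the following. It assumes the kernel log contains the usual "Killed process <pid> (<name>)" line that the OOM killer prints, and that `/proc/meminfo` has a `SwapTotal:` line; the function names are hypothetical.

```python
import re

# The OOM killer logs a line containing "Killed process <pid> (<name>)".
OOM_KILL = re.compile(r"Killed process (\d+) \(([^)]*)\)")


def swap_total_kib(meminfo_text):
    """Return SwapTotal in KiB from the text of /proc/meminfo, or None if absent."""
    match = re.search(r"^SwapTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def oom_victims(kernel_log_text):
    """Return (pid, name) for every process the OOM killer reports as killed."""
    return [(int(pid), name) for pid, name in OOM_KILL.findall(kernel_log_text)]
```

Feed `swap_total_kib` the contents of `/proc/meminfo` (a result of 0 means no swap is configured) and `oom_victims` the output of `dmesg` or the kernel log section of nvidia-bug-report.log.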

xinglong940713 · April 11, 2020, 12:06pm

Yes, but why can’t the process exit successfully, and why does it hang in UVM?

generix · April 11, 2020, 12:30pm

Though the backtrace doesn’t explicitly show it, I suspect the kernel also needs to allocate some memory in order to release it. Please check whether this applies:
https://bugs.schedmd.com/show_bug.cgi?id=5092#c3

xinglong940713 · April 11, 2020, 12:47pm

Thanks! I will try it

xinglong940713 · April 11, 2020, 1:20pm

I found another process in the same state.
But it was not killed by the OOM killer…
Can you help me diagnose that? Thanks a lot!
nvidia-bug-report.log (3.7 MB)

generix · April 11, 2020, 8:54pm

To make sure, do you have the nvidia-persistenced daemon running correctly? There’s a GPU without load that is not throttling down.

xinglong940713 · April 12, 2020, 5:40am

We don’t use nvidia-persistenced, but nvidia-docker-plugin is running.
It seems that nvidia-docker-plugin also keeps an fd of the nvidia device open.

xinglong940713 · April 13, 2020, 3:00am

gentle ping?

generix · April 13, 2020, 3:13pm

I wouldn’t count on docker keeping the driver alive. Please set up nvidia-persistenced correctly.
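One way to check the persistence setting across many machines is to query it through nvidia-smi, as in this Python sketch. It assumes `nvidia-smi` is on the PATH and supports the `persistence_mode` field of `--query-gpu`; the function names are hypothetical. Note that this reports the per-GPU persistence mode only, and does not by itself prove that the nvidia-persistenced daemon is the one keeping it on.

```python
import subprocess

# Ask nvidia-smi for "index, persistence_mode" per GPU, one CSV row each.
QUERY = ["nvidia-smi", "--query-gpu=index,persistence_mode", "--format=csv,noheader"]


def parse_persistence(csv_text):
    """Map GPU index -> True if persistence mode is reported as Enabled."""
    modes = {}
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        index, mode = [field.strip() for field in line.split(",", 1)]
        modes[int(index)] = (mode == "Enabled")
    return modes


def gpus_without_persistence():
    """Run nvidia-smi and return the indices of GPUs with persistence mode off."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return [i for i, enabled in sorted(parse_persistence(out).items()) if not enabled]
```

Calling `gpus_without_persistence()` on a node returns an empty list when every GPU reports persistence mode as enabled.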

xinglong940713 · April 14, 2020, 6:32am

Thanks! I will try it

chenrui17 · March 2, 2022, 7:06am

@xinglong940713 Hello, I ran into the same problem as you. Did you resolve it in the end?

[81976.938407] NVRM: Xid (PCI:0000:2a:00): 13, pid=54348, Graphics SM Warp Exception on (GPC 5, TPC 5, SM 1): MMU NACK Errors
[81976.938411] NVRM: Xid (PCI:0000:2a:00): 13, pid=54348, Graphics SM Global Exception on (GPC 5, TPC 5, SM 1): Multiple Warp Errors
[81976.938414] NVRM: Xid (PCI:0000:2a:00): 13, pid=54348, Graphics Exception: ESR 0x52efb0=0xb090020 0x52efb4=0x4 0x52efa8=0x4c1eb72 0x52efac=0x174
[81976.966542] NVRM: Xid (PCI:0000:2a:00): 13, pid=54348, Graphics Exception: ChID 0010, Class 0000c3c0, Offset 00000510, Data 00419e84
[81976.992350] NVRM: Xid (PCI:0000:2a:00): 31, pid=54348, Ch 00000010, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_4 faulted @ 0x0_00000000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_WRITE
[105319.485637] mce: [Hardware Error]: Machine check events logged
[145799.382550] INFO: task hdfs_window_sou:80743 blocked for more than 120 seconds.
[145799.382553] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[145799.382554] hdfs_window_sou D 0000000000000000 0 80743 1 0x00000006
[145799.382558] Call Trace:
[145799.382569] [] schedule+0x29/0x80
[145799.382572] [] schedule_timeout+0x1f1/0x290
[145799.382578] [] ? x2apic_send_IPI_mask+0x13/0x20
[145799.382583] [] ? try_to_wake_up+0x1e9/0x300
[145799.382585] [] wait_for_completion+0x9f/0x110
[145799.382587] [] ? wake_up_state+0x20/0x20
[145799.382602] [] _raw_q_flush+0x5d/0x70 [nvidia_uvm]
[145799.382606] [] ? _raw_q_flush+0x70/0x70 [nvidia_uvm]
[145799.382611] [] nv_kthread_q_flush+0x19/0x90 [nvidia_uvm]
[145799.382619] [] uvm_va_space_destroy+0x2dc/0x420 [nvidia_uvm]
[145799.382623] [] uvm_release.isra.5+0x80/0xa0 [nvidia_uvm]
[145799.382627] [] uvm_release_entry+0x45/0xa0 [nvidia_uvm]
[145799.382631] [] __fput+0xec/0x270
[145799.382632] [] ____fput+0xe/0x10
[145799.382636] [] task_work_run+0xc4/0xe0
[145799.382641] [] do_exit+0x2c7/0xa80
[145799.382642] [] do_group_exit+0x3f/0xb0
[145799.382646] [] get_signal_to_deliver+0x1cb/0x5d0
[145799.382649] [] do_signal+0x48/0x6a0
[145799.382652] [] ? __do_page_fault+0x241/0x510
[145799.382653] [] do_notify_resume+0x5f/0xb0
[145799.382656] [] int_signal+0x12/0x17
[145919.473759] INFO: task hdfs_window_sou:80743 blocked for more than 120 seconds.
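For logs like the one above, a small Python sketch can pull out the Xid events and the hung-task reports so they can be compared by PID and PCI address. The regular expressions are written against the line formats shown in this post only; the function names are hypothetical.

```python
import re

# e.g. "NVRM: Xid (PCI:0000:2a:00): 13, pid=54348, Graphics SM Warp Exception ..."
XID = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+)(?:, pid=(\d+))?")

# e.g. "INFO: task hdfs_window_sou:80743 blocked for more than 120 seconds."
HUNG = re.compile(r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds")


def parse_xids(log_text):
    """Return (pci_address, xid_code, pid or None) for each Xid line."""
    return [
        (m.group(1), int(m.group(2)), int(m.group(3)) if m.group(3) else None)
        for m in XID.finditer(log_text)
    ]


def parse_hung_tasks(log_text):
    """Return (task_name, pid, seconds) for each hung-task report."""
    return [(m.group(1), int(m.group(2)), int(m.group(3))) for m in HUNG.finditer(log_text)]
```

Run on the log in this post, it shows that the Xid 13 and Xid 31 events are reported against pid 54348, while the task blocked in `uvm_va_space_destroy` is pid 80743.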
