440.33.01 driver, processes randomly hang in uvm_va_space_destroy on exit

We have some GPUs that randomly lock up a process when that process exits.

These GPUs run PyTorch jobs, and the processes randomly hang on exit.

Driver Version: 440.33.01 CUDA Version: 10.2 HW : TITAN X (Pascal)
System : Ubuntu 16.04 Kernel : 4.15.0-74-generic

The process stack looks like this:
[<0>] _raw_q_flush+0x6f/0x90 [nvidia_uvm]
[<0>] nv_kthread_q_flush+0x19/0x70 [nvidia_uvm]
[<0>] uvm_va_space_destroy+0x3b9/0x440 [nvidia_uvm]
[<0>] uvm_release.isra.7+0x7c/0x90 [nvidia_uvm]
[<0>] uvm_release_entry+0x4d/0xa0 [nvidia_uvm]
[<0>] __fput+0xea/0x220
[<0>] ____fput+0xe/0x10
[<0>] task_work_run+0x8a/0xb0
[<0>] do_exit+0x2e9/0xbd0
[<0>] do_group_exit+0x43/0xb0
[<0>] get_signal+0x169/0x820
[<0>] do_signal+0x37/0x730
[<0>] exit_to_usermode_loop+0x80/0xd0
[<0>] do_syscall_64+0x100/0x130
[<0>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[<0>] 0xffffffffffffffff
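
For anyone who wants to check for the same symptom, here is a minimal Python sketch that dumps the kernel stack of every task stuck in D (uninterruptible sleep) state. It assumes root privileges and a kernel that exposes /proc/<pid>/stack, as Ubuntu's stock 4.15 kernel does:

```python
#!/usr/bin/env python3
"""Dump the kernel stack of every task in D (uninterruptible sleep) state.

Minimal sketch, assuming root privileges and a kernel that exposes
/proc/<pid>/stack (true for Ubuntu's stock 4.15 kernel).
"""
import os

def task_state(pid):
    # Field 3 of /proc/<pid>/stat is the scheduler state: R, S, D, Z, ...
    with open(f"/proc/{pid}/stat") as f:
        return f.read().rsplit(")", 1)[1].split()[0]

for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        if task_state(pid) != "D":
            continue
        with open(f"/proc/{pid}/comm") as f:
            comm = f.read().strip()
        with open(f"/proc/{pid}/stack") as f:
            stack = f.read()
    except OSError:
        continue  # task exited during the scan, or not enough privileges
    print(f"=== {comm} (pid {pid}) ===")
    print(stack)
```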

If more information is needed, I will be happy to provide it.
nvidia-bug-report.log (3.7 MB)

What does “some GPUs” mean? Not all are doing this? Are those the same models?

We have many machines with 8 GPU cards each, and this problem occurs randomly on different machines and different GPUs.

When we kill the program, its state becomes D; it hangs and keeps the fd of the device open.

The program usually uses 2 or 4 GPUs, but according to nvidia-smi one GPU appears to be broken.

Yes, it's the same program and the same models; it usually works fine but sometimes breaks.
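
To confirm which exiting task is still pinning a device node, here is a minimal sketch that lists every process holding an open fd on /dev/nvidia* together with its state. It assumes it is run as root so /proc/<pid>/fd is readable for all tasks:

```python
#!/usr/bin/env python3
"""List processes that hold open file descriptors on /dev/nvidia* nodes.

Minimal sketch, assuming it is run as root so /proc/<pid>/fd is readable
for every task; useful for spotting an exiting process that is stuck in
D state but still pins the device.
"""
import os

for pid in filter(str.isdigit, os.listdir("/proc")):
    fd_dir = f"/proc/{pid}/fd"
    try:
        links = [os.readlink(os.path.join(fd_dir, fd)) for fd in os.listdir(fd_dir)]
        nvidia = sorted({l for l in links if l.startswith("/dev/nvidia")})
        if not nvidia:
            continue
        with open(f"/proc/{pid}/stat") as f:
            state = f.read().rsplit(")", 1)[1].split()[0]
        with open(f"/proc/{pid}/comm") as f:
            comm = f.read().strip()
    except OSError:
        continue  # process exited mid-scan, or insufficient privileges
    print(f"pid {pid:>7}  state {state}  {comm}: {', '.join(nvidia)}")
```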

Looking at the logs, you're simply running out of memory; processes are crashing or being killed by the kernel's OOM killer. Also, you have no swap enabled.
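
Two quick things worth checking on an affected machine are whether any swap is configured and whether the OOM killer has fired. A minimal sketch, assuming the kernel ring buffer is readable via `dmesg` (kernel.dmesg_restrict may require root):

```python
#!/usr/bin/env python3
"""Check whether swap is configured and whether the OOM killer has fired.

Minimal sketch; reading the kernel log via `dmesg` may require root
depending on kernel.dmesg_restrict.
"""
import re
import subprocess

# SwapTotal in /proc/meminfo shows whether any swap is configured at all.
with open("/proc/meminfo") as f:
    meminfo = dict(line.split(":", 1) for line in f)
print("SwapTotal:", meminfo["SwapTotal"].strip())

# Look for OOM-killer activity in the kernel log.
dmesg = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
for line in dmesg.splitlines():
    if re.search(r"out of memory|oom-killer|oom_reaper", line, re.IGNORECASE):
        print(line)
```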

Yes, but why can't the process exit successfully instead of hanging in uvm?

Though the backtrace doesn't explicitly say so, I suspect the kernel also needs to allocate some memory in order to release it. Please check if this applies:
https://bugs.schedmd.com/show_bug.cgi?id=5092#c3

Thanks! I will try it

I found another process hung in the same way, but it was not killed by the OOM killer…
Can you help me diagnose it? Thanks a lot!
nvidia-bug-report.log (3.7 MB)

To make sure: do you have the nvidia-persistenced daemon running correctly? There's a GPU without load that isn't throttling down.

We don't use nvidia-persistenced, but nvidia-docker-plugin is running.
It seems that nvidia-docker-plugin also keeps an fd on the nvidia device open.

gentle ping?

I wouldn’t count on docker keeping the driver alive. Please set up nvidia-persistenced correctly.
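
To verify the setting, here is a minimal sketch that queries persistence mode through nvidia-smi and enables it where it is off. Note that `nvidia-smi -pm 1` is the legacy per-boot method shown only for illustration; the proper fix is running the nvidia-persistenced daemon as a service. It assumes nvidia-smi is on PATH and root privileges for the enable step:

```python
#!/usr/bin/env python3
"""Check (and, if needed, enable) persistence mode on all GPUs.

Minimal sketch using nvidia-smi; the recommended setup is the
nvidia-persistenced daemon running as a service, `nvidia-smi -pm 1` is
the legacy fallback shown here for illustration.
"""
import subprocess

def query_persistence():
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,persistence_mode",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True).stdout
    return [line.split(", ") for line in out.strip().splitlines()]

for index, mode in query_persistence():
    print(f"GPU {index}: persistence mode {mode}")
    if mode.strip() != "Enabled":
        # Requires root; the setting is lost on reboot unless the
        # nvidia-persistenced service re-applies it.
        subprocess.run(["nvidia-smi", "-i", index, "-pm", "1"], check=True)
```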

Thanks! I will try it

@xinglong940713 Hello, I've hit the same problem as you. Did you resolve it in the end?


[81976.938407] NVRM: Xid (PCI:0000:2a:00): 13, pid=54348, Graphics SM Warp Exception on (GPC 5, TPC 5, SM 1): MMU NACK Errors
[81976.938411] NVRM: Xid (PCI:0000:2a:00): 13, pid=54348, Graphics SM Global Exception on (GPC 5, TPC 5, SM 1): Multiple Warp Errors
[81976.938414] NVRM: Xid (PCI:0000:2a:00): 13, pid=54348, Graphics Exception: ESR 0x52efb0=0xb090020 0x52efb4=0x4 0x52efa8=0x4c1eb72 0x52efac=0x174
[81976.966542] NVRM: Xid (PCI:0000:2a:00): 13, pid=54348, Graphics Exception: ChID 0010, Class 0000c3c0, Offset 00000510, Data 00419e84
[81976.992350] NVRM: Xid (PCI:0000:2a:00): 31, pid=54348, Ch 00000010, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_4 faulted @ 0x0_00000000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_WRITE
[105319.485637] mce: [Hardware Error]: Machine check events logged
[145799.382550] INFO: task hdfs_window_sou:80743 blocked for more than 120 seconds.
[145799.382553] “echo 0 > /proc/sys/kernel/hung_task_timeout_secs” disables this message.
[145799.382554] hdfs_window_sou D 0000000000000000 0 80743 1 0x00000006
[145799.382558] Call Trace:
[145799.382569] [] schedule+0x29/0x80
[145799.382572] [] schedule_timeout+0x1f1/0x290
[145799.382578] [] ? x2apic_send_IPI_mask+0x13/0x20
[145799.382583] [] ? try_to_wake_up+0x1e9/0x300
[145799.382585] [] wait_for_completion+0x9f/0x110
[145799.382587] [] ? wake_up_state+0x20/0x20
[145799.382602] [] _raw_q_flush+0x5d/0x70 [nvidia_uvm]
[145799.382606] [] ? _raw_q_flush+0x70/0x70 [nvidia_uvm]
[145799.382611] [] nv_kthread_q_flush+0x19/0x90 [nvidia_uvm]
[145799.382619] [] uvm_va_space_destroy+0x2dc/0x420 [nvidia_uvm]
[145799.382623] [] uvm_release.isra.5+0x80/0xa0 [nvidia_uvm]
[145799.382627] [] uvm_release_entry+0x45/0xa0 [nvidia_uvm]
[145799.382631] [] __fput+0xec/0x270
[145799.382632] [] ____fput+0xe/0x10
[145799.382636] [] task_work_run+0xc4/0xe0
[145799.382641] [] do_exit+0x2c7/0xa80
[145799.382642] [] do_group_exit+0x3f/0xb0
[145799.382646] [] get_signal_to_deliver+0x1cb/0x5d0
[145799.382649] [] do_signal+0x48/0x6a0
[145799.382652] [] ? __do_page_fault+0x241/0x510
[145799.382653] [] do_notify_resume+0x5f/0xb0
[145799.382656] [] int_signal+0x12/0x17
[145919.473759] INFO: task hdfs_window_sou:80743 blocked for more than 120 seconds.