440.33.01 driver: processes randomly hang in uvm_va_space_destroy

We have some GPUs that randomly lock up a process when the process exits.

These GPUs are running PyTorch jobs, which randomly hang on exit.

Driver Version: 440.33.01
CUDA Version: 10.2
HW: TITAN X (Pascal)
System: Ubuntu 16.04
Kernel: 4.15.0-74-generic

The process stack looks like this:
[<0>] _raw_q_flush+0x6f/0x90 [nvidia_uvm]
[<0>] nv_kthread_q_flush+0x19/0x70 [nvidia_uvm]
[<0>] uvm_va_space_destroy+0x3b9/0x440 [nvidia_uvm]
[<0>] uvm_release.isra.7+0x7c/0x90 [nvidia_uvm]
[<0>] uvm_release_entry+0x4d/0xa0 [nvidia_uvm]
[<0>] __fput+0xea/0x220
[<0>] ____fput+0xe/0x10
[<0>] task_work_run+0x8a/0xb0
[<0>] do_exit+0x2e9/0xbd0
[<0>] do_group_exit+0x43/0xb0
[<0>] get_signal+0x169/0x820
[<0>] do_signal+0x37/0x730
[<0>] exit_to_usermode_loop+0x80/0xd0
[<0>] do_syscall_64+0x100/0x130
[<0>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[<0>] 0xffffffffffffffff

If more information is needed, I will be happy to provide it.
nvidia-bug-report.log (3.7 MB)

What does “some GPUs” mean? Not all are doing this? Are those the same models?

We have lots of machines with 8 GPU cards each. The problem occurs randomly, on different machines and on different GPUs.

When we kill the program, its state becomes D (uninterruptible sleep), and it hangs while keeping the fd of the device open.

The program usually uses 2 or 4 GPUs, but according to nvidia-smi, one GPU appears to be broken.

Yes. Using the same program and the same models usually works, but sometimes it breaks.
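
For anyone trying to reproduce the observation above, here is a minimal sketch of how the D-state processes could be listed from /proc. The function names (`parse_stat_state`, `find_d_state`, `report`) are made up for illustration; the only things it relies on are the standard `/proc/<pid>/stat` and `/proc/<pid>/stack` files, and reading the latter normally requires root.

```python
import os

def parse_stat_state(stat_text):
    """Return (comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the first '(' and the last ')' instead of splitting on spaces.
    lparen = stat_text.index("(")
    rparen = stat_text.rindex(")")
    comm = stat_text[lparen + 1:rparen]
    state = stat_text[rparen + 1:].split()[0]
    return comm, state

def find_d_state(proc_root="/proc"):
    """Return a list of (pid, comm) for processes currently in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                comm, state = parse_stat_state(f.read())
        except (OSError, ValueError):
            continue  # process exited mid-scan, or the line was unreadable
        if state == "D":
            found.append((int(entry), comm))
    return found

def report(proc_root="/proc"):
    """Print each D-state process and, if permitted, its kernel stack."""
    for pid, comm in find_d_state(proc_root):
        print(pid, comm)
        try:
            with open(os.path.join(proc_root, str(pid), "stack")) as f:
                print(f.read())
        except OSError as exc:
            print(f"  (cannot read kernel stack: {exc})")
```

Calling `report()` as root on an affected machine should print a stack similar to the one in the first post if the process is stuck in the same place. A process that is only briefly in D state (normal disk I/O, for example) will also show up, so it is worth running it a few times.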

Looking at the logs, you’re simply out of memory: processes are crashing or being killed by the kernel’s OOM killer. Also, you have no swap enabled.
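
Both points can be checked without the full bug report. The sketch below is only an illustration: `swap_total_kb` and `oom_victims` are hypothetical helper names, and the regular expression assumes the "Killed process <pid> (<name>)" wording that the kernel's OOM killer writes to the kernel log.

```python
import re

def swap_total_kb(meminfo_text):
    """Return SwapTotal in kB from /proc/meminfo contents, or None if absent."""
    match = re.search(r"^SwapTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    return int(match.group(1)) if match else None

def oom_victims(kernel_log_text):
    """Return (pid, name) pairs for 'Killed process' lines in a kernel log."""
    pattern = re.compile(r"Killed process (\d+) \(([^)]*)\)")
    return [(int(pid), name) for pid, name in pattern.findall(kernel_log_text)]
```

To use it, pass the contents of /proc/meminfo to `swap_total_kb` (a result of 0 means no swap is configured) and the saved output of `dmesg` to `oom_victims`.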

Yes, but why can the process not exit cleanly, and why does it hang in UVM?

Though the backtrace doesn’t say so explicitly, I suspect the kernel also needs to allocate some memory in order to release it. Please check if this applies:
https://bugs.schedmd.com/show_bug.cgi?id=5092#c3

Thanks! I will try it

I found another process in the same state.
But it was not killed by the OOM killer…
Can you help me diagnose that? Thanks a lot!
nvidia-bug-report.log (3.7 MB)

To make sure, do you have the nvidia-persistenced daemon correctly running? There’s a GPU with no load that is not throttling down.

We don’t use nvidia-persistenced, but nvidia-docker-plugin is running.
It seems that nvidia-docker-plugin also keeps the fd of the nvidia device open.
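
A rough way to see which processes hold the device files open is to walk the fd symlinks under /proc. This is a sketch, not a replacement for tools such as lsof; `nvidia_fd_holders` is a hypothetical name, and it only sees other users' processes when run as root.

```python
import os

def nvidia_fd_holders(proc_root="/proc", prefix="/dev/nvidia"):
    """Map pid -> sorted list of open paths that start with `prefix`."""
    holders = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process gone, or not permitted to inspect it
        targets = []
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # fd was closed while we were scanning
            if target.startswith(prefix):
                targets.append(target)
        if targets:
            holders[int(entry)] = sorted(targets)
    return holders
```

Because the prefix is `/dev/nvidia`, this also matches `/dev/nvidia-uvm` and `/dev/nvidiactl`, so both the hung job and the docker plugin should appear in the result if they really keep the devices open.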

gentle ping?

I wouldn’t count on docker keeping the driver alive. Please set up nvidia-persistenced correctly.
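
To verify the result after setting it up, the persistence mode and performance state of each GPU can be read with `nvidia-smi --query-gpu`. The sketch below assumes the `index`, `persistence_mode`, `pstate` and `utilization.gpu` query fields; the helper names are hypothetical, and treating "0% utilization while still in P0" as "not throttling down" is a heuristic, not something the driver reports.

```python
import subprocess

QUERY = "index,persistence_mode,pstate,utilization.gpu"

def parse_gpu_rows(csv_text):
    """Parse `--format=csv,noheader,nounits` output into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, persistence, pstate, util = [f.strip() for f in line.split(",")]
        rows.append({
            "index": int(index),
            "persistence": persistence == "Enabled",
            "pstate": pstate,
            # Utilization may be a non-numeric marker on a GPU in a bad state.
            "util": int(util) if util.isdigit() else None,
        })
    return rows

def query_gpus():
    """Run nvidia-smi and return the parsed rows (requires the driver)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_gpu_rows(out)

def idle_but_not_throttled(rows):
    """Heuristic: GPUs at 0% utilization that still sit in P0."""
    return [r["index"] for r in rows if r["util"] == 0 and r["pstate"] == "P0"]
```

With nvidia-persistenced running, every row returned by `query_gpus()` would be expected to have `persistence` set to True.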

Thanks! I will try it