Deadlock in NVIDIA driver

Hi, we are using the Diamond Visionics flight simulator with
the NVIDIA driver, rev 387.12, on an Intel x86_64 system running
Linux 4.4.86.

We are getting a lockup of X11 when the simulator is terminated
by typing the ESC key. A ‘ps’ command shows a simulator thread
stuck in uninterruptible sleep. A ‘bt’ command executed from
the ‘crash’ utility shows that the thread is blocked in the
kernel ‘down()’ service and that the ‘down()’ was invoked from
the NVIDIA driver/kernel interface library.

I wrote a debug kernel patch that lets me perform the missing
‘up()’ by writing to a /proc/$pid/task/$tid/sema file created by
the patch. When I perform this ‘up()’, the hung thread exits
successfully, X11 recovers, and everything (apparently) works fine.

Not only that, but I can then re-run the Diamond Visionics
simulator over and over, terminating it with the ESC key each
time, and all threads close down successfully from that point on.
After the next reboot, however, the first time the simulator is
terminated the hang occurs again, and the ‘up()’ needs to be
redone to get X11/NVIDIA going.

It is clear there is a path through the driver on which a
needed ‘up()’ is never executed, probably an error or
early-termination path triggered by a task exit that the
NVIDIA driver does not expect.

Please look into this and fix it.

Thanks,
Joe

PS: I have attached a ‘crash’ traceback of the kernel stack
of the hung thread. Hopefully it contains some clues as to
where and why the NVIDIA driver is getting hung up.

crash> bt 11003
PID: 11003 TASK: ffff88101a078000 CPU: 6 COMMAND: "IG64"
#0 [ffff8800357e32e0] __schedule at ffffffff81f5a846
#1 [ffff8800357e3350] schedule at ffffffff81f5b347
#2 [ffff8800357e3370] schedule_timeout at ffffffff81f5f8f5
#3 [ffff8800357e3458] __down at ffffffff81f5c2d9
#4 [ffff8800357e34a0] down at ffffffff810c2163
#5 [ffff8800357e34c8] os_acquire_semaphore at ffffffffa0140017 [nvidia]
#6 [ffff8800357e34e8] _nv007928rm at ffffffffa06ad14d [nvidia]
#7 [ffff8800357e3518] _nv032949rm at ffffffffa06ac18f [nvidia]
#8 [ffff8800357e3538] _nv007808rm at ffffffffa06f4be4 [nvidia]
#9 [ffff8800357e3548] _nv007807rm at ffffffffa06f4a78 [nvidia]
#10 [ffff8800357e3578] _nv001117rm at ffffffffa0746d90 [nvidia]
#11 [ffff8800357e3598] _nv001118rm at ffffffffa0746f57 [nvidia]
#12 [ffff8800357e35c8] _nv001265rm at ffffffffa0744cb2 [nvidia]
#13 [ffff8800357e35e8] _nv001286rm at ffffffffa0744e1b [nvidia]
#14 [ffff8800357e3618] _nv001285rm at ffffffffa074503e [nvidia]
#15 [ffff8800357e3658] _nv033138rm at ffffffffa0541b67 [nvidia]
#16 [ffff8800357e3688] _nv003508rm at ffffffffa06ba97d [nvidia]
#17 [ffff8800357e3698] _nv003816rm at ffffffffa06afb95 [nvidia]
#18 [ffff8800357e36b8] _nv010693rm at ffffffffa06addc0 [nvidia]
#19 [ffff8800357e36e8] _nv010692rm at ffffffffa06adf38 [nvidia]
#20 [ffff8800357e3718] _nv032952rm at ffffffffa06ac359 [nvidia]
#21 [ffff8800357e3738] _nv007810rm at ffffffffa06f4d6f [nvidia]
#22 [ffff8800357e3768] _nv007807rm at ffffffffa06f4a78 [nvidia]
#23 [ffff8800357e3798] _nv010692rm at ffffffffa06adf04 [nvidia]
#24 [ffff8800357e37c8] _nv032952rm at ffffffffa06ac359 [nvidia]
#25 [ffff8800357e37e8] _nv007810rm at ffffffffa06f4d6f [nvidia]
#26 [ffff8800357e3818] _nv007807rm at ffffffffa06f4a78 [nvidia]
#27 [ffff8800357e3848] _nv010693rm at ffffffffa06add05 [nvidia]
#28 [ffff8800357e3878] _nv010692rm at ffffffffa06adf38 [nvidia]
#29 [ffff8800357e38a8] _nv032952rm at ffffffffa06ac359 [nvidia]
#30 [ffff8800357e38c8] _nv007810rm at ffffffffa06f4d6f [nvidia]
#31 [ffff8800357e38f8] _nv007807rm at ffffffffa06f4a78 [nvidia]
#32 [ffff8800357e3928] _nv010692rm at ffffffffa06adf04 [nvidia]
#33 [ffff8800357e3958] _nv032952rm at ffffffffa06ac359 [nvidia]
#34 [ffff8800357e3978] _nv007810rm at ffffffffa06f4d6f [nvidia]
#35 [ffff8800357e39a8] _nv007807rm at ffffffffa06f4a78 [nvidia]
#36 [ffff8800357e39d8] _nv010671rm at ffffffffa06d4881 [nvidia]
#37 [ffff8800357e3a08] _nv007923rm at ffffffffa06abb21 [nvidia]
#38 [ffff8800357e3a28] _nv032949rm at ffffffffa06ac1d9 [nvidia]
#39 [ffff8800357e3a48] _nv007808rm at ffffffffa06f4be4 [nvidia]
#40 [ffff8800357e3a58] _nv007807rm at ffffffffa06f4a78 [nvidia]
#41 [ffff8800357e3a88] _nv001153rm at ffffffffa0781e92 [nvidia]
#42 [ffff8800357e3ab8] rm_free_unused_clients at ffffffffa07852b1 [nvidia]
#43 [ffff8800357e3bc8] nvidia_frontend_close at ffffffffa01343fc [nvidia]
#44 [ffff8800357e3be8] __fput at ffffffff812160a2
#45 [ffff8800357e3c30] ____fput at ffffffff8121622e
#46 [ffff8800357e3c40] task_work_run at ffffffff81090346
#47 [ffff8800357e3c80] do_exit at ffffffff81073767
#48 [ffff8800357e3cf8] do_group_exit at ffffffff8107418c
#49 [ffff8800357e3d28] get_signal at ffffffff81080cd1
#50 [ffff8800357e3dd8] do_signal at ffffffff81005358
#51 [ffff8800357e3ee8] exit_to_usermode_loop at ffffffff8106ae18
#52 [ffff8800357e3f28] syscall_return_slowpath at ffffffff81002ce5
#53 [ffff8800357e3f50] int_ret_from_sys_call at ffffffff81f61e5a
RIP: 00007fffdd066d13 RSP: 00007fffbdf843b0 RFLAGS: 00000293
RAX: fffffffffffffffc RBX: 00007fffbdf84ad8 RCX: 00007fffdd066d13
RDX: 0000000000000080 RSI: 00007fffbdf84400 RDI: 000000000000000f
RBP: 00007fffbf4b74c0 R8: 0000000000000000 R9: 0000000000002afb
R10: 00000000ffffffff R11: 0000000000000293 R12: 00000000ffffffff
R13: 00007fffbdf84ad8 R14: 00007fffcc65d3f8 R15: 00007fffcc65d3b8
ORIG_RAX: 00000000000000e8 CS: 0033 SS: 002b
crash>