Rare futex lock inside libnvcuvid

I am using ffmpeg to do live transcodes with Nvidia hardware, and generally it works great. However, sometimes the process hangs and does nothing. I built a debug version of ffmpeg 4.3.1 to check what is going on, and it seems the process is blocked on a futex inside either libnvcuvid or libcuda. Here is the call stack.

Attaching to process 6581
[New LWP 6582]
[New LWP 6732]
[New LWP 6733]
[New LWP 6736]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x7ffc551ffc60)
at …/sysdeps/unix/sysv/linux/futex-internal.h:205
205 …/sysdeps/unix/sysv/linux/futex-internal.h: No such file or directory.
(gdb) up
#1 do_futex_wait (sem=sem@entry=0x7ffc551ffc60, abstime=0x0) at sem_waitcommon.c:111
111 sem_waitcommon.c: No such file or directory.
(gdb)
#2 0x00007f43238ac988 in __new_sem_wait_slow (sem=0x7ffc551ffc60, abstime=0x0) at sem_waitcommon.c:181
181 in sem_waitcommon.c
(gdb)
#3 0x00007f43179bf862 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
(gdb)
#4 0x00007f4317aea32d in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
(gdb)
#5 0x00007f4317979113 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
(gdb)
#6 0x00007f4317a3f9b7 in cuEventSynchronize () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
(gdb)
#7 0x00007f42f39c95df in ?? () from /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
(gdb)
#8 0x000055c554ea3edc in cuvid_handle_picture_decode (opaque=0x55c557873a00, picparams=0x55c558510700)
at libavcodec/cuviddec.c:337
337 ctx->internal_error = CHECK_CU(ctx->cvdl->cuvidDecodePicture(ctx->cudecoder, picparams));
(gdb)
#9 0x00007f42f39bd7a8 in ?? () from /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
(gdb)
#10 0x00007f42f3a2649f in ?? () from /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
(gdb)
#11 0x00007f42f3a26b55 in ?? () from /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
(gdb)
#12 0x00007f42f3a26e84 in ?? () from /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
(gdb)
#13 0x00007f42f39bd2a8 in ?? () from /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
(gdb)
#14 0x000055c554ea412b in cuvid_decode_packet (avctx=avctx@entry=0x55c557873a00,
avpkt=avpkt@entry=0x7ffc55200340) at libavcodec/cuviddec.c:421
421 ret = CHECK_CU(ctx->cvdl->cuvidParseVideoData(ctx->cuparser, &cupkt));
(gdb)
#15 0x000055c554ea4c2a in cuvid_output_frame (avctx=0x55c557873a00, frame=0x55c557f08d40) at libavcodec/cuviddec.c:468
468 ret = cuvid_decode_packet(avctx, &pkt);
(gdb)
#16 0x000055c554eb2361 in decode_receive_frame_internal (avctx=avctx@entry=0x55c557873a00, frame=0x55c557f08d40) at libavcodec/decode.c:554
554 ret = avctx->codec->receive_frame(avctx, frame);
(gdb)
#17 0x000055c554eb3138 in avcodec_send_packet (avctx=0x55c557873a00, avpkt=0x7ffc552006f0) at libavcodec/decode.c:614
614 ret = decode_receive_frame_internal(avctx, avci->buffer_frame);
(gdb)
#18 0x000055c554a79e67 in decode (pkt=0x7ffc552006f0, got_frame=0x7ffc5520066c, frame=, avctx=0x55c557873a00) at fftools/ffmpeg.c:2217
2217 ret = avcodec_send_packet(avctx, pkt);
(gdb)
#19 decode_video (decode_failed=, eof=, duration_pts=, got_output=, pkt=,
ist=) at fftools/ffmpeg.c:2359
2359 ret = decode(ist->dec_ctx, decoded_frame, got_output, pkt ? &avpkt : NULL);
(gdb)
#20 process_input_packet (ist=, pkt=0x7ffc552008d0, no_eof=0) at fftools/ffmpeg.c:2600
2600 ret = decode_video (ist, repeating ? NULL : &avpkt, &got_output, &duration_pts, !pkt,
(gdb)
#21 0x000055c554a7d91b in process_input (file_index=) at fftools/ffmpeg.c:4491
4491 process_input_packet(ist, &pkt, 0);
(gdb)
#22 transcode_step () at fftools/ffmpeg.c:4611
4611 ret = process_input(ist->file_index);
(gdb)
#23 transcode () at fftools/ffmpeg.c:4665
4665 ret = transcode_step();
(gdb)
#24 0x000055c554a5785e in main (argc=96, argv=0x7ffc552011b8) at fftools/ffmpeg.c:4870
4870 if (transcode() < 0)
(gdb)
Initial frame selected; you cannot go up.

The issue happens completely at random: a transcode might run for days without issue and then suddenly lock up in this state.

I am currently testing with drivers 455.32.0 and 450.80.02, but it also happened with the 440 and 418 series (I can’t remember the specific driver versions).

Could a developer from Nvidia look into this issue? Unfortunately there are no debug symbols for the Nvidia libraries, so I cannot tell what triggers the hang.
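The trace above covers only the main thread, while gdb attached to four more LWPs. Below is a minimal sketch, assuming a Linux /proc filesystem, that lists the name and scheduler state of every thread in the process, which may help show what the driver's helper threads are doing during a hang. The function name `list_threads` and the `proc_root` parameter are made up for illustration; only the /proc/PID/task/TID/comm and stat files are standard.

```python
import os
from typing import List, Tuple


def list_threads(pid: int, proc_root: str = "/proc") -> List[Tuple[int, str, str]]:
    """Return (tid, name, state) for every thread of `pid`, sorted by tid."""
    task_dir = os.path.join(proc_root, str(pid), "task")
    threads = []
    for entry in os.listdir(task_dir):
        if not entry.isdigit():
            continue
        base = os.path.join(task_dir, entry)
        try:
            with open(os.path.join(base, "comm")) as f:
                name = f.read().strip()
            with open(os.path.join(base, "stat")) as f:
                stat = f.read()
        except OSError:
            # The thread exited while we were scanning; skip it.
            continue
        # In stat, the state letter is the first field after the
        # parenthesised thread name, which may itself contain spaces.
        fields = stat[stat.rfind(")") + 1:].split()
        state = fields[0] if fields else "?"
        threads.append((int(entry), name, state))
    return sorted(threads)
```

For the process in the trace this would be called as `list_threads(6581)`. A state of S means the thread is sleeping, D means uninterruptible sleep, and R means running. Running gdb's `thread apply all bt` on the hung process gives the matching backtrace for each thread.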

I ran into the same issue using ffmpeg.
The hang is not caused by the POSIX semaphore itself. The cuda-evthandlr thread fails to receive the CUDA event, so it never releases the semaphore as it normally would after receiving the event.
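To make that failure mode concrete, here is a small sketch of the handshake as described. It is not Nvidia's implementation, which is closed source; `run_handshake`, the queue, and the event string are all hypothetical stand-ins. One thread waits on a semaphore and a handler thread releases it only after it has received an event. If the event never arrives, the waiter is never released.

```python
import queue
import threading


def run_handshake(deliver_event: bool, wait_timeout: float = 0.5) -> bool:
    """Return True if the waiter was released, False if it would have hung."""
    done = threading.Semaphore(0)        # stands in for the semaphore in sem_wait
    events: queue.Queue = queue.Queue()  # stands in for the driver's event channel
    stop = threading.Event()

    def event_handler() -> None:
        # Hypothetical stand-in for the cuda-evthandlr thread: it releases
        # the semaphore only after it has actually received an event.
        while not stop.is_set():
            try:
                events.get(timeout=0.05)
            except queue.Empty:
                continue
            done.release()
            return

    handler = threading.Thread(target=event_handler, daemon=True)
    handler.start()

    if deliver_event:
        events.put("decode finished")

    # The wait in the trace has abstime=0x0 (no timeout) and so blocks
    # forever; a timeout is used here only so the demonstration ends.
    released = done.acquire(timeout=wait_timeout)

    stop.set()
    handler.join()
    return released
```

`run_handshake(True)` is released promptly, while `run_handshake(False)` times out, which corresponds to the main thread sitting in `sem_wait` indefinitely in the trace above.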