We use NVIDIA GPUs (RTX A4000s) for video transcoding with ffmpeg under Linux. The ffmpeg instances run under a C++ daemon that launches them for live transcoding. Each server has dual Xeon CPUs (40 cores total) and 4x RTX A4000 cards.
We recently added a feature that brings in up to 6 live inputs and transcodes them into 1 slightly delayed output. Our streams are multicast video, and we have to de-interlace every stream because our live videos sometimes have ads inserted that may be interlaced.
We noticed very strange behavior when taking in 6 live inputs, de-interlacing them, and scaling each input before combining them into a single mosaic output. If I ran our daemon directly from a Linux console (Ubuntu 22.04 Server), the mosaic would run fine and keep running. However, if the daemon was started by systemd (through a startup script on reboot), ffmpeg would start and eventually use 100% of a single core for several seconds, and it was always stuck in this stack trace when I caught it:
#0 0x00007ff23ab6efbb in ?? () from /lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1
#1 0x00007ff23ab61c2d in ?? () from /lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1
#2 0x00007ff23ab64890 in ?? () from /lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1
#3 0x00007ff23ab64a77 in ?? () from /lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1
#4 0x00007ff23ab67656 in ?? () from /lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1
#5 0x00007ff23ab6942e in ?? () from /lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1
#6 0x00007ff23ab695c3 in ?? () from /lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1
#7 0x00007ff23aa6454f in ?? () from /lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1
#8 0x00007ff23aa646df in ?? () from /lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1
#9 0x00007ff23aa39a1b in ?? () from /lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1
#10 0x00007ff23aa3b841 in ?? () from /lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1
#11 0x00007ff23b10cdd3 in ?? () from /lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1
#12 0x00007ff23b10ce77 in ?? () from /lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1
#13 0x00007ff23a8f29d4 in ?? () from /lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1
#14 0x00007ff23a8fbf3e in ?? () from /lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1
#15 0x00007ff23a900992 in ?? () from /lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1
#16 0x00007ff23a901ef0 in ?? () from /lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1
#17 0x00007ff23a8f42bc in __cuda_CallJitEntryPoint () from /lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1
#18 0x00007ff2b4aefa7a in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#19 0x00007ff2b4af042a in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#20 0x00007ff2b4866732 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#21 0x00007ff2b4886475 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#22 0x00007ff2b4755b12 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#23 0x00007ff2b48c9903 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#24 0x0000555c4ea1ff76 in ff_cuda_load_module (avctx=avctx@entry=0x555c57267000, hwctx=<optimized out>, cu_module=cu_module@entry=0x555c572671d0, data=<optimized out>, length=<optimized out>) at libavfilter/cuda/load_helper.c:90
#25 0x0000555c4e7283d4 in cudascale_load_functions (ctx=0x555c57267000) at libavfilter/vf_scale_cuda.c:333
#26 cudascale_config_props (outlink=<optimized out>) at libavfilter/vf_scale_cuda.c:403
#27 0x0000555c4e7f7fa4 in avfilter_config_links (filter=0x555c57265580) at libavfilter/avfilter.c:306
#28 0x0000555c4e7f7f87 in avfilter_config_links (filter=0x555c57266dc0) at libavfilter/avfilter.c:295
#29 0x0000555c4e7f7f87 in avfilter_config_links (filter=0x555c57268340) at libavfilter/avfilter.c:295
#30 0x0000555c4e7f7f87 in avfilter_config_links (filter=0x555c572638c0) at libavfilter/avfilter.c:295
#31 0x0000555c4e7f7f87 in avfilter_config_links (filter=0x555c5d72de40) at libavfilter/avfilter.c:295
#32 0x0000555c4e7f7f87 in avfilter_config_links (filter=0x555c5d736f80) at libavfilter/avfilter.c:295
#33 0x0000555c4e7f7f87 in avfilter_config_links (filter=0x555c5dd2b200) at libavfilter/avfilter.c:295
#34 0x0000555c4e7f7f87 in avfilter_config_links (filter=0x555c5dd2c380) at libavfilter/avfilter.c:295
#35 0x0000555c4e7f7f87 in avfilter_config_links (filter=0x555c5dd2cc40) at libavfilter/avfilter.c:295
#36 0x0000555c4e7f7f87 in avfilter_config_links (filter=0x555c5dd2d780) at libavfilter/avfilter.c:295
#37 0x0000555c4e7f7f87 in avfilter_config_links (filter=0x555c5dd2e3c0) at libavfilter/avfilter.c:295
#38 0x0000555c4e7f7f87 in avfilter_config_links (filter=0x555c5dd35f80) at libavfilter/avfilter.c:295
#39 0x0000555c4e7f7f87 in avfilter_config_links (filter=0x555c57272e00) at libavfilter/avfilter.c:295
#40 0x0000555c4e7fc924 in graph_config_links (graph=<optimized out>, graph=<optimized out>, log_ctx=<optimized out>) at libavfilter/avfiltergraph.c:258
#41 avfilter_graph_config (graphctx=0x555c5d503f00, log_ctx=log_ctx@entry=0x0) at libavfilter/avfiltergraph.c:1221
#42 0x0000555c4e7bb0e9 in configure_filtergraph (fg=fg@entry=0x555c522b65c0) at fftools/ffmpeg_filter.c:1057
#43 0x0000555c4e7cf3ea in ifilter_send_frame (frame=0x555c5d4f8d00, ifilter=0x555c57a34dc0) at fftools/ffmpeg.c:2408
#44 send_frame_to_filters (ist=ist@entry=0x555c52387b00, decoded_frame=decoded_frame@entry=0x555c5d4f8d00) at fftools/ffmpeg.c:2505
#45 0x0000555c4e7cfaf6 in decode_video (ist=ist@entry=0x555c52387b00, pkt=pkt@entry=0x555c5d4ff400, got_output=got_output@entry=0x7ffc4db87798, duration_pts=duration_pts@entry=0x7ffc4db877a8, eof=eof@entry=1,
decode_failed=decode_failed@entry=0x7ffc4db8779c) at fftools/ffmpeg.c:2702
#46 0x0000555c4e7d1090 in process_input_packet (ist=0x555c52387b00, no_eof=no_eof@entry=0, pkt=0x0) at fftools/ffmpeg.c:2868
#47 0x0000555c4e7d2a1d in process_input (file_index=6) at fftools/ffmpeg.c:4607
#48 transcode_step () at fftools/ffmpeg.c:4962
#49 transcode () at fftools/ffmpeg.c:5016
#50 0x0000555c4e7a9c7e in main (argc=222, argv=0x7ffc4db87db8) at fftools/ffmpeg.c:5226
For some reason, when run from systemd, some JIT compilation of CUDA code gets stuck for long enough to cause buffer overflows of the video coming into ffmpeg.
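One thing that could be compared between the two launch methods is the process environment, since a systemd service usually starts with a much smaller environment than a console login. The sketch below is only a guess at where to look, not a confirmed cause. `show_cuda_env` is a hypothetical helper name; it reads `/proc/PID/environ` and prints HOME, USER, and any CUDA_* variables of a running process.

```shell
#!/bin/sh
# Hypothetical helper: print HOME, USER and CUDA_* variables of a running process.
# /proc/PID/environ holds the environment the process was started with,
# as NUL-separated KEY=VALUE pairs. Reading it needs the same user or root.
show_cuda_env() {
    tr '\0' '\n' < "/proc/$1/environ" \
        | grep -E '^(HOME|USER|CUDA_[A-Z_]+)=' \
        | sort
}

# When run as a script: ./show_cuda_env.sh PID
if [ "$#" -ge 1 ]; then
    show_cuda_env "$1"
fi
```

Running this against the ffmpeg PID in both cases would show whether HOME or CUDA_CACHE_PATH differs. By default the CUDA driver keeps its JIT cache under ~/.nv/ComputeCache, and CUDA_CACHE_PATH overrides that location, so a missing or unwritable HOME is one plausible reason for the PTX being recompiled on every start.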
Here is our info from nvidia-smi for the driver and CUDA versions:
NVIDIA-SMI 520.56.06 Driver Version: 520.56.06 CUDA Version: 11.8
I have linked ffmpeg against several different versions of CUDA, but all seem to exhibit the same behavior. I eventually got around this issue by removing all CUDA-based filters, which meant I was no longer de-interlacing, and by switching from scale_cuda to scale_npp.
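As a rough illustration of that workaround (a sketch only, not the exact production command), the per-input branches would drop yadif_cuda and use scale_npp in place of scale_cuda. `build_branches` is a hypothetical helper that generates those branches; it assumes an ffmpeg build with libnpp enabled and keeps the same 768x432 size and nv12 download as the original filtergraph.

```shell
#!/bin/sh
# Hypothetical helper: emit N per-input filter branches that scale with
# scale_npp (no de-interlacing), download to system memory as nv12,
# and label the results [v1]..[vN] for a later xstack.
build_branches() {
    n=$1
    i=0
    chain=""
    while [ "$i" -lt "$n" ]; do
        chain="${chain}[${i}:v]scale_npp=w=768:h=432,hwdownload,format=nv12[v$((i + 1))];"
        i=$((i + 1))
    done
    printf '%s\n' "$chain"
}

# Print the branches for the 6-input mosaic.
build_branches 6
```

The output is meant to replace the six yadif_cuda,scale_cuda lines at the start of the -filter_complex string; the xstack stage and everything after it are not covered by this sketch.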
Here is my original ffmpeg command:
ffmpeg -y -nostats -nostdin -loglevel info -probesize 5M -fflags +genpts -fflags discardcorrupt \
-hwaccel cuda -hwaccel_output_format cuda -c:v h264 -i "udp://@232.228.76.13:10102?fifo_size=114688&buffer_size=851968&timeout=800000&overrun_nonfatal=1" \
-hwaccel cuda -hwaccel_output_format cuda -c:v h264 -i "udp://@225.105.0.11:10102?fifo_size=114688&buffer_size=851968&timeout=800000&overrun_nonfatal=1" \
-hwaccel cuda -hwaccel_output_format cuda -c:v h264 -i "udp://@225.105.0.10:10102?fifo_size=114688&buffer_size=851968&timeout=800000&overrun_nonfatal=1" \
-hwaccel cuda -hwaccel_output_format cuda -c:v h264 -i "udp://@225.105.0.12:10102?fifo_size=114688&buffer_size=851968&timeout=800000&overrun_nonfatal=1" \
-hwaccel cuda -hwaccel_output_format cuda -c:v h264 -i "udp://@225.105.0.14:10102?fifo_size=114688&buffer_size=851968&timeout=800000&overrun_nonfatal=1" \
-hwaccel cuda -hwaccel_output_format cuda -c:v h264 -i "udp://@225.105.0.9:10102?fifo_size=114688&buffer_size=851968&timeout=800000&overrun_nonfatal=1" \
-filter_complex "\
[0:v]yadif_cuda,scale_cuda=w=768:h=432,hwdownload,format=nv12[v1]; \
[1:v]yadif_cuda,scale_cuda=w=768:h=432,hwdownload,format=nv12[v2]; \
[2:v]yadif_cuda,scale_cuda=w=768:h=432,hwdownload,format=nv12[v3]; \
[3:v]yadif_cuda,scale_cuda=w=768:h=432,hwdownload,format=nv12[v4]; \
[4:v]yadif_cuda,scale_cuda=w=768:h=432,hwdownload,format=nv12[v5]; \
[5:v]yadif_cuda,scale_cuda=w=768:h=432,hwdownload,format=nv12[v6]; \
[v1][v2][v3][v4][v5][v6] xstack=inputs=6:layout=0_0|0_h0|w0_0|w0_h0|w0+w3_0|w0+w3_h0[mosiac]; \
[mosiac]hwupload_cuda,scale_cuda=w=1920:h=1080:format=yuv420p:force_original_aspect_ratio=decrease,hwdownload,format=yuv420p,pad=1920:1080:(ow-iw)/2:(oh-ih)/2,hwupload[v]" \
-map "[v]" -map 0:a -map 1:a -map 2:a -map 3:a \
-c:v h264_nvenc -r:v 30000/1001 -b:v 4500k -minrate:v 4500k -maxrate:v 4500k -bufsize:v 9000k -cbr 1 -forced-idr 1 -strict_gop 1 -threads 1 -profile:v high -bf:v 2 -g:v 15 \
-filter:a "aresample=async=1" -c:a aac -ac:a 2 -ar:a 48000 -b:a 192k -vsync 1 \
-f mpegts -muxrate 5818900 -pes_payload_size 1528 "udp://@229.100.100.44:10102?pkt_size=1316&fifo_size=90000&bitrate=5818900&burst_bits=10528"