Multiple FFmpeg-CUDA-HLS-Transcoding Instances -> Deadlock Behavior


I’m using an NVIDIA Quadro P2200 and the latest Ubuntu Linux to transcode multiple multicast streams into HLS. I tried different versions of the NVIDIA driver, including the latest one; right now I’m on 440.33.01, headless. Transcoding works flawlessly with CPU encoding and libx264. But when I switch to CUDA, 2 of 3 processes deadlock after some time. I opened three bash shells and made a screencast to show you the problem. At minute six you will notice that the first stream stops doing anything.

Because I’m using scale_npp, I built ffmpeg myself:

./configure --enable-libx264 --enable-cuvid --enable-gpl --enable-libnpp --enable-cuda --disable-cuda-sdk --enable-nonfree --extra-cflags=-I/usr/local/cuda-10.2/include --extra-ldflags=-L/usr/local/cuda-10.2/lib64 && make -j 8

I tried different combinations of CUDA and driver versions and the behavior was the same everywhere. I also tried different ffmpeg commands with the same result. How can I get HLS encoding working? A single transcoding process works. If I run more than one ffmpeg process, they stop one after another until only a single process is still working. In other words, all the other transcoding processes seem to deadlock. I looked into the thread stacks with gdb -p pid, but it did not help. How can I fix this issue?

ffmpeg version N-97495-g2594f6a362 Copyright © 2000-2020 the FFmpeg developers
built with gcc 9 (Ubuntu 9.2.1-9ubuntu2)
configuration: --enable-libx264 --enable-cuvid --enable-gpl --enable-libnpp --enable-cuda --disable-cuda-sdk --enable-nonfree --extra-cflags=-I/usr/local/cuda-10.2/include --extra-ldflags=-L/usr/local/cuda-10.2/lib64
libavutil 56. 43.100 / 56. 43.100
libavcodec 58. 82.100 / 58. 82.100
libavformat 58. 42.101 / 58. 42.101
libavdevice 58. 9.103 / 58. 9.103
libavfilter 7. 79.100 / 7. 79.100
libswscale 5. 6.101 / 5. 6.101
libswresample 3. 6.100 / 3. 6.100
libpostproc 55. 6.100 / 55. 6.100
Hyper fast Audio and Video encoder

-vsync 0
-loglevel debug
-threads:v 1
-threads:a 1
-filter_threads 1
-thread_queue_size 1024
-hwaccel cuda
-hwaccel_device 0
-hwaccel_output_format cuda
-deint adaptive
-filter_complex "[v:0]split=4[temp1][temp2][source][temp3];[temp1]scale_npp=858:480[480p];[temp2]scale_npp=640:360[wide360p];[temp3]scale_npp=426:240[240p]"
-g 50 -sc_threshold 0
-map [wide360p]
-preset medium
-c:v:0 h264_nvenc
-preset fast
-profile:v baseline
-b:v:0 600k
-bufsize 24k
-minrate 400k -maxrate 600k
-map [480p]
-c:v:1 h264_nvenc
-preset medium
-profile:v baseline
-b:v:1 1000k
-bufsize 56k
-minrate 800k -maxrate 1600k
-preset fast
-map [source]
-c:v:2 h264_nvenc
-preset medium
-profile:v baseline
-preset fast
-b:v:2 3600k
-minrate 2000k -maxrate 4000k
-bufsize 144k
-map [240p]
-c:v:3 h264_nvenc
-preset medium
-profile:v baseline
-zerolatency 1
-preset fast
-b:v:3 400k
-bufsize 16k
-map a:0
-c:a aac
-b:a 128k
-ac 2
-map a:1
-c:a aac
-b:a 96k
-ac 2
-f hls
-hls_time 4
-hls_list_size 0
-hls_flags append_list
-hls_allow_cache 0
-hls_playlist_type event
-master_pl_name $MASTER_PLAYLIST_NAME
-var_stream_map "a:0,agroup:audio,default:yes,language:DEU a:1,agroup:audio,language:FR v:0,agroup:audio v:1,agroup:audio, v:2,agroup:audio, v:3,agroup:audio"
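
A side note on the command above: -preset, -profile:v, -bufsize, -minrate and -maxrate are given without a stream index, and -preset appears several times. ffmpeg documents that an option without a stream index applies to all matching streams of the output, and as far as I can tell the last occurrence wins, so all four renditions may end up sharing the final -preset fast and -bufsize 16k. Whether this has anything to do with the stalls is unknown. A minimal sketch of generating the arguments with an explicit index per rendition (the helper name rendition_args is hypothetical; the values come from the first rendition above):

```python
from typing import Dict, List


def rendition_args(index: int, label: str, opts: Dict[str, str]) -> List[str]:
    """Build ffmpeg arguments for one video output stream.

    Every option gets an explicit ":v:<index>" stream specifier so that it
    only targets this rendition and not the others in the same output.
    """
    args = ["-map", "[%s]" % label, "-c:v:%d" % index, "h264_nvenc"]
    for name, value in opts.items():
        args += ["-%s:v:%d" % (name, index), value]
    return args


# Values taken from the first rendition of the command above.
print(rendition_args(0, "wide360p", {
    "preset": "fast", "profile": "baseline", "b": "600k",
    "bufsize": "24k", "minrate": "400k", "maxrate": "600k",
}))
```

The returned list can be concatenated for all four renditions and passed to subprocess together with the input, filter and HLS options.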

I rebuilt ffmpeg, taking the build configuration from the official document Using_FFmpeg_with_NVIDIA_GPU_Hardware_Acceleration_v01.4.pdf.

ffmpeg version N-97515-gd813e43b3d Copyright © 2000-2020 the FFmpeg developers
built with gcc 8 (Ubuntu 8.4.0-1ubuntu1~19.10)
configuration: --enable-nonfree --enable-cuda-nvcc --enable-libnpp --extra-cflags=-I/usr/local/cuda/include --extra-ldflags=-L/usr/local/cuda/lib64

But it does not make any difference. Parallel transcoding still fails: only one process keeps working and the others stop. I actually don’t know how to get this to work properly. It’s very disappointing right now. I still hope to find a solution for this behavior.
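
To pin down exactly when an instance stops, one option is to watch the modification time of each playlist instead of the terminal output. A minimal sketch, assuming every ffmpeg instance writes its own .m3u8 file; the function name, the stream names and the paths are hypothetical, and a threshold of about three times -hls_time (12 s for the command above) is only a guess:

```python
import os
import time
from typing import Dict, List, Optional


def stalled_playlists(playlists: Dict[str, str], max_age_s: float,
                      now: Optional[float] = None) -> List[str]:
    """Return the names of streams whose playlist was not rewritten recently.

    playlists maps a stream name to the path of its .m3u8 file.
    A missing or unreadable playlist counts as stalled.
    """
    if now is None:
        now = time.time()
    stalled = []
    for name, path in playlists.items():
        try:
            age = now - os.path.getmtime(path)
        except OSError:
            stalled.append(name)  # file missing or unreadable
            continue
        if age > max_age_s:
            stalled.append(name)
    return stalled
```

Calling stalled_playlists({"stream1": "/path/to/stream1.m3u8"}, 12) from a loop or a cron job gives a timestamp for each stall that can be lined up with the -loglevel debug output.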

Did you check if vmem fills up while transcoding?

I checked it just now. The system has 64 GB of RAM and a small 1 GB swap disk, but it’s not being used.

I meant video memory, not system memory. Use nvidia-smi to check usage.
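
For logging video memory over time rather than reading the table by eye, nvidia-smi also has a CSV query mode. A small sketch, assuming nvidia-smi is on the PATH; the function names are illustrative:

```python
import subprocess
from typing import List, Tuple


def parse_gpu_memory(csv_text: str) -> List[Tuple[int, int]]:
    """Parse 'used, total' MiB pairs, one line per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory() -> List[Tuple[int, int]]:
    """Ask nvidia-smi for used/total video memory of every GPU (needs the NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_gpu_memory(out)
```

Polling query_gpu_memory() every few seconds while the three instances run would show whether usage climbs toward the total before a process stalls.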

Ah, I thought you meant virtual RAM. You can see in the YouTube video that I opened nvidia-smi at the beginning. And yes, video RAM gets allocated, and when a transcoding process freezes, its memory is still in use.

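+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82       Driver Version: 440.82       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro P2200        Off  | 00000000:08:00.0 Off |                  N/A |
| 50%   42C    P0    22W /  75W |    598MiB /  5059MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     13736      C   /home/hls/ffmpeg/ffmpeg                      196MiB |
|    0     14476      C   /home/hls/ffmpeg/ffmpeg                      196MiB |
|    0     14523      C   /home/hls/ffmpeg/ffmpeg                      196MiB |
+-----------------------------------------------------------------------------+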

Process 13736 is not transcoding anymore, but its memory is still in use. I don’t know what’s happening internally, because I can’t look into libcuda with gdb.
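
Where gdb cannot see into libcuda, the kernel's view of each thread can still narrow things down. A rough sketch, assuming Linux and read access to /proc (the function name thread_states and the proc_root parameter are illustrative). It counts thread states: D (uninterruptible sleep) means the thread is blocked inside a kernel or driver call, while S (interruptible sleep) is what a thread waiting on a lock or on I/O normally shows.

```python
import os
from collections import Counter


def thread_states(pid: int, proc_root: str = "/proc") -> Counter:
    """Count the scheduler states of all threads of a process (Linux only).

    Reads the "State:" line of <proc_root>/<pid>/task/<tid>/status,
    e.g. "S (sleeping)" or "D (disk sleep)". Threads that vanish while
    being read are skipped.
    """
    states = Counter()
    task_dir = os.path.join(proc_root, str(pid), "task")
    for tid in os.listdir(task_dir):
        try:
            with open(os.path.join(task_dir, tid, "status")) as status:
                for line in status:
                    if line.startswith("State:"):
                        states[line.split(":", 1)[1].strip()] += 1
                        break
        except OSError:
            continue
    return states
```

Comparing thread_states(13736) for the stuck process with the result for a healthy instance may hint at whether the hang sits in the driver or in user-space locking.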

I rather suspected an out-of-vmem condition; people have had problems with that before. That isn’t the case, though. I don’t know whether a gcc version >8 has any influence on it. Did you ask the ffmpeg developers?

I rebuilt ffmpeg with gcc 8. I will try another gcc version later today. I just posted my issue to the ffmpeg user list. How can I verify or rule out that it is an out-of-vmem problem?

(I will search the forum for out of vmem; maybe I’ll find something interesting.)

It is not a problem with video memory: nvidia-smi shows only 598MiB of 5059MiB in use.

Generix, sorry, I was not concentrating when I read your post, so I missed that you had already ruled out vmem. I need to improve my English reading skills in technical matters. I have now rebuilt ffmpeg with gcc 7.5.0, with the same results.

Configured with: …/src/configure -v --with-pkgversion=‘Ubuntu 7.5.0-3ubuntu1~19.10’ --with-bugurl=file:///usr/share/doc/gcc-7/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++ --prefix=/usr --with-gcc-major-version-only --program-suffix=-7 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~19.10)

Maybe some general advice: since you’re running headless, please make sure nvidia-persistenced is started on boot and is continuously running.
Since you’re using multiple ffmpeg processes, try using MPS.
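
The nvidia-smi output posted earlier shows Persistence-M as Off. A quick way to re-check this after starting nvidia-persistenced is nvidia-smi's CSV query mode. A small sketch, assuming nvidia-smi is on the PATH and reports the mode as Enabled or Disabled; the function names are illustrative:

```python
import subprocess
from typing import List


def parse_persistence(csv_text: str) -> List[bool]:
    """Turn nvidia-smi's per-GPU 'Enabled'/'Disabled' lines into booleans."""
    return [line.strip() == "Enabled"
            for line in csv_text.splitlines() if line.strip()]


def persistence_enabled() -> List[bool]:
    """Query the persistence mode of every GPU (needs the NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=persistence_mode", "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_persistence(out)
```

If persistence_enabled() still returns [False] after a reboot, the daemon is probably not being started.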

Thanks, I’m reading up on it right now.

It seems that the Pascal architecture of the P2200 is not supported by MPS for video transcoding. The MPS documentation says:

The NVIDIA Codec SDK is not supported under MPS on pre-Volta MPS clients.

If I try to run an ffmpeg instance with the MPS server running on the P2200 in exclusive mode, I get:

[Parsed_scale_cuda_1 @ 0x55ff024f2cc0] auto-inserting filter 'auto_scaler_0' between the filter 'Parsed_split_0' and the filter 'Parsed_scale_cuda_1'
Impossible to convert between the formats supported by the filter 'Parsed_split_0' and the filter 'auto_scaler_0'
Error reinitializing filters!
Failed to inject frame into filter network: Function not implemented
Error while processing the decoded data for stream #0:3
[aac @ 0x55ff00c6ed40] Qavg: 205.839
[aac @ 0x55ff00c6ed40] 2 frames left in the queue on closing
[aac @ 0x55ff00c44680] Qavg: 207.565
[aac @ 0x55ff00c44680] 2 frames left in the queue on closing
[AVIOContext @ 0x55ff00be41c0] Statistics: 5846988 bytes read, 0 seeks
Conversion failed!

So MPS is not a solution, but it was a very interesting read and at least worth a try. So what’s the reason for my issue? Is it possibly a bug in ffmpeg’s vmem allocation?

Hello Nvidia, any hints?

I don’t know whether Nsight Compute can help with NVENC debugging. Did you look into it?