FFmpeg mosaic with hardware acceleration on Quadro P4000

Hi everyone, I’m currently trying to build a mosaic of 4 videos using FFmpeg with hardware acceleration.

I’m working on Ubuntu 16.04. I’ve installed the NVIDIA drivers from the apt repo, as well as CUDA. I’m using FFmpeg 3.3.

I’ve managed to create it without hardware acceleration, but I’m stuck when enabling it. Here is the command line I use:

ffmpeg -hwaccel cuvid -c:v h264_cuvid \
-i bbb_sunflower_2160p_30fps_normal.mp4 \
-i bbb_sunflower_2160p_30fps_normal.mp4 \
-i bbb_sunflower_2160p_30fps_normal.mp4 \
-i bbb_sunflower_2160p_30fps_normal.mp4 \
-filter_complex "nullsrc=size=1440x960 [base]; \
[0:v] hwdownload, format=nv12, scale_npp=720x480 [upperleft]; \
[1:v] hwdownload, format=nv12, scale_npp=720x480 [upperright]; \
[2:v] hwdownload, format=nv12, scale_npp=720x480 [lowerleft]; \
[3:v] hwdownload, format=nv12, scale_npp=720x480 [lowerright]; \
[base][upperleft] overlay=shortest=1 [tmp1]; \
[tmp1][upperright] overlay=shortest=1:x=720 [tmp2]; \
[tmp2][lowerleft] overlay=shortest=1:y=480 [tmp3]; \
[tmp3][lowerright] overlay=shortest=1:x=720:y=480; \
hwupload_cuda" \
-c:v h264_nvenc -f matroska pipe:1 | ffplay -i -

It doesn’t work, and I get:

Cannot find a matching stream for unlabeled input pad 0 on filter Parsed_hwupload_cuda_17

I think the issue is with uploading the filtered frames back to the card, but I can’t figure out how to do it. I’ve tried adding [base] before hwupload_cuda, but it doesn’t help, and honestly, I’m a bit lost.

Moreover, does anyone know if the GPU can create a mosaic directly, without going through a third-party tool like FFmpeg?

It looks like you’re using 4 inputs for overlays but you don’t have an input that you’re applying the overlays to. Try adding
-i color=black:1440x960
FFmpeg Filters Documentation

Oh, missed the nullsrc, forget what I said.

Yes, I create a nullsrc but I don’t know how to inject it into the gpu.

Maybe you have to name the overlay output stream:

...[lowerright] overlay=shortest=1:x=720:y=480[combo]; \
[combo]hwupload_cuda[combo]" \...
-map "[combo]"
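For anyone hitting the same "unlabeled input pad" message: in filtergraph syntax a ';' ends a filter chain, so the hwupload_cuda that follows it starts a new chain with nothing connected to its input. Below is a minimal sketch of the two usual ways to connect it, written as shell variables so the two strings can be compared side by side. The [out] label is a placeholder name of my own, not something from this thread.

```shell
#!/bin/sh
# Two ways to feed the last overlay into hwupload_cuda.
# ';' ends a filter chain, so a filter after it needs an explicit input label.
# ',' continues the same chain and passes the frames along implicitly.

# Form 1: label the overlay output, then consume that label in a new chain.
# [out] is a hypothetical label chosen for this sketch.
tail_labeled='[tmp3][lowerright] overlay=shortest=1:x=720:y=480 [combo]; [combo] hwupload_cuda [out]'

# Form 2: stay in the same chain with a comma, no extra labels needed.
tail_chained='[tmp3][lowerright] overlay=shortest=1:x=720:y=480, hwupload_cuda'

# Print both so they can be pasted at the end of a -filter_complex string.
printf '%s\n' "$tail_labeled" "$tail_chained"
```

With the labelled form, the result should then be selected with -map "[out]". With the comma form, the single unlabelled output should be picked up by the first output file without an explicit -map.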

Thank you for your help, it may be one step closer to the solution. Now I get this error:

Impossible to convert between the formats supported by the filter 'Parsed_format_2' and the filter 'auto_scaler_0'

I suppose I have to add a format=XXX that is compatible with the GPU, before or after hwupload_cuda.

I think this rather points to

hwdownload, format=nv12, scale_npp=720x480

scale_npp probably doesn’t like nv12.
So either use

hwdownload, format=nv12,format=XXX, scale_npp=720x480

or somehow use resize instead of scale_npp to do the scaling in cuvid, which would be smarter.

On second look, scale_npp already uses CUDA, so isn’t the hwdownload counterproductive?

Indeed, scale_npp uses CUDA, but I think it runs in user space as a library and not directly over PCIe, so I believe the frames have to be converted to a virtual format.
But you’re right, the problem may be the nv12 format, which may not be hardware compliant.

Did you test this?

scale_npp=720x480, hwdownload, format=nv12

I’ve just tried this, but I get an error about a format not supported by the filter.

[tmp3][lowerright] overlay=shortest=1:x=720:y=480, hwupload_cuda"

Someone advised me to do this, but I still get that error:

Impossible to convert between the formats supported by the filter 'Parsed_format_2' and the filter 'auto_scaler_0'

OK, so the final answer is:

ffmpeg \
-hwaccel cuvid -c:v h264_cuvid -i bbb_sunflower_2160p_30fps_normal.mp4 \
-hwaccel cuvid -c:v h264_cuvid -i bbb_sunflower_2160p_30fps_normal.mp4 \
-hwaccel cuvid -c:v h264_cuvid -i bbb_sunflower_2160p_30fps_normal.mp4 \
-hwaccel cuvid -c:v h264_cuvid -i bbb_sunflower_2160p_30fps_normal.mp4 \
-filter_complex "nullsrc=size=1440x960 [base]; \
[0:v] scale_npp=720:480, hwdownload, format=nv12 [upperleft]; \
[1:v] scale_npp=720:480, hwdownload, format=nv12 [upperright]; \
[2:v] scale_npp=720:480, hwdownload, format=nv12 [lowerleft]; \
[3:v] scale_npp=720:480, hwdownload, format=nv12 [lowerright]; \
[base][upperleft] overlay=shortest=1 [tmp1]; \
[tmp1][upperright] overlay=shortest=1:x=720 [tmp2]; \
[tmp2][lowerleft] overlay=shortest=1:y=480 [tmp3]; \
[tmp3][lowerright] overlay=shortest=1:x=720:y=480, hwupload_cuda" \
-c:v h264_nvenc -f matroska pipe:1 | ffplay -i -

Thank you generix, you were right about the position of scale_npp!
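For reference, the working graph follows one pattern per tile: scale on the GPU with scale_npp while the frames are still in CUDA memory, hwdownload to system memory, overlay in software, then hwupload_cuda once at the end for h264_nvenc. Input options such as -hwaccel cuvid -c:v h264_cuvid apply only to the next -i, which is why the final command repeats them for every input. Here is a rough sketch of a helper that generates the same kind of graph for any grid size. The function name build_mosaic_graph and the [tileN]/[tmpN] labels are made up for this example, and it only prints the string, it does not run ffmpeg.

```shell
#!/bin/sh
# Hypothetical helper: print a -filter_complex string for a COLS x ROWS mosaic.
# Usage: build_mosaic_graph COLS ROWS TILE_W TILE_H
build_mosaic_graph() {
    cols=$1; rows=$2; tile_w=$3; tile_h=$4
    total=$((cols * rows))

    # Blank canvas large enough to hold every tile.
    graph="nullsrc=size=$((cols * tile_w))x$((rows * tile_h)) [base];"

    # One chain per input: GPU scale, download to system memory, force nv12.
    i=0
    while [ "$i" -lt "$total" ]; do
        graph="$graph [$i:v] scale_npp=$tile_w:$tile_h, hwdownload, format=nv12 [tile$i];"
        i=$((i + 1))
    done

    # Overlay the tiles one by one, row-major, onto the canvas.
    prev=base
    i=0
    while [ "$i" -lt "$total" ]; do
        x=$(( (i % cols) * tile_w ))
        y=$(( (i / cols) * tile_h ))
        if [ "$i" -eq $((total - 1)) ]; then
            # Last overlay: stay in the same chain and upload back to the GPU.
            graph="$graph [$prev][tile$i] overlay=shortest=1:x=$x:y=$y, hwupload_cuda"
        else
            graph="$graph [$prev][tile$i] overlay=shortest=1:x=$x:y=$y [tmp$i];"
            prev="tmp$i"
        fi
        i=$((i + 1))
    done

    printf '%s\n' "$graph"
}

# Same layout as the command above: 2x2 tiles of 720x480.
build_mosaic_graph 2 2 720 480
```

The output could then be passed as -filter_complex "$(build_mosaic_graph 2 2 720 480)", with one -hwaccel cuvid -c:v h264_cuvid -i ... group per tile in front of it, as in the final command.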