A16 Capacity decrease when using multiple GPUs

When I use a single GPU (NVDEC) on the A16, I can de-interlace about 25 streams. In theory, I should therefore be able to de-interlace about 100 streams across the card; however, ffmpeg reports the FPS dropping from about 30-32 to 10-12. If I use all 4 GPUs, I can only de-interlace about 15 streams in total.

I have noticed that the PCIe bandwidth drops for both Rx and Tx: Rx drops from 95 Mbps to 13-17 Mbps, and Tx drops from about 3G to 700 Mbps.

The command below will maintain about 30fps for 105x 1080i streams:
ffmpeg -y -hwaccel cuvid -hwaccel_device 0 -c:v h264_cuvid -i /tmp/ramdisk/1080i-H.264.mp4 -filter_complex 'yadif_cuda=0:-1:0' -f null -

Once I start to download the YUV frames, the fps drops significantly (to the 10-12 range):
ffmpeg -y -hwaccel cuvid -hwaccel_device 0 -c:v h264_cuvid -i /tmp/ramdisk/1080i-H.264.mp4 -filter_complex 'yadif_cuda=0:-1:0,hwdownload,format=nv12' -f null -

System configuration:
FFmpeg versions: 6.0, 5.0, 4.3.1
GStreamer: 1.18.0
Ubuntu: Ubuntu 20.04.6 LTS (GNU/Linux 5.4.0-146-generic x86_64)
CUDA: 12.1
Driver: 525.85.12

Hello there,

To understand your question better, when you run:

ffmpeg -y -hwaccel cuvid -hwaccel_device 0 -c:v h264_cuvid -i /tmp/ramdisk/1080i-H.264.mp4 -filter_complex 'yadif_cuda=0:-1:0' -f null -

Can it sustain 105 simultaneous sessions on a single GPU, as implied by the -hwaccel_device 0 selection above?

If so, then:

For the second command, when you download the YUV frames, can you try these alternatives and report back on throughput performance at the initial load (105 per GPU)? I want to confirm something.

Try either:

1. Deinterlacing with the CUVID wrapper:

ffmpeg -y -hwaccel cuvid -hwaccel_device 0 -c:v h264_cuvid -deint 2 -drop_second_field 0 -i /tmp/ramdisk/1080i-H.264.mp4 -vf 'hwdownload,format=nv12' -f null -

2. Deinterlacing with yadif_cuda directly after using nvdec:

ffmpeg -y -threads:v 1 -extra_hw_frames 3 -hwaccel nvdec -hwaccel_output_format cuda -hwaccel_device 0 -c:v h264_cuvid -i /tmp/ramdisk/1080i-H.264.mp4 -filter_complex 'yadif_cuda=0:-1:0,hwdownload,format=nv12' -f null -

3. Disabling hwaccels altogether and deinterlacing in H/W only:

ffmpeg -y -i /tmp/ramdisk/1080i-H.264.mp4 -filter_complex 'hwupload_cuda=0,yadif_cuda=0:-1:0,hwdownload,format=nv12' -f null -

Secondly, from your nvidia-smi output, the Rx and Tx values look normal, as the GPU load(s) target specific GPUs on the card. Note that the A16 presents as 4x GPUs per board.

Thanks, I will give that a try.

To be clear, it is 104x (not 105) across the 4 GPUs in the A16, so I have 26x streams with -hwaccel_device 0, 26x with -hwaccel_device 1, 26x with -hwaccel_device 2 and 26x with -hwaccel_device 3.
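
The 4 x 26 layout described above can be sketched as a small launcher script. This is only a minimal sketch under stated assumptions, not the exact script used for the tests: the names GPUS, STREAMS_PER_GPU, INPUT, DRY_RUN, launch and launch_all are made up for illustration, and by default the script only prints the ffmpeg command lines instead of starting them.

```shell
#!/bin/sh
# Sketch: spread 26 de-interlace sessions over each of the A16's 4 GPUs.
# All variable and function names here are illustrative assumptions.
GPUS=${GPUS:-4}
STREAMS_PER_GPU=${STREAMS_PER_GPU:-26}
INPUT=${INPUT:-/tmp/ramdisk/1080i-H.264.mp4}
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands, 0 = actually start ffmpeg

# Print the command (dry run) or start it in the background.
launch() {
  if [ "$DRY_RUN" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@" </dev/null >/dev/null 2>&1 &
  fi
}

# One session per (GPU, stream) pair: GPUS * STREAMS_PER_GPU sessions in total.
launch_all() {
  gpu=0
  while [ "$gpu" -lt "$GPUS" ]; do
    n=0
    while [ "$n" -lt "$STREAMS_PER_GPU" ]; do
      launch ffmpeg -y -hwaccel cuvid -hwaccel_device "$gpu" -c:v h264_cuvid \
        -i "$INPUT" \
        -filter_complex 'yadif_cuda=0:-1:0,hwdownload,format=nv12' -f null -
      n=$((n + 1))
    done
    gpu=$((gpu + 1))
  done
}

launch_all
wait   # returns immediately in a dry run; otherwise waits for every session
```

With the defaults this prints 104 command lines, 26 per -hwaccel_device value. Setting DRY_RUN=0 starts the sessions for real, and lowering GPUS or STREAMS_PER_GPU makes it easy to compare a single GPU against the full board.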

Results:
Deinterlacing with the CUVID wrapper:

  • Runs at 0.06x speed, fps ~ 4

Deinterlacing with yadif_cuda directly after using nvdec:

  • Runs at 0.16x speed, fps ~ 4

Disabling hwaccels altogether and deinterlacing in H/W only:

  • Does not run. I get the following error: "Invalid output format nv12 for hwframe download." I am not sure why it fails with this command.