Performance limit at around 2500 fps?

I have reached a performance limit on a single computer using multiple GPUs. There seems to be a cap at around 2500 fps.
My test server is a Threadripper 2950X with three (3) Quadro P5000.

I run 20 concurrent encodes on a single Quadro P5000 and get 65 fps per session => 1300 fps total.
Then I run another 20 concurrent encodes on the second Quadro P5000 in the same PC and get 62.5 fps per session => 2500 fps total (and not 2600).
Finally I run another 20 concurrent encodes on the third Quadro P5000 in the same PC and get 41 fps per session => 2460 fps total (and not 3900).
So adding a third Quadro P5000 to the same PC offers no performance improvement at all!!!
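For anyone who wants to redo the arithmetic with their own numbers, here is a minimal sketch. The helper names total_fps and scaling_pct are made up for illustration; the inputs are the per-session rates quoted above.

```shell
#!/bin/bash
# Aggregate fps = sessions per GPU * number of GPUs * fps per session.
# $1 = sessions per GPU, $2 = number of GPUs, $3 = measured fps per session
total_fps() {
    awk -v s="$1" -v g="$2" -v f="$3" 'BEGIN { printf "%d\n", s * g * f }'
}

# Per-session rate with N GPUs as a percentage of the single-GPU rate.
# $1 = fps per session with N GPUs, $2 = fps per session with one GPU
scaling_pct() {
    awk -v n="$1" -v b="$2" 'BEGIN { printf "%.0f\n", 100 * n / b }'
}

total_fps 20 1 65     # one GPU
total_fps 20 2 62.5   # two GPUs
total_fps 20 3 41     # three GPUs
scaling_pct 41 65     # three-GPU per-session rate vs. single GPU
```

With ideal scaling the last figure would stay close to 100; here each session on three GPUs runs at roughly two thirds of the single-GPU rate.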

Setting encoding quality to fast preset doesn’t change the result (tests were done with medium preset), still capped at around 2500 fps. CPU usage is very low, so this is not a bottleneck.

Any thoughts from Nvidia?

PS: Driver version is 418.56, Linux Ubuntu 18.04 LTS, each session is using ffmpeg latest git.

Show us the output of:

nvidia-smi topo --matrix

Here it is:

        GPU0    GPU1    GPU2    CPU Affinity
GPU0     X      PHB     PHB     0-31
GPU1    PHB      X      PHB     0-31
GPU2    PHB     PHB      X      0-31

A lot of information is missing: frame resolution (1080p?), pixel format (RGBA? YUV?), source of the video (encoding or transcoding)?
I suspect you are hitting the memory bandwidth of the CPU memory channels (see https://en.wikichip.org/wiki/amd/ryzen_threadripper/2950x) through PCIe DMA access.
Theoretically that is 87.42 GiB/s (if the memory modules are placed correctly), while the traffic would be 2500 * 1920 * 1080 * 4 / 1024 / 1024 / 1024 = 19 GiB/s * 2 (data in/data out) + interrupt latency + software latency … (compare with “Sandra 2018 Titanium’s memory bandwidth test”, max 40 GiB/s (58core https://www.pcworld.com/article/3298859/how-memory-bandwidth-is-killing-amds-32-core-threadripper-performance.html)).
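The estimate above can be reproduced with a small sketch like the one below. The function name frame_bandwidth_gib is made up for illustration, and the result only applies if uncompressed frames really cross system memory, which is exactly the open question. The second call assumes NV12 frames at 1.5 bytes per pixel instead of 4.

```shell
#!/bin/bash
# Estimated one-way bandwidth in GiB/s for uncompressed frames.
# $1 = frames per second, $2 = width, $3 = height, $4 = bytes per pixel
frame_bandwidth_gib() {
    awk -v fps="$1" -v w="$2" -v h="$3" -v bpp="$4" \
        'BEGIN { printf "%.1f\n", fps * w * h * bpp / (1024 * 1024 * 1024) }'
}

frame_bandwidth_gib 2500 1920 1080 4     # 4 bytes per pixel, as in the estimate
frame_bandwidth_gib 2500 1920 1080 1.5   # assumption: NV12, 12 bits per pixel
```

Doubling the result gives the data in/data out figure used in the estimate.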

This is the script I am testing with:

#!/bin/bash
DEVICEID=$1
for i in `seq 1 20` ;
do
ffmpeg -nostdin -loglevel error -stats \
-hwaccel_device $DEVICEID -hwaccel cuvid -c:v h264_cuvid -surfaces 12 \
-f mpegts -i input_1080i.ts \
-vf yadif_cuda=1:-1:1,scale_npp=w=1280:h=720 \
-c:v h264_nvenc -preset medium \
-refs 3 -bf 3 -rc-lookahead 30 \
-b_ref_mode middle -temporal-aq 1 \
-acodec copy -f mpegts -y /dev/null &
done
wait
echo done

Input is a 1080i50 h264 mpegts file.
You can get it from http://207.154.237.57/files/input_1080i.ts

Decoding -> deinterlacing -> scaling -> encoding all happens on the GPU, so I don’t see any memory bandwidth limit anywhere.

Removing “-refs 3 -bf 3 -rc-lookahead 30 -b_ref_mode middle -temporal-aq 1” changes the numbers, but still not all GPU power is used: running on only one GPU gives 86 * 20 = 1720 fps, running on two GPUs gives 80 * 40 = 3200 fps, and running on three GPUs gives 52 * 60 = 3120 fps. So we observe a performance degradation when adding 20 more instances on a separate third GPU.

So, to correct my original post, it is not a hard limit of 2500 fps, but there is a limit somewhere running multiple instances.

Changing scaling to scale_npp=w=704:h=576 does not change the results at all, so the bottleneck is not in the nvenc part. When running all 60 instances, checking with nvidia-smi dmon shows underutilization of nvdec (it reaches only 60% on all three GPUs).
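To put a number on the decoder utilization instead of eyeballing it, something like the following sketch could average the dec column per GPU. It assumes the output layout of "nvidia-smi dmon -s u" (a "# gpu sm mem enc dec" header followed by one row per GPU per sample); the function name avg_dec_util is made up for illustration.

```shell
#!/bin/bash
# Average the "dec" column of `nvidia-smi dmon -s u` output per GPU.
# Reads dmon output on stdin; the column is located by name from the header.
avg_dec_util() {
    awk '
        # Header line ("# gpu sm mem enc dec"): find the dec column.
        # Data rows have no leading "#", so their index is one lower.
        $1 == "#" && $2 == "gpu" {
            for (i = 2; i <= NF; i++) if ($i == "dec") col = i - 1
            next
        }
        /^#/ { next }    # skip the units line
        # Only count rows with a numeric value (dmon prints "-" if unsupported).
        col && $col ~ /^[0-9]+$/ { sum[$1] += $col; n[$1]++ }
        END { for (g in sum) printf "GPU%s dec %.0f%%\n", g, sum[g] / n[g] }
    ' | sort
}
```

While the 60 instances are running it could be used as: nvidia-smi dmon -s u -c 10 | avg_dec_util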

Did a simpler test that only tests decoder+filters, here is the sample script:

#!/bin/bash
DEVICEID=$1
for i in `seq 1 20` ;
do 
ffmpeg -loglevel fatal -stats \
-hwaccel cuvid -hwaccel_device $DEVICEID \
-c:v h264_cuvid -surfaces 12  \
-f mpegts -i input_1080i.ts \
-vf yadif_cuda=1:-1:1,scale_npp=w=1280:h=720 \
-f null -y /dev/null &
done
wait
echo done

I even increased the sessions to 30. There is no limit in this test: nvdec is utilized 100% on all three GPUs, and there is no dramatic reduction when running 30, 60 or 90 sessions on one, two or three GPUs concurrently.

Results of the above test:
20 sessions on one GPU, 20 * 90 => 1800 fps
40 sessions on two GPUs, 40 * 88 => 3520 fps
60 sessions on three GPUs, 60 * 86 => 5160 fps

So the issue must be on multiple nvenc sessions.

The issue seems to be something in the ffmpeg code.

Before the following three commits, there is no problem.



@langdalepl: If I remember well there was a ticket in trac from a user mentioning the same issue. I can’t find it right now.

Applying the following patch to current git fixes the issue, but I don’t know how to explain why.

--- ffmpeg/libavcodec/nvenc.c	2019-04-08 20:53:19.745925070 +0300
+++ ffmpeg/libavcodec/nvenc.c	2019-04-08 20:55:51.619074973 +0300
@@ -1846,13 +1846,6 @@
                 res = nvenc_print_error(avctx, nv_status, "Failed unmapping input resource");
                 goto error;
             }
-            nv_status = p_nvenc->nvEncUnregisterResource(ctx->nvencoder, ctx->registered_frames[tmpoutsurf->reg_idx].regptr);
-            if (nv_status != NV_ENC_SUCCESS) {
-                res = nvenc_print_error(avctx, nv_status, "Failed unregistering input resource");
-                goto error;
-            }
-            ctx->registered_frames[tmpoutsurf->reg_idx].ptr = NULL;
-            ctx->registered_frames[tmpoutsurf->reg_idx].regptr = NULL;
         } else if (ctx->registered_frames[tmpoutsurf->reg_idx].mapped < 0) {
             res = AVERROR_BUG;
             goto error;

Hmm, could you open a ticket on trac.ffmpeg.org with this?
This is a serious regression.

@brainiarc7: There was a ticket on trac for the same issue, but the original author didn’t explain the solution well for current git and it went unnoticed. I can’t find it right now, but I will, and then I’ll reopen it.

Cool, thanks.
Remember to propose the patch above.

Found it, this was the ticket:
http://trac.ffmpeg.org/ticket/7674

Will update its status there.

@malakudi any pointers on how to capture multiple RTSP streams simultaneously and decode the frames? Should I spawn a new CPU process for each?