Performance limit at around 2500 fps?

malakudi · April 8, 2019, 6:41am

I have reached a performance limit on a single computer using multiple GPUs. There seems to be a limit at around 2500 fps.
My test server is a Threadripper 2950X with three (3) Quadro P5000.

I run 20 concurrent encodes on a single Quadro P5000 and get 65 fps per session => 1300 fps total.
Then I run another 20 concurrent encodes on the second Quadro P5000 in the same PC and get 62.5 fps per session => 2500 fps total (and not 2600).
Finally I run another 20 concurrent encodes on the third Quadro P5000 in the same PC and get 41 fps per session => 2460 fps total (and not 3900).
So adding a third Quadro P5000 to the same PC offers no performance improvement at all!
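
To make the scaling loss explicit, the totals above can be recomputed with a small helper. This is only a sketch: the function names total_fps and scaling_pct are made up for illustration, and "ideal" assumes perfectly linear scaling from the single-GPU result of 1300 fps.

```shell
#!/bin/bash
# Hypothetical helpers (not part of ffmpeg or any NVIDIA tool).

# total_fps SESSIONS FPS_PER_SESSION -> aggregate fps
total_fps() {
    awk -v n="$1" -v f="$2" 'BEGIN { printf "%.0f\n", n * f }'
}

# scaling_pct GPUS MEASURED_TOTAL SINGLE_GPU_TOTAL -> percent of ideal linear scaling
scaling_pct() {
    awk -v g="$1" -v m="$2" -v b="$3" 'BEGIN { printf "%.1f\n", 100 * m / (g * b) }'
}

total_fps 20 65            # one GPU    -> 1300
total_fps 40 62.5          # two GPUs   -> 2500
total_fps 60 41            # three GPUs -> 2460
scaling_pct 2 2500 1300    # -> 96.2
scaling_pct 3 2460 1300    # -> 63.1
```

With these numbers, two GPUs reach about 96% of linear scaling, while three GPUs drop to about 63%.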

Setting encoding quality to fast preset doesn’t change the result (tests were done with medium preset), still capped at around 2500 fps. CPU usage is very low, so this is not a bottleneck.

Any thoughts from Nvidia?

PS: Driver version is 418.56, OS is Ubuntu Linux 18.04 LTS, and each session uses ffmpeg built from the latest git.

brainiarc7 · April 8, 2019, 7:28am

Show us the output of:

nvidia-smi topo --matrix

malakudi · April 8, 2019, 8:13am

Here it is:

        GPU0    GPU1    GPU2    CPU Affinity
GPU0     X      PHB     PHB     0-31
GPU1    PHB      X      PHB     0-31
GPU2    PHB     PHB      X      0-31

anon56509511 · April 8, 2019, 8:20am

A lot of information is missing: the frame resolution and format (1080p? RGBA? YUV?) and the source of the video (encoding or transcoding?).
I suppose you hit the memory bandwidth of the CPU memory channels (see https://en.wikichip.org/wiki/amd/ryzen_threadripper/2950x) through PCIe DMA access.
Theoretically that is 87.42 GiB/s (if the memory modules are placed correctly), while you are using 2500 * 1920 * 1080 * 4 / 1024 / 1024 / 1024 = 19 GiB/s * 2 (data in/data out) + interrupt latency + software latency … (compare with “Sandra 2018 Titanium’s memory bandwidth test”, max 40 GiB/s (58core https://www.pcworld.com/article/3298859/how-memory-bandwidth-is-killing-amds-32-core-threadripper-performance.html)).
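
The estimate above can be reproduced with a short calculation. This is a rough sketch only: raw_gib_per_s is a made-up helper name, it assumes every frame crosses system memory uncompressed, and 4 bytes per pixel corresponds to RGBA. The second call uses 1.5 bytes per pixel, which is what an NV12 (YUV 4:2:0) surface would need; whether either format applies here is an assumption.

```shell
#!/bin/bash
# raw_gib_per_s FPS WIDTH HEIGHT BYTES_PER_PIXEL -> one-way traffic in GiB/s
# (hypothetical helper for illustration)
raw_gib_per_s() {
    awk -v fps="$1" -v w="$2" -v h="$3" -v bpp="$4" \
        'BEGIN { printf "%.1f\n", fps * w * h * bpp / (1024 * 1024 * 1024) }'
}

raw_gib_per_s 2500 1920 1080 4      # RGBA, 4 bytes/pixel   -> 19.3
raw_gib_per_s 2500 1920 1080 1.5    # NV12, 1.5 bytes/pixel -> 7.2
```

Doubling the first figure for data in and data out gives roughly 38.6 GiB/s, which is where the comparison with the 40 GiB/s measurement comes from.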

malakudi · April 8, 2019, 8:52am

This is the script I am testing with:

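#!/bin/bash
# Launch 20 concurrent transcodes on the GPU whose index is passed as $1.
DEVICEID=$1
for i in $(seq 1 20)
do
    # decode (cuvid) -> deinterlace -> scale to 720p -> encode (nvenc), all on the GPU
    ffmpeg -nostdin -loglevel error -stats \
        -hwaccel_device "$DEVICEID" -hwaccel cuvid -c:v h264_cuvid -surfaces 12 \
        -f mpegts -i input_1080i.ts \
        -vf yadif_cuda=1:-1:1,scale_npp=w=1280:h=720 \
        -c:v h264_nvenc -preset medium \
        -refs 3 -bf 3 -rc-lookahead 30 \
        -b_ref_mode middle -temporal-aq 1 \
        -acodec copy -f mpegts -y /dev/null &
done
wait    # block until every background ffmpeg has exited
echo done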

Input is a 1080i50 h264 mpegts file.
You can get it from http://207.154.237.57/files/input_1080i.ts

Decoding → deinterlacing → scaling → encoding all happen on the GPU, so I don’t see any memory bandwidth limits anywhere.

Removing “-refs 3 -bf 3 -rc-lookahead 30 -b_ref_mode middle -temporal-aq 1” changes the numbers, but still not all GPU power is used: running on only one GPU gives 86 * 20 = 1720 fps, running on two GPUs gives 80 * 40 = 3200 fps, and running on three GPUs gives 52 * 60 = 3120 fps. So we observe a performance degradation when adding 20 more instances on a separate third GPU.

So, to correct my original post, it is not a hard limit of 2500 fps, but there is a limit somewhere running multiple instances.

malakudi · April 8, 2019, 9:05am

Changing scaling to scale_npp=w=704:h=576 does not change the results at all, so the bottleneck is not in the pixel throughput of nvenc. When running all 60 instances, checking with nvidia-smi dmon shows underutilization of nvdec (it reaches only 60% on all three GPUs).
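
For anyone who wants to repeat the utilization check, a small filter can average the decoder column of the dmon output per GPU. This is a sketch under assumptions: avg_dec_util is a made-up helper name, and it expects the usual dmon layout with a "# gpu ..." header row that names a "dec" column.

```shell
#!/bin/bash
# avg_dec_util: read `nvidia-smi dmon -s u` output on stdin and print the
# average decoder utilization per GPU (hypothetical helper for illustration).
avg_dec_util() {
    awk '
        /^#/ {
            # Header rows start with "#"; strip it so column positions
            # line up with the data rows, then locate the "dec" column.
            hdr = $0
            sub(/^#[ \t]*/, "", hdr)
            k = split(hdr, name, /[ \t]+/)
            if (name[1] == "gpu")
                for (i = 1; i <= k; i++)
                    if (name[i] == "dec") col = i
            next
        }
        # Data rows: first field is the GPU index; skip non-numeric values.
        col && $col ~ /^[0-9]+$/ { sum[$1] += $col; n[$1]++ }
        END {
            for (g in sum) printf "GPU%s dec avg %.0f%%\n", g, sum[g] / n[g]
        }
    ' | sort
}

# Usage on a machine with the NVIDIA driver installed (10 samples):
#   nvidia-smi dmon -s u -c 10 | avg_dec_util
```

Run it while all sessions are active; the same approach works for the "enc" column by changing the column name in the filter.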

malakudi · April 8, 2019, 12:21pm

I did a simpler test that exercises only the decoder and filters. Here is the sample script:

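#!/bin/bash
# Launch 20 concurrent decode+filter jobs (no encoder) on the GPU passed as $1.
DEVICEID=$1
for i in $(seq 1 20)
do
    # decode (cuvid) -> deinterlace -> scale to 720p, output discarded
    ffmpeg -loglevel fatal -stats \
        -hwaccel cuvid -hwaccel_device "$DEVICEID" \
        -c:v h264_cuvid -surfaces 12 \
        -f mpegts -i input_1080i.ts \
        -vf yadif_cuda=1:-1:1,scale_npp=w=1280:h=720 \
        -f null -y /dev/null &
done
wait    # block until every background ffmpeg has exited
echo done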

I even increased the session count to 30. There is no limit in this test: nvdec is 100% utilized on all three GPUs, and there is no dramatic reduction when running 30, 60 or 90 sessions on one, two or three GPUs concurrently.

Results of the above test:
20 sessions on one GPU, 20 * 90 => 1800 fps
40 sessions on two GPUs, 40 * 88 => 3520 fps
60 sessions on three GPUs, 60 * 86 => 5160 fps

So the issue must be in running multiple nvenc sessions.

malakudi · April 8, 2019, 5:56pm

The issue seems to be something in the ffmpeg code.

Before the following three commits, there is no problem.

@langdalepl: If I remember correctly, there was a ticket in trac from a user mentioning the same issue. I can’t find it right now.

Applying the following patch to current git fixes the issue, but I don’t know how to explain it.

--- ffmpeg/libavcodec/nvenc.c	2019-04-08 20:53:19.745925070 +0300
+++ ffmpeg/libavcodec/nvenc.c	2019-04-08 20:55:51.619074973 +0300
@@ -1846,13 +1846,6 @@
                 res = nvenc_print_error(avctx, nv_status, "Failed unmapping input resource");
                 goto error;
             }
-            nv_status = p_nvenc->nvEncUnregisterResource(ctx->nvencoder, ctx->registered_frames[tmpoutsurf->reg_idx].regptr);
-            if (nv_status != NV_ENC_SUCCESS) {
-                res = nvenc_print_error(avctx, nv_status, "Failed unregistering input resource");
-                goto error;
-            }
-            ctx->registered_frames[tmpoutsurf->reg_idx].ptr = NULL;
-            ctx->registered_frames[tmpoutsurf->reg_idx].regptr = NULL;
         } else if (ctx->registered_frames[tmpoutsurf->reg_idx].mapped < 0) {
             res = AVERROR_BUG;
             goto error;

brainiarc7 · April 8, 2019, 6:05pm

Hmm, could you open a ticket on trac.ffmpeg.org with this?
This is a serious regression.

malakudi · April 8, 2019, 6:08pm

@brainiarc7: There was a ticket on trac for the same issue, but the original author didn’t explain the solution well for current git and it went unnoticed. I can’t find it right now, but I will, and then I’ll reopen it.

brainiarc7 · April 8, 2019, 6:16pm

Cool, thanks.
Remember to propose the patch above.

malakudi · April 8, 2019, 6:48pm

Found it, this was the ticket:
http://trac.ffmpeg.org/ticket/7674

Will update its status there.

sharma.mayank2125 · May 18, 2019, 3:35pm

@malakudi any pointers on how to capture multiple RTSP streams simultaneously and decode the frames? Should I spawn a new CPU process for each?

val.zapod.vz · August 20, 2022, 7:55am

Is the issue still fixed with the latest ffmpeg 5.1 and master?
