I have reached a performance limit on a single computer using multiple GPUs. There seems to be a limit at around 2500 fps.
My test server is a Threadripper 2950X with three (3) Quadro P5000.
I run 20 concurrent encodes on a single Quadro P5000 and get 65 fps per session => 1300 fps total.
Then I run another 20 concurrent encodes on the second Quadro P5000 on the same PC and get 62.5 fps per session => 2500 fps total (and not 2600).
Finally I run another 20 concurrent encodes on the third Quadro P5000 on the same PC and get 41 fps per session => 2460 fps total (and not 3900).
So adding a third Quadro P5000 on the same PC offers no performance improvement at all!!!
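To make the scaling loss explicit, here is a small sketch that recomputes the totals from the per-session numbers above and compares them with ideal linear scaling. The function name `scaling_report` is just an illustrative helper, and "ideal" assumes that every GPU could match the single-GPU result.

```python
# Measurements copied from the post: (GPUs in use, total sessions, fps per session).
runs = [(1, 20, 65.0), (2, 40, 62.5), (3, 60, 41.0)]

def scaling_report(runs):
    """Return (gpus, total_fps, ideal_fps, efficiency) for each run.

    The first run is used as the per-GPU baseline (assumption: linear
    scaling would mean every GPU delivers the same as that baseline).
    """
    base_gpus, base_sessions, base_fps = runs[0]
    per_gpu_baseline = base_sessions * base_fps / base_gpus
    report = []
    for gpus, sessions, fps in runs:
        total = sessions * fps
        ideal = per_gpu_baseline * gpus
        report.append((gpus, total, ideal, total / ideal))
    return report

for gpus, total, ideal, eff in scaling_report(runs):
    print(f"{gpus} GPU(s): {total:.0f} fps measured, {ideal:.0f} fps ideal, {eff:.0%} of ideal")
```

With these inputs the second GPU still reaches about 96% of ideal, while the three-GPU run drops to about 63%.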
Setting encoding quality to fast preset doesn’t change the result (tests were done with medium preset), still capped at around 2500 fps. CPU usage is very low, so this is not a bottleneck.
Any thoughts from Nvidia?
PS: Driver version is 418.56, Linux Ubuntu 18.04 LTS, each session is using ffmpeg latest git.
Decoding -> deinterlacing -> scaling -> encoding all run entirely on the GPU, so I don’t see any memory bandwidth limits anywhere.
Removing “-refs 3 -bf 3 -rc-lookahead 30 -b_ref_mode middle -temporal-aq 1” changes the numbers, but still not all GPU power is used: running on only one GPU gives 86 * 20 = 1720 fps, running on two GPUs gives 80 * 40 = 3200 fps, and running on three GPUs gives 52 * 60 = 3120 fps. So we observe a performance degradation when adding 20 more instances on a separate third GPU.
So, to correct my original post, it is not a hard limit of 2500 fps, but there is a limit somewhere running multiple instances.
Changing scaling to scale_npp=w=704:h=576 does not change the results at all, so the bottleneck is not in the nvenc part. When running all 60 instances, checking with nvidia-smi dmon shows underutilization of nvdec (it reaches only 60% on all three GPUs).
I even increased sessions to 30. There is no limit in this test: nvdec is utilized 100% on all three GPUs, and there is no dramatic reduction when running 30, 60 or 90 sessions on 1, 2 or 3 GPUs concurrently.
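For anyone who wants to repeat the nvdec check, below is a rough sketch that averages one utilization column per GPU from captured `nvidia-smi dmon` text. It assumes the usual layout (a `#` header row with column names such as `gpu`, `sm`, `mem`, `enc`, `dec`, followed by a units row), which may differ between driver versions. The helper name `mean_utilization` is hypothetical, and the sample text is a made-up placeholder, not output from my server.

```python
from collections import defaultdict

def mean_utilization(dmon_text, column="dec"):
    """Average one utilization column per GPU from `nvidia-smi dmon` text."""
    names = None
    samples = defaultdict(list)
    for line in dmon_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "#":
            # The header row with "gpu" carries the column names;
            # the following "#" row only carries units and is skipped.
            if "gpu" in fields:
                names = fields[1:]
            continue
        if names is None or len(fields) != len(names):
            continue  # not a data row we know how to read
        row = dict(zip(names, fields))
        value = row.get(column, "-")
        if value != "-":  # "-" means the metric is not reported
            samples[int(row["gpu"])].append(float(value))
    return {gpu: sum(vals) / len(vals) for gpu, vals in samples.items()}

# Placeholder text in the assumed layout (illustrative values only),
# e.g. as captured from something like `nvidia-smi dmon -s u -c 2`.
sample = """\
# gpu    sm   mem   enc   dec
# Idx     %     %     %     %
    0    35    20    58    60
    1    34    19    57    61
    2    36    21    59    59
    0    37    22    60    62
"""
print(mean_utilization(sample, "dec"))
```

Reading the column names from the header rather than hard-coding positions keeps the sketch usable when dmon is run with a different set of metrics.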
Results of the above test:
20 sessions in one gpu, 20*90 => 1800 fps
40 sessions in two gpus, 40*88 => 3520 fps
60 sessions in three gpus, 60*86 => 5160 fps