Encoding multiple videos limited to 2 encodes

I am playing around with ffmpeg + nvenc and have come up against the 2 encode limit.

Is there any way around this limit on the GTX970?

Is there a card which will do more than 2 encodes?

Is there a website or PDF that documents these limitations? I have looked and can’t find anything.

thanks

joolz

I am working against the same difficulty. It seems you need to get a Quadro version to eliminate the limit, though I believe you could also purchase 4 GTX 960 for just over $100 each and be able to encode 8 sessions.

I haven’t found any data to say whether you would be better off with a single Quadro M6000 for $2500 or a Tesla M60 for $3000 rather than just buying a few more cards. That said, there is a limit to how many PCIe slots you can use, so you would top out at 8 sessions with the GTX series but could potentially do 100 sessions or so (maybe) if you felt like dumping $20k into the high-end cards. As I said, I haven’t found any great data other than Wikipedia.

For what it is worth, I had a Quadro K2200 that I tested with FFMPEG using the following command:

ffmpeg -i GOPR0012.MP4 -c:v h264_nvenc output.mp4

I ran 7 simultaneous encodes on a GoPro file with these stats:

Stream #0:0(eng): Video: h264 (h264_nvenc) (Main) ([33][0][0][0] / 0x0021), yuv420p, 3840x2160 [SAR 1:1 DAR 16:9], q=-1--1, 2000 kb/s, 23.98 fps, 24k tbn, 23.98 tbc (default)

When I started the first process, the speed was about 1.5x, but after starting 7 processes it dropped to 0.581x. Using htop, I can see that I am not maxing out any CPU core:

  1  [######                               11.9%]   9  [##########*                          23.5%]   17 [#######                              15.1%]   25 [#########                            19.1%]
  2  [#####################*               49.7%]   10 [##################*                  43.4%]   18 [##########                           23.2%]   26 [                                      0.0%]
  3  [###########*                         25.0%]   11 [#######                              14.6%]   19 [#####                                11.3%]   27 [####                                  7.3%]
  4  [##########                           23.2%]   12 [############                         27.2%]   20 [##########*                          22.4%]   28 [######                               13.8%]
  5  [#####                                11.8%]   13 [############                         26.5%]   21 [                                      0.0%]   29 [##                                    4.6%]
  6  [#################*                   40.8%]   14 [###############*                     34.9%]   22 [####                                  7.9%]   30 [#########                            19.7%]
  7  [#########*                           19.9%]   15 [####                                  9.2%]   23 [########                             17.1%]   31 [##############*                      32.7%]
  8  [############*                        27.6%]   16 [#################                    38.4%]   24 [###                                   6.6%]   32 [#########                            20.4%]
  Mem[|||||||#*                                                                         10.7G/157G]   Tasks: 48, 149 thr; 9 running 
  Swp[                                                                                    0K/4.00G]   Load average: 4.56 2.79 1.20 
                                                                                                      Uptime: 17:37:29

  PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command                                                                                                                                      
23970 root       20   0  173G 1360M  885M S 97.0  0.8  2:12.86 ffmpeg -i GOPR0012.MP4 -c:v h264_nvenc output5.mp4                                                                                           
23894 root       20   0  173G 1362M  885M R 97.0  0.8  3:33.77 ffmpeg -i GOPR0012.MP4 -c:v h264_nvenc output2.mp4
23874 root       20   0  173G 1362M  885M R 94.3  0.8  3:40.42 ffmpeg -i GOPR0012.MP4 -c:v h264_nvenc output1.mp4
24004 root       20   0  173G 1359M  885M R 93.7  0.8  2:09.87 ffmpeg -i GOPR0012.MP4 -c:v h264_nvenc output6.mp4
23914 root       20   0  173G 1362M  885M S 90.4  0.8  3:24.82 ffmpeg -i GOPR0012.MP4 -c:v h264_nvenc output3.mp4
23934 root       20   0  173G 1362M  885M S 89.1  0.8  3:27.35 ffmpeg -i GOPR0012.MP4 -c:v h264_nvenc output4.mp4
23798 root       20   0  173G 1370M  885M S 87.7  0.9  6:44.86 ffmpeg -i GOPR0012.MP4 -c:v h264_nvenc output.mp4

And nvidia-smi shows the following:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro K2200        Off  | 0000:42:00.0     Off |                  N/A |
| 45%   60C    P0     7W /  39W |   3965MiB /  4041MiB |     20%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     23798    C   ffmpeg                                         565MiB |
|    0     23874    C   ffmpeg                                         565MiB |
|    0     23894    C   ffmpeg                                         565MiB |
|    0     23914    C   ffmpeg                                         565MiB |
|    0     23934    C   ffmpeg                                         565MiB |
|    0     23970    C   ffmpeg                                         565MiB |
|    0     24004    C   ffmpeg                                         565MiB |
+-----------------------------------------------------------------------------+

I tried to start another encode, but ran out of memory on the card. If you aren’t doing 4K video, you could run more encodes simultaneously, as they take less memory per process. I also have a Grid M40, which was a purchasing mistake (https://devtalk.nvidia.com/default/topic/976633/nvidia-gtx-960-to-grid-m40-upgrade-doesn-t-work-/#5018751). I will run some tests on it just for fun and let you know what happens.
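To put numbers on the memory limit, here is a minimal Python sketch that estimates how many encodes fit on a card from the nvidia-smi figures above (4041MiB total, 565MiB per ffmpeg process). The function name and the reserved_mib parameter are my own for illustration, and it assumes every process takes the same amount of memory and that nothing else limits the session count.

```python
def max_sessions_by_memory(total_mib: int, per_session_mib: int, reserved_mib: int = 0) -> int:
    """Estimate how many encode processes fit in GPU memory.

    Assumes each process uses the same amount of memory and ignores
    any driver-imposed session limit.
    """
    if per_session_mib <= 0:
        raise ValueError("per_session_mib must be positive")
    usable = total_mib - reserved_mib
    return max(usable // per_session_mib, 0)


# Figures taken from the Quadro K2200 nvidia-smi output (4K source file).
print(max_sessions_by_memory(4041, 565))
```

With those figures the estimate lines up with the 7 processes that ran before the eighth failed. Lower-resolution sources should give a smaller per-process figure and so a higher count.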

FYI, the computer I am running it on is a Dell R720xd with 160GB RAM and dual 8-core 2.9 GHz Xeons, on a RAID 1 array of two 10,000 RPM drives.

Running ffmpeg with the libx264 codec gives me an output speed of around 0.755x, so the GPU is definitely a great addition. If you want H.265 / HEVC, make sure you get a second-generation Maxwell card.
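For comparing those speeds, a rough Python sketch of the aggregate throughput may help. It assumes all 7 NVENC processes ran at the same 0.581x (I only quoted one figure) and that the 0.755x libx264 number is for a single process, so treat the ratio as a ballpark only.

```python
def aggregate_speed(per_process_speed: float, processes: int) -> float:
    """Total realtime multiple across identical parallel encodes."""
    return per_process_speed * processes


single_nvenc = aggregate_speed(1.5, 1)    # one NVENC encode on its own
seven_nvenc = aggregate_speed(0.581, 7)   # assumes all seven ran at 0.581x
libx264 = aggregate_speed(0.755, 1)       # assumes a single CPU encode

print(f"1 NVENC process  : {single_nvenc:.2f}x")
print(f"7 NVENC processes: {seven_nvenc:.2f}x total")
print(f"libx264          : {libx264:.2f}x")
print(f"7 NVENC vs libx264: {seven_nvenc / libx264:.1f} times the throughput")
```

So each encode gets slower as you add processes, but the total work done per second still goes up.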

After testing with the Grid M40, it seems just a little slower than the K2200:

frame=  866 fps=9.8 q=46.0 size=    8855kB time=00:00:36.73 bitrate=1974.7kbits/s speed=0.417x

and I get the same 7 streams before running out of memory… BUT the card has 4 GPUs, and I was only using 1:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID M40            Off  | 0000:44:00.0     Off |                  N/A |
| 59%   60C    P0    24W /  53W |   3965MiB /  4041MiB |     19%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GRID M40            Off  | 0000:45:00.0     Off |                  N/A |
| 45%   47C    P8     8W /  53W |      2MiB /  4041MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GRID M40            Off  | 0000:46:00.0     Off |                  N/A |
| 38%   38C    P8     8W /  53W |      2MiB /  4041MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GRID M40            Off  | 0000:47:00.0     Off |                  N/A |
| 39%   42C    P8     8W /  53W |      2MiB /  4041MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      3005    C   ffmpeg                                         565MiB |
|    0      3033    C   ffmpeg                                         565MiB |
|    0      3053    C   ffmpeg                                         565MiB |
|    0      3074    C   ffmpeg                                         565MiB |
|    0      3094    C   ffmpeg                                         565MiB |
|    0      3114    C   ffmpeg                                         565MiB |
|    0      3134    C   ffmpeg                                         565MiB |
+-----------------------------------------------------------------------------+

I know you can specify which GPU to use with something like this:

ffmpeg -hwaccel_device 0 -hwaccel cuvid -c:v h264_cuvid -i input -vf scale_npp=-1:720 -c:v h264_nvenc -preset slow output.mkv

https://trac.ffmpeg.org/wiki/HWAccelIntro

I didn’t take it that far since I am not keeping the card, but I would assume similar results for each GPU.
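If someone wants to try spreading jobs over all 4 GPUs, a sketch like the following could build one command per input, cycling through the device indexes. It reuses the -hwaccel_device form from the command above (without the scaling filter and preset); the function name, file names and output naming are hypothetical, and I have not verified on this card that the device option steers the encoder as well as the decoder.

```python
from itertools import cycle


def build_commands(inputs, gpu_count):
    """Pair each input with a GPU index (round-robin) and build an ffmpeg argv."""
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    commands = []
    for src, gpu in zip(inputs, cycle(range(gpu_count))):
        # Hypothetical output naming: input stem plus the GPU index.
        out = f"{src.rsplit('.', 1)[0]}_gpu{gpu}.mkv"
        commands.append([
            "ffmpeg", "-hwaccel_device", str(gpu), "-hwaccel", "cuvid",
            "-c:v", "h264_cuvid", "-i", src,
            "-c:v", "h264_nvenc", out,
        ])
    return commands


# Hypothetical input names, only to show the assignment.
for argv in build_commands(["a.mp4", "b.mp4", "c.mp4", "d.mp4", "e.mp4"], 4):
    print(" ".join(argv))
```

This only prints the commands. Running them, and checking in nvidia-smi that each process lands on the intended GPU, is left to whoever has the hardware.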

I have a Quadro M4000 coming tomorrow, I can let you know how it does if you need more info.

This is weird: I had a reply but can’t see it here, so I am replying to myself and ajhalls.

I have ordered the M4000 as a test, will get it tomorrow.

We already purchased a specialized HEVC encoder card, but that is up in the $7500 range and it was limited to 4 HD streams. The GTX is doing 2 HD at 9% usage.

I’m wondering if the limit is in the drivers or the hardware pipelines?

Will see if we can beat the 4 HD with the Quadro card, or 16 SD channels?

joolz

It is an intentional limit from NVIDIA to drive sales of the higher-end cards. I have seen some threads in other places about hacking the cards to unlock them, but didn’t have any desire to get in that deep myself.

I got my M4000 today and tried it out. I currently have 33 simultaneous encodes going using GNU Parallel, with some room to spare memory-wise in case any incoming videos take up more space. The first encode, which I started manually, was a 1080p video running at around 150fps, but after starting the other 32 it dropped to about 40fps. I couldn’t get any detailed statistics on the other 32, as they were running in a script that didn’t output anything.
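I used GNU Parallel for this, but the same idea of capping the number of simultaneous encodes can be sketched in Python. This is not the script I ran, just a minimal stand-in: run_batch is a name I made up, and it returns the exit codes so you can at least see which jobs failed.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_batch(commands, max_parallel):
    """Run each argv list as a subprocess, at most max_parallel at a time.

    Returns the exit codes in the same order as the input commands.
    """
    def run_one(argv):
        result = subprocess.run(
            argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        return result.returncode

    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(run_one, commands))


# Example call (hypothetical file names, not executed here):
# jobs = [["ffmpeg", "-n", "-i", f"in{i}.mp4", "-c:v", "h264_nvenc", f"out{i}.mp4"]
#         for i in range(33)]
# exit_codes = run_batch(jobs, max_parallel=6)
```

Changing max_parallel makes it easy to compare, say, 6 jobs at a time against all 33 at once on the same set of files.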

Here are a couple things that might be interesting to you:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro M4000        Off  | 0000:42:00.0     Off |                  N/A |
| 56%   68C    P0    48W / 120W |   5794MiB /  8120MiB |     14%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      3423    C   ffmpeg                                         183MiB |
|    0      3558    C   ffmpeg                                         114MiB |
|    0      3623    C   ffmpeg                                         183MiB |
|    0      3624    C   ffmpeg                                         183MiB |
|    0      3625    C   ffmpeg                                         183MiB |
|    0      3626    C   ffmpeg                                         183MiB |
|    0      3627    C   ffmpeg                                         183MiB |
|    0      3628    C   ffmpeg                                         183MiB |
|    0      3629    C   ffmpeg                                         183MiB |
|    0      3865    C   ffmpeg                                         183MiB |
|    0      3866    C   ffmpeg                                         183MiB |
|    0      3928    C   ffmpeg                                         183MiB |
|    0      3978    C   ffmpeg                                         183MiB |
|    0      4015    C   ffmpeg                                         183MiB |
|    0      4049    C   ffmpeg                                         114MiB |
|    0      4078    C   ffmpeg                                         183MiB |
|    0      4103    C   ffmpeg                                         183MiB |
|    0      4422    C   ffmpeg                                         183MiB |
|    0      4558    C   ffmpeg                                         114MiB |
|    0      4634    C   ffmpeg                                         183MiB |
|    0      4717    C   ffmpeg                                         183MiB |
|    0      4803    C   ffmpeg                                         183MiB |
|    0      4881    C   ffmpeg                                         114MiB |
|    0      4955    C   ffmpeg                                         183MiB |
|    0      5007    C   ffmpeg                                         183MiB |
|    0      5067    C   ffmpeg                                         183MiB |
|    0      5068    C   ffmpeg                                         183MiB |
|    0      5136    C   ffmpeg                                         183MiB |
|    0      5181    C   ffmpeg                                         183MiB |
|    0      5222    C   ffmpeg                                         183MiB |
|    0      5255    C   ffmpeg                                         183MiB |
|    0      5285    C   ffmpeg                                         183MiB |
|    0      5310    C   ffmpeg                                         183MiB |
+-----------------------------------------------------------------------------+

I guess I don’t understand what Volatile GPU-Util means. With 33 videos being encoded, I expected it to be higher, like CPU load.

Here is htop:

  1  [#####*                   15.2%]    9  [#####*                   18.4%]   17 [##*                       7.2%]    25 [#*                        3.3%]
  2  [########*                24.1%]    10 [#######*                 23.8%]   18 [##                        4.6%]    26 [#*                        4.0%]
  3  [######*                  20.0%]    11 [########*                27.0%]   19 [##*                       7.3%]    27 [#*                        2.7%]
  4  [########*                25.2%]    12 [#######*                 23.2%]   20 [##*                       5.9%]    28 [#*                        2.0%]
  5  [#####*                   17.9%]    13 [#######*                 21.1%]   21 [#*                        4.0%]    29 [#*                        2.6%]
  6  [########*                27.0%]    14 [######*                  19.2%]   22 [#*                        2.6%]    30 [#*                        3.9%]
  7  [#######*                 24.3%]    15 [#######*                 23.3%]   23 [#*                        3.3%]    31 [##*                       4.6%]
  8  [########*                26.3%]    16 [#########*               28.9%]   24 [#*                        3.3%]    32 [#                         2.6%]
  Mem[||||||||#**                                                17.0G/157G]   Tasks: 104, 643 thr; 5 running 
  Swp[                                                             0K/4.00G]   Load average: 4.55 4.48 3.38 
                                                                               Uptime: 00:31:07

And here is iotop to see hard drive utilization:

Total DISK READ :    1068.54 K/s | Total DISK WRITE :    1495.22 K/s
Actual DISK READ:    1068.54 K/s | Actual DISK WRITE:     233.74 K/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND   
 4422 be/4 root        0.00 B/s   48.23 K/s  0.00 %  0.00 % ffmpeg -n -i 
 4103 be/4 root      118.73 K/s   29.68 K/s  0.00 %  0.00 % ffmpeg -n -i 
 3558 be/4 root        0.00 B/s   18.55 K/s  0.00 %  0.00 % ffmpeg -n -i 
 4558 be/4 root        0.00 B/s   22.26 K/s  0.00 %  0.00 % ffmpeg -n -i 
 4634 be/4 root      118.73 K/s   77.91 K/s  0.00 %  0.00 % ffmpeg -n -i 
 4717 be/4 root        0.00 B/s   66.78 K/s  0.00 %  0.00 % ffmpeg -n -i 
 4803 be/4 root      118.73 K/s   55.65 K/s  0.00 %  0.00 % ffmpeg -n -i 
 4881 be/4 root        0.00 B/s   14.84 K/s  0.00 %  0.00 % ffmpeg -n -i 
 5007 be/4 root      118.73 K/s  103.89 K/s  0.00 %  0.00 % ffmpeg -n -i 
 5068 be/4 root        0.00 B/s   81.62 K/s  0.00 %  0.00 % ffmpeg -n -i 
 5136 be/4 root      118.73 K/s   44.52 K/s  0.00 %  0.00 % ffmpeg -n -i 
 5181 be/4 root      118.73 K/s   29.68 K/s  0.00 %  0.00 % ffmpeg -n -i 
 5222 be/4 root      118.73 K/s   48.23 K/s  0.00 %  0.00 % ffmpeg -n -i 
 5255 be/4 root        0.00 B/s   59.36 K/s  0.00 %  0.00 % ffmpeg -n -i 
 5285 be/4 root        0.00 B/s   66.78 K/s  0.00 %  0.00 % ffmpeg -n -i 
 5310 be/4 root        0.00 B/s   51.94 K/s  0.00 %  0.00 % ffmpeg -n -i 
 4955 be/4 root        0.00 B/s   44.52 K/s  0.00 %  0.00 % ffmpeg -n -i 
 3423 be/4 root        0.00 B/s   48.23 K/s  0.00 %  0.00 % ffmpeg -n -i 
 3623 be/4 root        0.00 B/s   59.36 K/s  0.00 %  0.00 % ffmpeg -n -i 
 3626 be/4 root        0.00 B/s    3.71 K/s  0.00 %  0.00 % ffmpeg -n -i 
 5067 be/4 root        0.00 B/s    3.71 K/s  0.00 %  0.00 % ffmpeg -n -i 
 3866 be/4 root        0.00 B/s   44.52 K/s  0.00 %  0.00 % ffmpeg -n -i 
 3928 be/4 root        0.00 B/s   37.10 K/s  0.00 %  0.00 % ffmpeg -n -i 
 3978 be/4 root      118.73 K/s  115.02 K/s  0.00 %  0.00 % ffmpeg -n -i 
 4015 be/4 root        0.00 B/s   70.49 K/s  0.00 %  0.00 % ffmpeg -n -i 
 4049 be/4 root        0.00 B/s   22.26 K/s  0.00 %  0.00 % ffmpeg -n -i 
 4078 be/4 root      118.73 K/s   51.94 K/s  0.00 %  0.00 % ffmpeg -n -i 
 4096 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % ffmpeg -n -i

So losing roughly two thirds of the per-encode speed in exchange for running 33 encodes at once is a great outcome.

After some time playing with it, it looks like 5-6 streams is optimal. I noticed that there was not enough hard drive activity for 32 files to be written. I am getting around 100fps with the 6 files and a write speed on the hard drive of around 3Mb/s. I am not sure what the bottleneck is though, since the hard drive can do 150 Mb/s when copying files and there is still leftover memory and CPU. I guess the GPU is overloaded, but I am not sure where to see the reported usage if Volatile GPU-Util, which is only around 20%, isn’t it.

I still have 140GB unused ram, plenty of CPU, and the HDD isn’t overloaded unless it is a seek timing issue. I guess I could try a ramdisk to test it, but if anyone has a tip I would appreciate it.

Check nvdec/nvenc usage:

nvidia-smi dmon

I think NVDEC/NVENC are the bottlenecks. GPU-Util reports CUDA usage, which only matters when you are deinterlacing or scaling video on the GPU.
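If you want to log those numbers rather than watch them scroll by, a small parser along these lines could pull the enc and dec columns out of captured dmon text. The sample text below is invented for illustration, not measured on any card in this thread, and the column set differs between driver versions, so the parser reads the names from the header line instead of hard-coding positions.

```python
# Invented sample in the general shape of dmon output; values are NOT real measurements.
SAMPLE = """\
# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     C     %     %     %     %   MHz   MHz
    0    48    68     -    14     9   100    35  3004  1100
    0    47    68     -    15     9    98    33  3004  1100
"""


def parse_dmon(text):
    """Turn dmon-style text into a list of {column name: value string} dicts."""
    columns = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # The first header line has the column names; later ones
            # (units, periodic repeats) are skipped.
            if columns is None:
                columns = line.lstrip("#").split()
            continue
        if columns is not None:
            rows.append(dict(zip(columns, line.split())))
    return rows


for row in parse_dmon(SAMPLE):
    print(f"gpu {row['gpu']}: enc {row['enc']}%  dec {row['dec']}%  sm {row['sm']}%")
```

If enc sits near 100 while sm stays low, that would fit the encoder being the limit rather than the CUDA cores.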