Quadro P4000 encoder count

The main page of the NVIDIA Video Codec SDK says that the Quadro P4000 has 2 NVENC engines. In the section “NVENC - Hardware-Accelerated Video Encoding”, the “ENCODE PERFORMANCE” diagram shows “P4000/P5000/P6000” with x2 NVENC engines:
https://developer.nvidia.com/nvidia-video-codec-sdk

However, another page, the “GPU Support Matrix”, states that the Quadro P4000 has only one NVENC engine:
https://developer.nvidia.com/video-encode-decode-gpu-support-matrix#Encoder

The Quadro P4000 is based on the GP104 chip, which has two NVENC engines. But some people say that one of them is disabled in the P4000. Is that true? How many enabled NVENC engines does the Quadro P4000 really have?

My opinion: NVIDIA is not capable of correctly describing the capabilities of its own products (see https://devtalk.nvidia.com/default/topic/992447/).

Yeah, it’s true.
Do you have a Quadro P2000? Could you please check the performance of this card?
You need to run two ffmpeg encoding processes (run this command twice, simultaneously):

ffmpeg -hwaccel cuvid -c:v h264_cuvid -i 1080.h264 -c:v h264_nvenc -f null -

The source file 1080.h264 is available at: https://yadi.sk/d/7PIzHUEn3XxemM

I need to see the encoding fps of the P2000. I have a Quadro P4000, and on my system these commands give the following results:
frame= 5606 fps=258 q=28.0 Lsize=N/A time=00:03:53.79 bitrate=N/A speed=10.8x
frame= 5606 fps=257 q=28.0 Lsize=N/A time=00:03:53.79 bitrate=N/A speed=10.7x

If I run only one encoding process, I see this result:
frame= 5606 fps=466 q=28.0 Lsize=N/A time=00:03:53.79 bitrate=N/A speed=19.4x
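
To make the comparison explicit, here is a minimal Python sketch that sums the fps= values from the ffmpeg progress lines above. The helper name total_fps is hypothetical, and the reading of the result is an interpretation: if two independent encoder engines were active, the aggregate of two processes would be expected to approach twice the single-process figure.

```python
import re

# Final progress lines copied from the P4000 results above.
TWO_PROCESSES = [
    "frame= 5606 fps=258 q=28.0 Lsize=N/A time=00:03:53.79 bitrate=N/A speed=10.8x",
    "frame= 5606 fps=257 q=28.0 Lsize=N/A time=00:03:53.79 bitrate=N/A speed=10.7x",
]
ONE_PROCESS = [
    "frame= 5606 fps=466 q=28.0 Lsize=N/A time=00:03:53.79 bitrate=N/A speed=19.4x",
]

FPS_RE = re.compile(r"fps=\s*([0-9.]+)")


def total_fps(lines):
    """Sum the fps= values found in ffmpeg progress lines (hypothetical helper)."""
    return sum(float(FPS_RE.search(line).group(1)) for line in lines)


single = total_fps(ONE_PROCESS)
double = total_fps(TWO_PROCESSES)
scaling = double / single  # close to 1.0 suggests one shared engine

print(f"one process: {single:.0f} fps, two processes: {double:.0f} fps in total")
print(f"aggregate scaling: {scaling:.2f}x")
```

The aggregate of the two simultaneous processes is only slightly above the single-process rate, which is what sharing one encoder would look like.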

(P2000 passthrough to guest 4.16.14-200.fc27.x86_64, Nvidia driver 396.26, CUDA 9.2, Video Codec SDK 8.2, ffmpeg from git commit 8331e591)

# ###### with maximum clocks:
# nvidia-smi -ac 3504,1721
# ./ffmpeg -nostdin -hwaccel cuvid -c:v h264_cuvid -i /root/1080.h264 -c:v h264_nvenc -f null -
frame= 5606 fps=506 q=28.0 Lsize=N/A time=00:03:53.79 bitrate=N/A speed=21.1x
# nvidia-smi dmon -c 1
 # gpu   pwr  temp    sm   mem   enc   dec  mclk  pclk
 # Idx     W     C     %     %     %     %   MHz   MHz
     0    32    55    10    20   100    77  3499  1721
# for i in 1 2; do ./ffmpeg -nostdin -hwaccel cuvid -c:v h264_cuvid -i /root/1080.h264 -c:v h264_nvenc -f null - & done
frame= 5606 fps=254 q=28.0 Lsize=N/A time=00:03:53.79 bitrate=N/A speed=10.6x
frame= 5606 fps=254 q=28.0 Lsize=N/A time=00:03:53.79 bitrate=N/A speed=10.6x
# nvidia-smi dmon -c 1
 # gpu   pwr  temp    sm   mem   enc   dec  mclk  pclk
 # Idx     W     C     %     %     %     %   MHz   MHz
     0    31    51    12    20   100    72  3499  1721
# for i in 1 2 3 4; do ./ffmpeg -nostdin -hwaccel cuvid -c:v h264_cuvid -i /root/1080.h264 -c:v h264_nvenc -f null - & done
frame= 5606 fps=127 q=28.0 Lsize=N/A time=00:03:53.79 bitrate=N/A speed=5.31x
frame= 5606 fps=127 q=28.0 Lsize=N/A time=00:03:53.79 bitrate=N/A speed=5.31x
frame= 5606 fps=127 q=28.0 Lsize=N/A time=00:03:53.79 bitrate=N/A speed=5.31x
frame= 5606 fps=127 q=28.0 Lsize=N/A time=00:03:53.79 bitrate=N/A speed=5.31x
# nvidia-smi dmon -c 1
 # gpu   pwr  temp    sm   mem   enc   dec  mclk  pclk
 # Idx     W     C     %     %     %     %   MHz   MHz
     0    31    55    13    20   100    73  3499  1721
# nvidia-smi pmon -c 1
 # gpu        pid  type    sm   mem   enc   dec   command
 # Idx          #   C/G     %     %     %     %   name
     0      19117     C     5     7     8    12   ffmpeg         
     0      19118     C     2     3    33    12   ffmpeg         
     0      19119     C     1     1    16    30   ffmpeg         
     0      19120     C     4     6    41    18   ffmpeg         
# ###### with default clocks:
# nvidia-smi -ac 3504,1075
# ./ffmpeg -nostdin -hwaccel cuvid -c:v h264_cuvid -i /root/1080.h264 -c:v h264_nvenc -f null -
frame= 5606 fps=307 q=28.0 Lsize=N/A time=00:03:53.79 bitrate=N/A speed=12.8x

Many thanks. So the information on the main page of the NVIDIA Video Codec SDK is indeed wrong: the Quadro P4000 has only one enabled NVENC engine, not two as stated :(

Is this result for one encoding stream without overclocking?

It seems the difference in encoding performance between the P2000 and the P4000 is a result of the different clock rates only. At default settings the P4000 has about the same encoding performance as the overclocked P2000.
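
A rough way to check the "clock rates only" idea is to compare how the P2000's fps changes with its graphics clock. The sketch below uses only the two single-process P2000 measurements posted above (the clocks set with nvidia-smi -ac and the resulting fps); the variable names are made up for illustration, and treating throughput as proportional to the graphics clock is an assumption, not something NVIDIA documents here.

```python
# (graphics clock in MHz, single-process fps) on the P2000,
# copied from the test output earlier in this thread.
P2000_MAX = (1721, 506)
P2000_DEFAULT = (1075, 307)

# Ratio of the two clock settings and of the two measured frame rates.
clock_ratio = P2000_MAX[0] / P2000_DEFAULT[0]
fps_ratio = P2000_MAX[1] / P2000_DEFAULT[1]

# Frames per second delivered per MHz of graphics clock at each setting.
fps_per_mhz_max = P2000_MAX[1] / P2000_MAX[0]
fps_per_mhz_default = P2000_DEFAULT[1] / P2000_DEFAULT[0]

print(f"clock ratio: {clock_ratio:.2f}, fps ratio: {fps_ratio:.2f}")
print(f"fps per MHz: {fps_per_mhz_max:.3f} (max), {fps_per_mhz_default:.3f} (default)")
```

The two ratios come out close to each other, which is consistent with encoder throughput following the graphics clock on this card.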

The P2000 has lower default clocks, but they can be raised to the maximum clocks (this is not overclocking) - see “nvidia-smi -q” - https://devtalk.nvidia.com/default/topic/992447/#5148941.

The P4000 has a crippled GP104 chip (six SMs and one NVENC disabled due to HW errors in production) - see https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#Quadro_Pxxx_series (btw. I do not know where the wiki gets such invalid “clock rates”). Can you post “nvidia-smi -q” so we can see the true values for the P4000?

You get more SMs and more memory bandwidth with the P4000, but you are limited by NVENC (see “nvidia-smi dmon -c 1” in the previous posts). Even if you buy a P5000 with two NVENC engines, you will be limited by NVDEC when transcoding, because all NVIDIA chips have one NVDEC (see https://developer.nvidia.com/video-encode-decode-gpu-support-matrix#Decoder).

Is it safe to set maximum clocks on a card that is under encoding load 24/7/365?
Can I just save some money, buy a P2000 and raise the default clocks, or is it better to buy a P4000 and use the default clocks?

Timestamp                           : Mon Jun 18 20:04:14 2018
Driver Version                      : 391.58

Attached GPUs                       : 1
GPU 00000000:65:00.0
    Product Name                    : Quadro P4000
    Product Brand                   : Quadro
    Display Mode                    : Enabled
    Display Active                  : Enabled
    Persistence Mode                : N/A
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 4000
    Driver Model
        Current                     : WDDM
        Pending                     : WDDM
    Minor Number                    : N/A
    VBIOS Version                   : 86.04.56.00.0B
    MultiGPU Board                  : No
    Board ID                        : 0x6500
    GPU Part Number                 : 900-5G410-1750-000
    Inforom Version
        Image Version               : G410.0501.00.03
        OEM Object                  : 1.1
        ECC Object                  : N/A
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    GPU Virtualization Mode
        Virtualization mode         : None
    PCI
        Bus                         : 0x65
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x1BB110DE
        Bus Id                      : 00000000:65:00.0
        Sub System Id               : 0x11A310DE
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 1
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays since reset         : 0
        Tx Throughput               : 9000 KB/s
        Rx Throughput               : 27000 KB/s
    Fan Speed                       : 46 %
    Performance State               : P8
    Clocks Throttle Reasons
        Idle                        : Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
            HW Thermal Slowdown     : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost                  : Not Active
        SW Thermal Slowdown         : Not Active
        Display Clock Setting       : Not Active
    FB Memory Usage
        Total                       : 8192 MiB
        Used                        : 449 MiB
        Free                        : 7743 MiB
    BAR1 Memory Usage
        Total                       : 256 MiB
        Used                        : 229 MiB
        Free                        : 27 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 20 %
        Memory                      : 14 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Encoder Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    Ecc Mode
        Current                     : N/A
        Pending                     : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
            Double Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
        Aggregate
            Single Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
            Double Bit
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
    Retired Pages
        Single Bit ECC              : N/A
        Double Bit ECC              : N/A
        Pending                     : N/A
    Temperature
        GPU Current Temp            : 32 C
        GPU Shutdown Temp           : 96 C
        GPU Slowdown Temp           : 93 C
        GPU Max Operating Temp      : N/A
        Memory Current Temp         : N/A
        Memory Max Operating Temp   : N/A
    Power Readings
        Power Management            : Supported
        Power Draw                  : 9.62 W
        Power Limit                 : 105.00 W
        Default Power Limit         : 105.00 W
        Enforced Power Limit        : 105.00 W
        Min Power Limit             : 60.00 W
        Max Power Limit             : 105.00 W
    Clocks
        Graphics                    : 139 MHz
        SM                          : 139 MHz
        Memory                      : 405 MHz
        Video                       : 544 MHz
    Applications Clocks
        Graphics                    : 1202 MHz
        Memory                      : 3802 MHz
    Default Applications Clocks
        Graphics                    : 1202 MHz
        Memory                      : 3802 MHz
    Max Clocks
        Graphics                    : 1708 MHz
        SM                          : 1708 MHz
        Memory                      : 3802 MHz
        Video                       : 1544 MHz
    Max Customer Boost Clocks
        Graphics                    : 1708 MHz
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A

I suppose that the P4000 activates “boost” (to 1708 MHz instead of the default 1202 MHz) when you run encoding (verify this with “nvidia-smi dmon”), and I suppose that the P2000 has a faulty “boost” management mechanism (no surprise to me), so I have to set the “graphics” clock manually. BTW, NVIDIA power management (P-states) is unusable in many cases (see https://gridforums.nvidia.com/default/topic/378/).

There is a safety mechanism, “Clocks Throttle Reasons”, that should lower the clocks for various reasons (see that section of “nvidia-smi -q”) when running at maximum speed.
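
If you want to watch that section while the card is under load, a small parser over the text of “nvidia-smi -q” might look like the sketch below. It assumes the section is named and laid out exactly as in the P4000 output posted in this thread (other driver versions may differ), and the function name active_throttle_reasons is hypothetical. The sample text is an excerpt of that posted output.

```python
# Excerpt of the "nvidia-smi -q" output posted above for the P4000.
SAMPLE = """\
    Clocks Throttle Reasons
        Idle                        : Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
            HW Thermal Slowdown     : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost                  : Not Active
        SW Thermal Slowdown         : Not Active
        Display Clock Setting       : Not Active
    FB Memory Usage
        Total                       : 8192 MiB
"""


def active_throttle_reasons(report):
    """Return the throttle reasons marked 'Active' in nvidia-smi -q text."""
    reasons = []
    in_section = False
    section_indent = 0
    for line in report.splitlines():
        stripped = line.strip()
        indent = len(line) - len(line.lstrip())
        if stripped == "Clocks Throttle Reasons":
            in_section = True
            section_indent = indent
            continue
        if in_section:
            # A non-empty line at the same or shallower indent starts the next section.
            if stripped and indent <= section_indent:
                break
            name, sep, value = stripped.partition(":")
            if sep and value.strip() == "Active":
                reasons.append(name.strip())
    return reasons


print(active_throttle_reasons(SAMPLE))
```

In practice the report text would come from running nvidia-smi -q on the machine during a long encode; anything other than “Idle” showing up there would be a hint that the clocks are being pulled down.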

I suppose it should be OK for 24/7: the previous P2000 test shows low utilization of the SMs (~10%) and memory (~20%), which saves power, and the total power draw is only ~30 W (less than half of the 75 W TDP).

… but all is speculation …

Hi naviset,

Just to clarify, what you are seeing on the “Encode Performance” section of the Video Codec SDK Main page refers to the number of SIMULTANEOUS ENCODING SESSIONS. This is different from the NVENC engines.

Thanks,

Ryan Park

Problematic P4000 in images on main page (based on performance tests in this thread):

Actually, yes you two are correct,

That is an error we should have updated on the website. Thank you for calling it out, that was my misunderstanding.

Thanks,
Ryan Park

So is ‘number of streams’ in the diagram above the maximum simultaneous H264 encoder sessions for that hardware? I have a somewhat older laptop with a K1100M. It appears to support 1 GPU and 2 Streams. If I try to create 3 streams I get NV_ENC_ERR_OUT_OF_MEMORY on the 3rd call to nvEncOpenEncodeSessionEx(). If I were on a machine with better hardware would I be able to create 4 or 7 sessions as above? Also is there a programmatic way to determine how many sessions are available prior to calling nvEncOpenEncodeSessionEx()?

No. You (and NVIDIA) are mixing up the API concurrent session limit with the performance estimate expressed in 1080p30/4Kp30 streams.

The number of concurrent encoder sessions is limited by the API. There are only two options: a maximum of 2 sessions, or unrestricted sessions (see https://developer.nvidia.com/video-encode-decode-gpu-support-matrix).

The number of 1080p30/4Kp30 encoded streams is a performance metric that depends on the number of encoders (NVENC), the chip generation (Kepler/Maxwell/Maxwell 2nd gen/Pascal/Volta), the chip frequency and the encoder parameters.

So, for your hardware (K1100M = GK107 = core clock 716 MHz), which has one Kepler hardware encoder (NVENC), NVIDIA limits the API to only 2 sessions. The performance estimates are as follows. For “High Performance”/“Constant QP” (maximum FPS): approx. 219 FPS for 1080p (yes, the hardware can reach this 1080p FPS in one session) == 7 x 1080p30 (but your hardware is limited to two sessions by the API) == 1 x 4Kp30. For “High Quality”/“Dual Pass” (highest quality): approx. 57 FPS for 1080p == 1 x 1080p30 == no 4Kp30.
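
The arithmetic behind those stream counts can be sketched in a few lines of Python. The 219 and 57 FPS figures and the 2-session cap are the ones quoted above; the function name stream_estimate is hypothetical, and the 4K numbers rely on my assumption that 4K (3840x2160) has four times the pixels of 1080p and that encoder throughput scales with pixel count.

```python
TARGET_FPS = 30        # the "p30" in 1080p30 / 4Kp30
API_SESSION_LIMIT = 2  # session cap quoted in this thread for the K1100M

# Assumption (not from the thread): throughput scales with pixel count,
# and 3840x2160 has 4x the pixels of 1920x1080.
PIXEL_RATIO_4K = (3840 * 2160) // (1920 * 1080)


def stream_estimate(fps_1080p, pixel_ratio=1):
    """Whole p30 streams that fit into a measured 1080p frame rate."""
    return int(fps_1080p // (TARGET_FPS * pixel_ratio))


for preset, fps in (("High Performance / Constant QP", 219),
                    ("High Quality / Dual Pass", 57)):
    n_1080 = stream_estimate(fps)
    n_4k = stream_estimate(fps, PIXEL_RATIO_4K)
    usable = min(n_1080, API_SESSION_LIMIT)  # the API cap applies regardless of speed
    print(f"{preset}: {n_1080} x 1080p30, {n_4k} x 4Kp30, "
          f"{usable} usable 1080p30 sessions under the API cap")
```

This keeps the two limits separate: the stream count is what the encoder could sustain in terms of speed, while the session cap is what the API will let you open.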