Identify bottleneck on simultaneous encodes using Quadro M4000

ajhalls · November 16, 2016, 9:59pm

I have a Dell R720xd server with 160GB RAM, dual 8-core 2.9GHz processors, 12x2TB hard drives in a RAID 10 array, and a newly installed Quadro M4000 with 8GB of memory.

I want to speed up the encoding jobs I run on a regular basis, and I am trying to find the limits and the bottlenecks. I tried running 32 simultaneous encodes, which used up about 5GB of the RAM while the Volatile GPU-Util sat around 20%. I realized it wasn’t running as fast as I expected because the HDD was only reporting about 1.5Mb/s write speed, whereas when I scale back to 6 processes I get over 3Mb/s at nearly the same GPU-Util as with 32.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro M4000        Off  | 0000:42:00.0     Off |                  N/A |
| 57%   69C    P0    48W / 120W |    906MiB /  8120MiB |     18%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      9286    C   ffmpeg                                         150MiB |
|    0      9364    C   ffmpeg                                         150MiB |
|    0      9433    C   ffmpeg                                         150MiB |
|    0      9477    C   ffmpeg                                         150MiB |
|    0      9537    C   ffmpeg                                         150MiB |
|    0      9679    C   ffmpeg                                         150MiB |
+-----------------------------------------------------------------------------+

I am trying to figure out what is slowing it down. It doesn’t seem like it should be anything but the GPU, since nothing else seems to be maxed out. As more files come in from my users to be processed, I don’t know how I am going to scale the solution so that it doesn’t start falling behind.

If I want it to go faster, how do I find what the bottleneck is? Do I need another GPU? Do I need an SSD? 12Gb/s SAS?

I know the CUDA SDK comes with some samples. Is there one that will find a bottleneck?

ajhalls · November 18, 2016, 6:00pm

Just hoping someone will see this.

njuffa · November 18, 2016, 9:45pm

Mbit/s or MByte/s? I am not familiar with video transcoding. Typical sequential HDD throughput is 150 MB/sec, consumer SSD 600 MB/sec, enterprise SSD 1.5 GB/sec, server system memory throughput 40-60 GB/sec, PCIe gen 3 x16 about 11 GB/sec per direction, GPU memory throughput 50-500 GB/sec. So to first order it doesn’t look to me like you would be limited by the raw throughput of the chain of hardware devices, even considering combined read + write traffic.
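To make the first-order comparison concrete, here is a minimal sketch that divides each quoted peak figure by an observed data rate. It assumes the reported rate is about 3 MByte/s (if it were Mbit/s, the headroom would be larger still) and uses the lower end of each range above; the names `PEAK_MB_PER_S` and `headroom` are made up for illustration.

```python
# Approximate peak throughputs quoted above, in MB/s (lower end of each range).
PEAK_MB_PER_S = {
    "sequential HDD": 150,
    "consumer SSD": 600,
    "enterprise SSD": 1500,
    "PCIe gen 3 x16 (per direction)": 11_000,
    "server system memory": 40_000,
    "GPU memory": 50_000,
}


def headroom(observed_mb_per_s):
    """Return how many times larger each stage's peak is than the observed rate."""
    return {stage: peak / observed_mb_per_s for stage, peak in PEAK_MB_PER_S.items()}


if __name__ == "__main__":
    observed = 3.0  # ASSUMPTION: the reported write rate, taken as MByte/s
    for stage, factor in headroom(observed).items():
        print(f"{stage:32s} {factor:8.0f}x the observed rate")
```

Even the slowest stage in the chain comes out far above the observed rate, which is why raw device throughput looks like an unlikely limit.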

A reasonable working hypothesis therefore is that you are limited by the computational throughput of the video encoders/decoders of the GPU. I assume your software uses NVENC, which is a separate hardware module unrelated to CUDA?

ajhalls · November 21, 2016, 4:05pm

Sorry for my careless error, it was in fact MByte/s. And yes, I can easily see it reach 150 MB/s during a file copy, even while it is in the middle of encoding 5 videos. Looking at some other threads, I also learned a new command, nvidia-smi -a, which gives a lot more info:

==============NVSMI LOG==============

Timestamp                           : Mon Nov 21 10:56:52 2016
Driver Version                      : 367.48

Attached GPUs                       : 1
GPU 0000:42:00.0
    Product Name                    : Quadro M4000
    Product Brand                   : Quadro
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Enabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 1920
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : 0320116050404
    GPU UUID                        : GPU-4366507c-579b-62d6-71b3-f9791dd6c3ff
    Minor Number                    : 0
    VBIOS Version                   : 84.04.70.00.07
    MultiGPU Board                  : No
    Board ID                        : 0x4200
    GPU Part Number                 : N/A
    Inforom Version
        Image Version               : G400.0501.01.03
        OEM Object                  : 1.1
        ECC Object                  : N/A
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    GPU Virtualization Mode
        Virtualization mode         : None
    PCI
        Bus                         : 0x42
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x13F110DE
        Bus Id                      : 0000:42:00.0
        Sub System Id               : 0x115310DE
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 3
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays since reset         : 0
        Tx Throughput               : 152000 KB/s
        Rx Throughput               : 1169000 KB/s
    Fan Speed                       : 54 %
    Performance State               : P0
    Clocks Throttle Reasons
        Idle                        : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
        Sync Boost                  : Not Active
        Unknown                     : Not Active
    FB Memory Usage
        Total                       : 8120 MiB
        Used                        : 514 MiB
        Free                        : 7606 MiB
    BAR1 Memory Usage
        Total                       : 256 MiB
        Used                        : 3 MiB
        Free                        : 253 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 24 %
        Memory                      : 5 %
        Encoder                     : 100 %
        Decoder                     : 0 %
    Ecc Mode
        Current                     : N/A
        Pending                     : N/A
    ECC Errors
        Volatile
            Single Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                Total               : N/A
            Double Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                Total               : N/A
        Aggregate
            Single Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                Total               : N/A
            Double Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                Total               : N/A
    Retired Pages
        Single Bit ECC              : N/A
        Double Bit ECC              : N/A
        Pending                     : N/A
    Temperature
        GPU Current Temp            : 62 C
        GPU Shutdown Temp           : 104 C
        GPU Slowdown Temp           : 99 C
    Power Readings
        Power Management            : Supported
        Power Draw                  : 47.19 W
        Power Limit                 : 120.00 W
        Default Power Limit         : 120.00 W
        Enforced Power Limit        : 120.00 W
        Min Power Limit             : 10.00 W
        Max Power Limit             : 120.00 W
    Clocks
        Graphics                    : 772 MHz
        SM                          : 772 MHz
        Memory                      : 3004 MHz
        Video                       : 712 MHz
    Applications Clocks
        Graphics                    : 772 MHz
        Memory                      : 3005 MHz
    Default Applications Clocks
        Graphics                    : 772 MHz
        Memory                      : 3005 MHz
    Max Clocks
        Graphics                    : 772 MHz
        SM                          : 772 MHz
        Memory                      : 3005 MHz
        Video                       : 710 MHz
    Clock Policy
        Auto Boost                  : On
        Auto Boost Default          : On
    Processes
        Process ID                  : 33949
            Type                    : C
            Name                    : ffmpeg
            Used GPU Memory         : 150 MiB
        Process ID                  : 34009
            Type                    : C
            Name                    : ffmpeg
            Used GPU Memory         : 81 MiB
        Process ID                  : 34035
            Type                    : C
            Name                    : ffmpeg
            Used GPU Memory         : 81 MiB
        Process ID                  : 34076
            Type                    : C
            Name                    : ffmpeg
            Used GPU Memory         : 115 MiB
        Process ID                  : 34116
            Type                    : C
            Name                    : ffmpeg
            Used GPU Memory         : 81 MiB

In particular, I noticed that the GPU utilization is broken down further:

Utilization
        Gpu                         : 24 %
        Memory                      : 5 %
        Encoder                     : 100 %
        Decoder                     : 0 %
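For watching these four numbers over time, a small parser over the text report may be handy. This is only a sketch: it assumes the report keeps the "Utilization" heading followed by "Name : NN %" lines exactly as in the log above, and `parse_utilization` and `SAMPLE` are names invented for this example.

```python
import re

# Excerpt in the same layout as the nvidia-smi -a log above.
SAMPLE = """\
    Utilization
        Gpu                         : 24 %
        Memory                      : 5 %
        Encoder                     : 100 %
        Decoder                     : 0 %
    Ecc Mode
"""


def parse_utilization(report):
    """Extract the percentages listed under the 'Utilization' heading."""
    values = {}
    in_section = False
    for line in report.splitlines():
        stripped = line.strip()
        if stripped == "Utilization":
            in_section = True
            continue
        if in_section:
            match = re.fullmatch(r"(\w+)\s*:\s*(\d+)\s*%", stripped)
            if match is None:
                break  # first line that is not "Name : NN %" ends the section
            values[match.group(1).lower()] = int(match.group(2))
    return values


if __name__ == "__main__":
    print(parse_utilization(SAMPLE))
```

On a live system the report text could come from something like `subprocess.run(["nvidia-smi", "-a"], capture_output=True, text=True).stdout`, sampled once per encode count being tested.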

I backed off the number of sessions I was encoding, and the encoder figure didn’t move away from 100% until I got down to about 3-4. However, I didn’t see any change in the speed of the videos being encoded as I terminated the extra processes.

To explain: if each video was being encoded at 90fps with 8 processes and I scaled back to 4, the remaining 4 would still run at around 90fps, even when the encoder percentage dropped to 95%. Sure, it may have budged to 95-100fps, but the total throughput dropped a lot: originally I had 8x90 = 720fps, and with 4 processes it would be 4x100 = 400fps. That makes me wonder whether that particular metric is relevant, or whether it is being reported correctly.
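The arithmetic above, as a small sketch (the per-process frame rates are the rough figures quoted, not precise measurements, and `aggregate_fps` is just an illustrative helper):

```python
def aggregate_fps(sessions, fps_per_session):
    """Total frames per second across all simultaneous encodes."""
    return sessions * fps_per_session


eight = aggregate_fps(8, 90)    # 8 processes at about 90 fps each
four = aggregate_fps(4, 100)    # 4 processes at about 100 fps each

# Per-process rate the 4 remaining encodes would need to keep the same total.
needed_per_session = eight / 4

# Fraction of total throughput lost by dropping from 8 to 4 processes.
drop = 1 - four / eight

if __name__ == "__main__":
    print(f"8 processes: {eight} fps total")
    print(f"4 processes: {four} fps total")
    print(f"needed per process at 4 to match: {needed_per_session:.0f} fps")
    print(f"throughput lost: {drop:.0%}")
```

If the encoder had really been saturated at 8 processes, each of the 4 remaining encodes would be expected to speed up toward 180fps rather than stay near 100fps.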

Perhaps it is like the Linux top command, where the “load” is reported per core, so on a quad-core processor with hyperthreading, 100% utilization would actually be a load of 8.
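A rough sketch of that analogy: dividing the load average by the number of logical CPUs gives a figure where 1.0 means every logical CPU is busy. `normalized_load` is a name made up for this example, and `os.getloadavg()` is only available on Unix-like systems.

```python
import os


def normalized_load(load_average, logical_cpus):
    """Scale a load average so that 1.0 means all logical CPUs are busy."""
    return load_average / logical_cpus


if __name__ == "__main__":
    # Quad core with hyperthreading: 8 logical CPUs, so a load of 8 is "100%".
    print(normalized_load(8.0, 8))

    # The same figure for the machine this runs on (1-minute load average).
    one_minute, _, _ = os.getloadavg()
    print(normalized_load(one_minute, os.cpu_count() or 1))
```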

If this GPU is multicore, maybe it is reporting 100% for the first core while it has plenty more to give. But if that is true, by what metric can I evaluate whether it can handle additional work?
