I have a Dell R720xd server with 160GB RAM, dual 8-core 2.9GHz processors, 12x2TB hard drives in RAID 10, and a newly installed Quadro M4000 with 8GB of memory.
I want to increase the speed of the encodes I run on a regular basis, and I am trying to find the limits and the bottlenecks. I tried running 32 simultaneous encodes, which used up about 5GB of the RAM, and the Volatile GPU-Util was around 20%. I realized it wasn’t going as fast as I expected because the HDD was only reporting about 1.5Mb/s write speed, whereas when I scale back to 6 processes I get over 3Mb/s with nearly the same GPU-Util as with 32.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro M4000        Off  | 0000:42:00.0     Off |                  N/A |
| 57%   69C    P0    48W / 120W |    906MiB /  8120MiB |     18%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      9286    C   ffmpeg                                         150MiB |
|    0      9364    C   ffmpeg                                         150MiB |
|    0      9433    C   ffmpeg                                         150MiB |
|    0      9477    C   ffmpeg                                         150MiB |
|    0      9537    C   ffmpeg                                         150MiB |
|    0      9679    C   ffmpeg                                         150MiB |
+-----------------------------------------------------------------------------+
I am trying to figure out what is slowing it down, as it doesn’t seem like it should be anything BUT the GPU, since nothing else seems to be maxed out. As more files come in from my users that need to be processed, I don’t know how I am going to scale the solution so that it doesn’t start falling behind.
If I want it to go faster, how do I find what the bottleneck is? Do I need another GPU? Do I need SSDs? 12Gb/s SAS?
I know the CUDA SDK comes with some samples. Is there one that will find a bottleneck?
Mbit/s or MByte/s? I am not familiar with video transcoding. Typical sequential HDD throughput is 150 MB/sec, consumer SSD 600 MB/sec, enterprise SSD 1.5 GB/sec, server system memory throughput 40-60 GB/sec, PCIe gen 3 x16 about 11 GB/sec per direction, GPU memory throughput 50-500 GB/sec. So to first order it doesn’t look to me like you would be limited by the raw throughput of the chain of hardware devices, even considering combined read + write traffic.
A reasonable working hypothesis therefore is that you are limited by the computational throughput of the video encoders/decoders of the GPU. I assume your software uses NVENC, which is a separate hardware module unrelated to CUDA?
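As a rough check of the first-order argument above, the sketch below divides each quoted throughput by an observed write rate. It assumes the roughly 3 "Mb/s" reported earlier means MByte/s, takes the lower bound wherever a range was quoted, and uses illustrative names that do not come from any tool.

```python
# Approximate throughput figures quoted above, converted to MB/s.
# Where a range was given, the lower bound is used.
DEVICE_MB_PER_S = {
    "sequential HDD": 150,
    "consumer SSD": 600,
    "enterprise SSD": 1_500,
    "PCIe gen 3 x16 (per direction)": 11_000,
    "server system memory": 40_000,
    "GPU memory": 50_000,
}


def headroom(observed_mb_per_s, device_mb_per_s=DEVICE_MB_PER_S):
    """Return how many times faster each device is than the observed rate."""
    return {
        name: rate / observed_mb_per_s
        for name, rate in device_mb_per_s.items()
    }


if __name__ == "__main__":
    # Assumption: the observed write rate is about 3 MByte/s.
    for name, factor in headroom(3.0).items():
        print(f"{name}: about {factor:.0f}x the observed write rate")
```

Under that assumption even the slowest link in the chain, a single sequential HDD, has about 50x headroom, which is consistent with looking at the encoder rather than at storage or the bus.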
Sorry for my careless error; it was in fact MByte/s. And yes, I can easily see it reaching 150 MB/s during a file copy, even while it is in the middle of encoding 5 videos. While looking at some other threads I also learned a new command, nvidia-smi -a, which gives a lot more info:
==============NVSMI LOG==============
Timestamp : Mon Nov 21 10:56:52 2016
Driver Version : 367.48
Attached GPUs : 1
GPU 0000:42:00.0
Product Name : Quadro M4000
Product Brand : Quadro
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Enabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 1920
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0320116050404
GPU UUID : GPU-4366507c-579b-62d6-71b3-f9791dd6c3ff
Minor Number : 0
VBIOS Version : 84.04.70.00.07
MultiGPU Board : No
Board ID : 0x4200
GPU Part Number : N/A
Inforom Version
Image Version : G400.0501.01.03
OEM Object : 1.1
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization mode : None
PCI
Bus : 0x42
Device : 0x00
Domain : 0x0000
Device Id : 0x13F110DE
Bus Id : 0000:42:00.0
Sub System Id : 0x115310DE
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 0
Tx Throughput : 152000 KB/s
Rx Throughput : 1169000 KB/s
Fan Speed : 54 %
Performance State : P0
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Sync Boost : Not Active
Unknown : Not Active
FB Memory Usage
Total : 8120 MiB
Used : 514 MiB
Free : 7606 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 3 MiB
Free : 253 MiB
Compute Mode : Default
Utilization
Gpu : 24 %
Memory : 5 %
Encoder : 100 %
Decoder : 0 %
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending : N/A
Temperature
GPU Current Temp : 62 C
GPU Shutdown Temp : 104 C
GPU Slowdown Temp : 99 C
Power Readings
Power Management : Supported
Power Draw : 47.19 W
Power Limit : 120.00 W
Default Power Limit : 120.00 W
Enforced Power Limit : 120.00 W
Min Power Limit : 10.00 W
Max Power Limit : 120.00 W
Clocks
Graphics : 772 MHz
SM : 772 MHz
Memory : 3004 MHz
Video : 712 MHz
Applications Clocks
Graphics : 772 MHz
Memory : 3005 MHz
Default Applications Clocks
Graphics : 772 MHz
Memory : 3005 MHz
Max Clocks
Graphics : 772 MHz
SM : 772 MHz
Memory : 3005 MHz
Video : 710 MHz
Clock Policy
Auto Boost : On
Auto Boost Default : On
Processes
Process ID : 33949
Type : C
Name : ffmpeg
Used GPU Memory : 150 MiB
Process ID : 34009
Type : C
Name : ffmpeg
Used GPU Memory : 81 MiB
Process ID : 34035
Type : C
Name : ffmpeg
Used GPU Memory : 81 MiB
Process ID : 34076
Type : C
Name : ffmpeg
Used GPU Memory : 115 MiB
Process ID : 34116
Type : C
Name : ffmpeg
Used GPU Memory : 81 MiB
In particular, I noticed that the utilization is broken down further, showing Encoder at 100% while Gpu is only 24%.
I backed off the number of sessions I was encoding, and the Encoder utilization didn’t move away from 100% until I got down to about 3-4, but I didn’t see any change in the speed of the videos being encoded as I terminated the extra processes.
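To watch the Encoder figure while changing the number of sessions, the Utilization block can be pulled out of the nvidia-smi -a text. This is a minimal sketch that assumes the layout shown in the log above ("Name : N %" lines following a "Utilization" header); the function name and the sample string are illustrative only.

```python
import re


def parse_utilization(smi_text):
    """Extract the Utilization block from `nvidia-smi -a` text as {name: percent}."""
    result = {}
    in_block = False
    for raw_line in smi_text.splitlines():
        line = raw_line.strip()  # tolerate the indentation of real output
        if line == "Utilization":
            in_block = True
            continue
        if in_block:
            match = re.fullmatch(r"(\w+)\s*:\s*(\d+)\s*%", line)
            if match is None:
                break  # first line that is not "Name : N %" ends the block
            result[match.group(1)] = int(match.group(2))
    return result


# Excerpt copied from the log above.
SAMPLE = """\
Compute Mode : Default
Utilization
Gpu : 24 %
Memory : 5 %
Encoder : 100 %
Decoder : 0 %
Ecc Mode
Current : N/A
"""

if __name__ == "__main__":
    print(parse_utilization(SAMPLE))
```

On a live system the text could come from subprocess.run(["nvidia-smi", "-a"], capture_output=True, text=True).stdout, sampled once per change in the number of ffmpeg processes.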
To explain - if they were being encoded at 90fps with 8 processes, and I scaled back to 4, the remaining 4 would still be around 90fps, even when the encoder percentage dropped to 95%. Sure it may have budged to 95-100fps, but the total throughput dropped a lot since originally I had 8x90 = 720fps, and when I drop to 4 processes it would be 4x100 = 400fps. So that makes me wonder if that particular metric is relevant - or being correctly reported.
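The arithmetic above can be written out as a small sketch. The frame rates are the approximate ones quoted in this post, and aggregate_fps is just an illustrative helper name.

```python
def aggregate_fps(processes, fps_per_process):
    """Total frames per second across all concurrent encodes."""
    return processes * fps_per_process


if __name__ == "__main__":
    with_eight = aggregate_fps(8, 90)   # 8 encodes at about 90 fps each
    with_four = aggregate_fps(4, 100)   # 4 encodes at about 100 fps each
    lost = (with_eight - with_four) / with_eight
    print(f"8 processes: {with_eight} fps total")
    print(f"4 processes: {with_four} fps total")
    print(f"total throughput lost by halving the processes: {lost:.0%}")
```

Each remaining process gains only about 10 fps while the total drops from 720 to 400 fps, which is why the 95-100% Encoder reading looks hard to reconcile with the measured throughput.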
Perhaps it is like the Linux top command, where the “load” is reported per core, so on a quad-core processor with hyperthreading, 100% utilization would actually be a load of 8.
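For the CPU side of that analogy, the load average can be normalized by the number of logical CPUs. This is only a sketch of the comparison being described, with an illustrative helper name; it says nothing about how the GPU's Encoder percentage is actually computed.

```python
import os


def load_per_cpu(load_average, logical_cpus):
    """Normalize a load average so that 1.0 means every logical CPU is busy."""
    return load_average / logical_cpus


if __name__ == "__main__":
    one_minute_load = os.getloadavg()[0]  # available on Unix-like systems
    cpus = os.cpu_count() or 1            # cpu_count() may return None
    print(f"load {one_minute_load:.2f} on {cpus} logical CPUs "
          f"= {load_per_cpu(one_minute_load, cpus):.0%} of capacity")
```

With the quad-core, hyperthreaded example above, load_per_cpu(8.0, 8) gives 1.0, that is, full utilization.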
If this GPU has multiple cores, maybe it is reporting 100% for the first core while it has plenty more to give. But if that is true, by what metric can I evaluate whether it can handle additional work?