Turing H.264 Video Encoding Speed and Quality

Originally published at: https://developer.nvidia.com/blog/turing-h264-video-encoding-speed-and-quality/

All NVIDIA GPUs starting with Kepler support fully-accelerated hardware video encoding; GPUs starting with Fermi support fully-accelerated hardware video decoding. The recently released Turing hardware delivered Tensor Cores and better machine learning performance, but the new GPU also incorporated new multimedia features such as an improved NVENC unit to deliver better compression and image quality…

Turing NVENC is very good. We also ran tests and see higher quality than libx264 at the slow and veryslow presets. The difference is bigger for H.264 at High profile, where NVENC is better than libx264 by 10-15%!!! The only sad thing is that the Pascal generation had 2 NVENC engines, so its performance was two times better :(

While this is a good point, you have to also consider that a single Turing NVENC can outperform a single Pascal NVENC in certain applications. Looking at NVIDIA's initial v9 SDK Tests, the single Turing NVENC H.264 1080p encoding performance is between 22% (High Quality) and 30% (Low Latency) faster than a single Pascal NVENC would be. However, as you pointed out, in the 4K HEVC test, the single NVENC encoding performance is the same between both Pascal and Turing (and I would assume Volta as well).

NVIDIA's customers will have to weigh the features and functions they need before deciding which generation of card to purchase. Taking what you pointed out, those who need more NVENC units and do NOT need HEVC B-frame / Ray Tracing / Tensor support would be better off purchasing one or more refurbished Quadro P5000/P6000s or Tesla P4/P40s (with 2x NVENC). And if they don't need 8K HEVC, even refurbished Quadro GP100s or Tesla P100s (with 3x NVENC) might be a good choice if the price is justified. For others who want a mix of the newer technologies, though, I'd likely recommend at least one Turing-based card, while the others in a system could be Pascal, again depending on how many NVENC units are needed.

The main problem with speed is when you use the HQ preset for HEVC (which is needed for low-resolution channels, as it enables 8x8 CUs instead of the 16x16 CUs used by Medium): any Turing card gives you only 1/4 of the P5000's performance (600 fps on the P5000 vs 150 fps on Turing at 1080p).

The second problem is that if you want only NVENC, there is no need to buy anything better than the Quadro RTX 2000 (which is not yet released), as all Quadro GPUs have the same NVENC speed. We liked the model where we paid more for a P5000 to get 2x NVENC instead of the single one in a P4000.

Currently we use Supermicro servers with 4 GPUs each, but with this new generation we will need 4x more GPUs. Yes, they could be cheaper (RTX 4000, or RTX 2000 when released), but we will need to either change all our servers to something like the SuperServer 6049GP-TRT, which can handle 20 GPUs, or have 4x more servers. This will introduce other problems: from our internal tests we found that it is not very stable to use more than 4 GPUs in one server.
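The "4x more GPUs" estimate follows directly from the 600 fps and 150 fps figures quoted above. A rough sketch of that arithmetic in Python; the function name and the choice of one 4-GPU server as the baseline workload are assumptions for illustration, and real capacity will depend on resolution mix, presets and decode load.

```python
import math

# 1080p HEVC HQ-preset throughput per card, as quoted in this post.
P5000_FPS = 600    # Pascal Quadro P5000 (2x NVENC)
TURING_FPS = 150   # any Turing card (1x NVENC)
GPUS_PER_SERVER = 4  # the current server layout described above


def cards_needed(total_fps, fps_per_card):
    """Smallest number of cards whose combined throughput covers total_fps."""
    return math.ceil(total_fps / fps_per_card)


# Assumed baseline: the workload one fully loaded 4x P5000 server handles today.
workload_fps = GPUS_PER_SERVER * P5000_FPS

turing_cards = cards_needed(workload_fps, TURING_FPS)
servers = math.ceil(turing_cards / GPUS_PER_SERVER)

print(f"Turing cards needed: {turing_cards}")
print(f"4-GPU servers needed: {servers}")
```

Under these assumptions one 4-GPU Pascal server turns into 16 Turing cards, which is either four 4-GPU servers or most of one 20-GPU chassis.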

A quality increase was expected, as it is now 2019, but we didn't expect such a drop in performance.

This will make the GPU NVENC solution much more expensive, and when AMD releases the new Epyc 2 CPUs there will be no difference between GPU and CPU transcoding performance: the speed of 1 Turing NVENC at the HQ preset <= 1 AMD Epyc 2 with 32 cores running libx265 at the Medium preset.

These are all valid points. You would think NVIDIA would consider making cards similar to Teslas (but specialized just for video applications) that offer multiple NVENCs / NVDECs without all the other features at a lower price point. NVIDIA really needs to consider your point about the cost of purchasing Epyc 2s versus Quadro RTX 2000s / 4000s. While it might make sense at small scale (mobile / desktop), as you said the cost isn't justified for workstations / servers, especially beyond 4 GPUs in a single system.

In my case, I use a video switcher application that only supports Intel QuickSync and NVIDIA NVENC/NVDEC. I'm considering the purchase of one or more refurbished P5000s (for around US$ 1,250 each), and adding a Turing GPU after the Quadro RTX 2000 is released, once I have a justified need for the features Turing offers.

Great discussion, team. Which is the best transcoding card I could buy to install in my Supermicro server? Right now I am using an M6000 and want to upgrade so I can transcode more AVC services in 1080p.

Dear Roman, thanks for the interesting results. A couple of questions:
1. Why did you run x264 without the lookahead option for "High quality"? It is hard to compare the quality of encoders when one of them is started with different options.
2. For x264 you set -threads 4. But your CPU is "Dual Intel Xeon E5-2660v3 @ 2.6 GHz", where each CPU has 10 physical cores. I'd say that "-threads 10" looks more appropriate here for a performance comparison.

Hello Vasily,
Thank you for the kind talk.

>Why did you run x264 without lookahead option for "High quality"?
libx264 uses 40 frames lookahead by default in medium preset, so there's no need to specify that.

>I'd say that "-threads 10" looks more appropriate
We observed some time ago that with a larger number of threads, libx264 sometimes produces a bitstream whose bitrate is lower than the one set from the CLI. It's not a big deal for desktop CPUs, but for, say, 20 threads on a server-grade CPU it really becomes an issue. I've not checked whether this is fixed in more recent libx264 releases, however.
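One way to check for this kind of undershoot is to compare the bitrate requested on the command line with the bitrate actually achieved, derived from the output size and duration. A minimal sketch in Python; the function names, the 5% tolerance and the sample numbers are assumptions for illustration, not values from the tests discussed here. Note that a file size which includes container overhead will slightly overstate the video bitrate.

```python
def achieved_bitrate_kbps(size_bytes, duration_s):
    """Average bitrate in kbit/s computed from stream size and duration."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return size_bytes * 8 / duration_s / 1000


def undershoot_fraction(target_kbps, actual_kbps):
    """Fraction by which the actual bitrate falls below the target (0 if not below)."""
    return max(0.0, (target_kbps - actual_kbps) / target_kbps)


# Hypothetical example: a 60 s clip encoded with a 5000 kbit/s target
# whose output stream is 33,750,000 bytes.
target = 5000
actual = achieved_bitrate_kbps(33_750_000, 60)  # 4500.0 kbit/s
ratio = undershoot_fraction(target, actual)     # 0.1, i.e. 10% below target

TOLERANCE = 0.05  # assumed acceptable deviation, not a libx264 guarantee
print(f"actual={actual:.0f} kbit/s, undershoot={ratio:.1%}, ok={ratio <= TOLERANCE}")
```

Running the same check on encodes made with different -threads values would show whether the undershoot grows with the thread count on a given libx264 build.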

OK, I see, thanks. I raised the threading question because you should see a different FPS with 10-20 threads, which affects the diagram showing the number of simultaneous streams for x264.

Hello, is it possible to use h264_nvenc with -profile:v baseline -level 3.0?
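The thread does not answer this. FFmpeg's h264_nvenc encoder lists baseline among its profiles and 3.0 among its levels, so the flags in the question should at least be accepted; whether a particular GPU, driver and FFmpeg build honours them is best verified locally (for example with `ffmpeg -h encoder=h264_nvenc`). Independently of the encoder, H.264 Level 3 caps the frame size and macroblock rate, so the source has to fit those limits. A small Python sketch of that check; the helper names and the file names are placeholders, not part of any tool.

```python
import math

# H.264 Level 3 limits (Table A-1 of the H.264 specification).
MAX_FRAME_MBS = 1620     # maximum frame size, in 16x16 macroblocks
MAX_MBS_PER_SEC = 40500  # maximum macroblock processing rate


def fits_level_3(width, height, fps):
    """True if the given resolution and frame rate fit within Level 3."""
    mbs = math.ceil(width / 16) * math.ceil(height / 16)
    return mbs <= MAX_FRAME_MBS and mbs * fps <= MAX_MBS_PER_SEC


def nvenc_baseline_args(src, dst):
    """FFmpeg argument list using the flags from the question (untested here)."""
    return ["ffmpeg", "-i", src, "-c:v", "h264_nvenc",
            "-profile:v", "baseline", "-level", "3.0", dst]


print(fits_level_3(720, 480, 30))    # SD NTSC fits exactly
print(fits_level_3(1280, 720, 30))   # 720p exceeds the frame-size limit
print(" ".join(nvenc_baseline_args("input.mp4", "output.mp4")))
```

In practice this means Level 3.0 is only meaningful for SD-sized output (up to 720x576 at 25 fps or 720x480 at 30 fps), so larger sources would need scaling first.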