Jetson AGX h.265 encode latency

Good morning!

I am working on a Jetson AGX application for my client. At its core are four cameras. Each one is captured at 30 FPS using Argus, and the YUV images of the frames are fed into NvEncoder. There is a shared frame pool, so the interface is zero-copy. NvEncoder is configured to produce an h.265 stream with our custom configuration (slowest encoding preset, among other specifics). The application is very latency-sensitive; we strive for the lowest latency possible.

Now, the first version of the application consisted of four separate processes, each one capturing video from one camera using one Argus session and feeding one encoder, like this:

  • Wait for the next frame from the camera session
  • Feed the frame into the encoder
  • Wait for the bitstream to be available at the encoder output

When we measured encoder latency (basically the time it took to complete the last two steps), it was pretty stable at 10 ms or so.
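
For reference, here is a minimal sketch of that per-frame loop and how we time it. The frame/encoder calls are hypothetical placeholders for the real Argus/NvEncoder plumbing, not actual API names:

```cpp
#include <chrono>
#include <cstdio>

// Hypothetical placeholders standing in for the real Argus/NvEncoder calls.
struct Frame {};                                      // zero-copy YUV buffer from the pool
static Frame* waitForFrame()     { return nullptr; }  // Argus: acquire the next capture
static void   feedFrame(Frame*)  {}                   // NvEncoder: queue the YUV frame
static void   waitForBitstream() {}                   // NvEncoder: dequeue the bitstream

int main() {
    using clock = std::chrono::steady_clock;
    for (;;) {
        Frame* frame = waitForFrame();
        auto t0 = clock::now();                       // start of the measured window
        feedFrame(frame);
        waitForBitstream();
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                      clock::now() - t0).count();
        std::printf("encode latency: %lld us\n", (long long)us);  // ~10 ms per frame here
    }
}
```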

Right now we are working on the second version of the application. Among other things, we would like the option to software-synchronise the cameras, so now a process can create an Argus session with two or three cameras, and the frames are fed to separate encoder instances at the same time. For example, with two cameras I have:

  • Wait for the next set of frames from the recording session
  • Feed frame 0 into encoder 0
  • Feed frame 1 into encoder 1
  • Wait for the bitstream to be available from encoder 0 output
  • Wait for the bitstream to be available from encoder 1 output
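
To make the measurement concrete, here is a sketch of that loop with per-stream timestamps (again with hypothetical placeholder names rather than the real API):

```cpp
#include <chrono>
#include <cstdio>

// Hypothetical placeholders for the multi-sensor Argus session and the two NvEncoders.
struct FrameSet {};                                    // one frame per camera
static FrameSet* waitForFrameSet()                 { return nullptr; }
static void      feedFrame(int /*enc*/, FrameSet*) {}
static void      waitForBitstream(int /*enc*/)     {}

int main() {
    using clock = std::chrono::steady_clock;
    using us    = std::chrono::microseconds;
    for (;;) {
        FrameSet* frames = waitForFrameSet();
        auto t0 = clock::now();                        // both feeds start here
        feedFrame(0, frames);
        feedFrame(1, frames);
        waitForBitstream(0);
        auto t1 = clock::now();                        // ~15 ms in our measurements
        waitForBitstream(1);
        auto t2 = clock::now();                        // ~4+ ms after output 0
        std::printf("stream0: %lld us, stream1: %lld us\n",
                    (long long)std::chrono::duration_cast<us>(t1 - t0).count(),
                    (long long)std::chrono::duration_cast<us>(t2 - t0).count());
    }
}
```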

Now, when we measure the encoding latency for stream 0 and stream 1, we get about 15 ms between feed 0 and output 0, and some 4+ ms more before output 1 becomes available. To be honest, this is a serious problem for us, because we thought the h.265 video encoder of the Jetson AGX could operate in parallel, i.e., we expected both the stream 0 and stream 1 bitstream outputs to be available after 10 ms.

At first, we thought that we were doing something wrong (the single-process camera encoders reported 10 ms, after all). We did a lot of testing, and it turned out that the 10 ms latency is only possible if the hardware is dedicated solely to one process. When the single-process encoders do their work at the same time, we get a very similar latency pattern, with one of the encoders taking more time to encode.

We also performed throughput testing, and it seems that the AGX can perform some 230 frames/second for our stream configuration, i.e., roughly 1000/230 ≈ 4.3 ms/frame (obviously when working in parallel, in saturated fashion).

So our conclusions are:

  • At least part of the hardware encoder is “shared” between separate NvEncoder instances, even if they live in separate processes. That shared part causes the physical encoding to become “serialised”, and the encoder that starts encoding later has to wait for the encoder that started earlier.
  • Therefore, in order to get the lowest possible latency for a single frame, only a single encoder instance should be encoding at any given time (sketched after this list).
  • On the other hand, getting maximum throughput requires several encoders with enough frames supplied that the encoding engine is busy all the time.
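
If the second conclusion holds, one (untested) way to trade throughput for per-frame latency would be to serialise submissions ourselves, e.g. a lock held across feed + wait. A minimal sketch with placeholder encoder calls:

```cpp
#include <mutex>

// Only one frame in flight across all encoder instances in this process.
// feedFrame()/waitForBitstream() are placeholders for the real NvEncoder calls.
static std::mutex g_encodeSlot;

void encodeOneFrame(int encoderId /*, Frame* frame */) {
    std::lock_guard<std::mutex> lock(g_encodeSlot);
    (void)encoderId;                  // used by the placeholder calls below
    // feedFrame(encoderId, frame);      // submit the YUV frame to this encoder
    // waitForBitstream(encoderId);      // block until its bitstream is out
}
```

(Across separate processes the same idea would need an inter-process lock, e.g. a named semaphore.)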

I will be very grateful for any comments, hints or explanations related to the above, especially with respect to how to get the lowest frame latency possible for several h.265 streams at the same time.

Best
Michal

Hi,
Please run the script to enable the system in maximum performance mode:
VPI - Vision Programming Interface: Performance Benchmark

And also enable max perf mode for the encoder. There are two hardware encoder engines in Xavier. Please run sudo tegrastats to make sure both are enabled in your application.

For Argus + video encoding, you may refer to the sample:

/usr/src/jetson_multimedia_api/samples/10_camera_recording

For multi video encoding, please refer to

/usr/src/jetson_multimedia_api/samples/15_multivideo_encode

Thanks for the reply! I did the following:

  1. Tried the encoding in both --max and --restore modes. Performance seems unaffected.
  2. Made sure I was calling setMaxPerfMode(1) on all encoders. Yeah, I was. When I tried doing the same with setMaxPerfMode(0) it was much slower.
  3. Ran sudo tegrastats while the encoder was running and got the following:

```
RAM 4559/31930MB (lfb 38x4MB) SWAP 302/15965MB (cached 1MB) CPU [20%@1190,16%@1190,1%@1190,1%@1267,2%@1341,2%@1343,7%@1550,10%@1574] EMC_FREQ 4%@2133 GR3D_FREQ 0%@318 NVENC 1075 NVENC1 1075 VIC_FREQ 0%@1036 APE 150 MTS fg 0% bg 7% AO@38C GPU@37.5C iwlwifi@40C Tdiode@40.75C PMIC@50C AUX@38.5C CPU@39C thermal@38.35C Tboard@36C GPU 309/309 CPU 773/773 SOC 5566/5566 CV 0/0 VDDRQ 463/463 SYS5V 3295/3295
```

I assume that `NVENC 1075 NVENC1 1075` means that I have two encode engines running at 1075 MHz each.
Unfortunately the problem persists...

  4. I know and have studied both examples. I will instrument 15_multivideo_encode to test whether it suffers the same performance penalty (increased latency) when encoding more than one stream.

Thanks!
Michał

Thanks, @DaneLLL. A couple of follow-up questions:

  1. Is the VIC freq tied to the NVENC freq, i.e., will assigning maxfreq to vicfreqctrl also maximize the NVENC encoder freq on the Jetson AGX Xavier? Or are there different parts of the /sys/devices tree to write to in order to maximize performance of the NVENC encoder specifically?
  2. How many streams can be processed truly in parallel by the hardware encoder engines? Is that two, because of the two engines, or can each engine also encode multiple streams truly in parallel?
  3. We ran 15_multivideo_encode with the default ULTRAFAST encoder preset, followed by another run with the SLOW encoder setting, and measured execution time with the time tool, like this:
$ time multivideo_encode num_files 4 ~/file.yuv 1920 1080 H265 ~/file1.h265 ~/file.yuv 1920 1080 H265 ~/file2.h265 ~/file.yuv 1920 1080 H265 ~/file3.h265 ~/file.yuv 1920 1080 H265 ~/file4.h265

For different values of num_files, the results are below:

Ultrafast, num_files=1: real 0m12.918s
Ultrafast, num_files=2: real 0m10.291s
Ultrafast, num_files=3: real 0m12.071s
Ultrafast, num_files=4: real 0m12.295s

Code changed to preset “slow” (a one-line switch, sketched after the results below), and re-ran multivideo_encode as above:

Slow, num_files=1: real 0m25.962s
Slow, num_files=2: real 0m26.466s
Slow, num_files=3: real 0m35.490s
Slow, num_files=4: real 0m47.256s
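
For reference, the preset is selected via NvVideoEncoder's setHWPresetType (names from jetson_multimedia_api), so the switch is a one-liner along these lines, with the surrounding encoder setup assumed:

```cpp
#include "NvVideoEncoder.h"  // jetson_multimedia_api

// Select the hardware preset; enum values are from v4l2_nv_extensions.h.
void selectPreset(NvVideoEncoder* enc, bool slow) {
    enc->setHWPresetType(slow ? V4L2_ENC_HW_PRESET_SLOW
                              : V4L2_ENC_HW_PRESET_ULTRAFAST);
}
```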

It appears that the SLOW preset is not able to process more than 2 streams in parallel. Are there specific reasons (HW chokepoints) for the SLOW versus ULTRAFAST setting that would explain this behavior?

  4. For the ULTRAFAST, FAST, MEDIUM, and SLOW presets, are you able to share what these settings individually enable/disable for the encoder?
  5. Should the FAST & MEDIUM settings have better parallel performance? What do you lose with those compared to SLOW?

Thanks!

Hi,
The presets are for different performance and quality tradeoffs. Certain encoding filters that enhance video quality are disabled for better throughput in the ULTRAFAST and FAST modes. You need to select a mode that balances video quality and encoding performance.

And please also try CBR + setting the virtual buffer size:
Random blockiness in the picture RTSP server-client - Jetson TX2 - #5 by DaneLLL
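
For illustration, on NvVideoEncoder that suggestion maps to calls like the following (method names from jetson_multimedia_api; the bitrate and buffer-size values are placeholders, not recommendations):

```cpp
#include "NvVideoEncoder.h"   // jetson_multimedia_api
#include <linux/videodev2.h>

// Hedged sketch: CBR rate control plus an explicit virtual buffer size.
void configureCbr(NvVideoEncoder* enc) {
    enc->setRateControlMode(V4L2_MPEG_VIDEO_BITRATE_MODE_CBR);  // constant bitrate
    enc->setBitrate(8 * 1000 * 1000);                 // e.g. 8 Mbit/s (placeholder)
    enc->setVirtualBufferSize(8 * 1000 * 1000 / 30);  // ~one frame's worth of bits
}
```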

Thanks, @DaneLLL. I understand that the presets are different encoder tradeoffs. My questions 1-5 above were more specific than that and asked for additional detail or comment on our observations. Could you please help with answering those?

Hi,
For the queries:

  1. VIC frequency is not tied to the encoder. To run the VIC engine at maximum frequency, please run the script:
    VPI - Vision Programming Interface: Performance Benchmark
  2. The capability of encoding/decoding is stated in the module data sheet:
    Log in | NVIDIA Developer
  3. The presets are modes with different video quality and encoding throughput. Please find a mode fitting your use case. And we suggest running with CBR + setting the virtual buffer size.

Thanks. We still have some remaining questions, and require some more detail.

  1. For the Jetson AGX Xavier, the datasheet has the following line for NVENC:
4K60 (4) | 4K30 (8) | 1080p60 (16) | 1080p30 (32)

The number in brackets is “Maximum Number of Streams”.

  • What exactly does that “Maximum Number of Streams” mean? Are these the streams that can be processed truly in parallel without performance penalty?
  • Does this number in brackets assume any specific preset like ULTRAFAST?
  2. In my post above I showed with 15_multivideo_encode how the performance varies between 1x, 2x, 3x, and 4x parallel encodes of 1080p30 with the SLOW preset:

Slow, num_files=1: takes 0m25.962s to finish.
Slow, num_files=2: takes 0m26.466s to finish.
Slow, num_files=3: takes 0m35.490s to finish.
Slow, num_files=4: takes 0m47.256s to finish.

In light of this, my question remains: does the SLOW preset have some choke points where parallel processing between streams is not possible? For example, some additional CPU processing apart from the HW NVENC?

Hi,
The capability in the data sheet is measured under the ULTRAFAST preset. If you would like to achieve that capability, please set ULTRAFAST. In the SLOW preset, certain video-quality-enhancing filters are enabled, so it takes more time to encode a single frame, and throughput is worse in both single and simultaneous multiple encoding tasks.

Thanks, @DaneLLL. Is there more detailed documentation available on which additional filters or functions are enabled when going from ULTRAFAST to FAST to MEDIUM and to SLOW? Or is this proprietary knowledge that is not shared?

Secondly, do all the additional filters for enhancing video quality in SLOW mode run on the NVENC HW alone, or do they involve other HW elements of the Xavier as well (CPU, VIC, etc.)?

Hi,
The details of the hardware presets are private and we are not able to share them. And it is NVENC only; no other hardware engines are involved.

Hi DaneLLL

To be honest, I always thought these were documented somewhere, and I just recalled where:
https://docs.nvidia.com/jetson/archives/r35.4.1/DeveloperGuide/text/SD/Multimedia/AcceleratedGstreamer.html
(it is a longish document; please search for SlowPreset).

It breaks down the Fast/Medium/Slow presets into varying motion vector precision in the ME engine. To be more precise, Fast is said to support full-pel motion vector resolution, Medium half-pel, and Slow quarter-pel. It also lists varying levels of support for Intra vs Inter prediction and specific Intra modes.

From the Intra modes list as well as the available MV precisions, I would guess that the information applies to h.264, and an older (2003) version of said standard. Newer h.264 versions (FRExt) have 8x8 Intra modes not listed there, and h.265 has different modes as well as octapel MV precision. Still, it is a help.

Best
Michal

