Jetson AGX h.265 encode latency

Good morning!

I am working on a Jetson AGX application for my client. At its core are four cameras. Each one is captured at 30 FPS using Argus, and the YUV images of the frames are fed into NvEncoder. There is a shared frame pool, so the interface is zero-copy. NvEncoder is configured to produce an h.265 stream with our custom configuration (slowest encoding preset, among other specifics). The application is very latency-sensitive; we strive for the lowest latency possible.

Now, the first version of the application consisted of four separate processes, each one capturing video from one camera using one Argus session and feeding one encoder, like this:

  • Wait for the next frame from the camera session
  • Feed the frame into the encoder
  • Wait for the bitstream to be available at the encoder output

When we measured encoder latency (basically the time it took to complete the last two steps), it was pretty stable at 10 ms or so.
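
For reference, here is a minimal sketch of that per-frame loop and how we time it. The frame/encoder calls are hypothetical placeholders for the real Argus/NvEncoder plumbing, not actual API names:

```cpp
#include <chrono>
#include <cstdio>

// Hypothetical placeholders standing in for the real Argus/NvEncoder calls.
struct Frame {};                                      // zero-copy YUV buffer from the pool
static Frame* waitForFrame()     { return nullptr; }  // Argus: acquire the next capture
static void   feedFrame(Frame*)  {}                   // NvEncoder: queue the YUV frame
static void   waitForBitstream() {}                   // NvEncoder: dequeue the bitstream

int main() {
    using clock = std::chrono::steady_clock;
    for (;;) {
        Frame* frame = waitForFrame();
        auto t0 = clock::now();                       // start of the measured window
        feedFrame(frame);
        waitForBitstream();
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                      clock::now() - t0).count();
        std::printf("encode latency: %lld us\n", (long long)us);  // ~10 ms per frame here
    }
}
```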

Right now we are working on the second version of the application. Among other things, we would like the option to software-synchronise the cameras, so now a process can create an Argus session with two or three cameras, and the frames are fed to separate encoder instances at the same time. For example, with two cameras I have:

  • Wait for the next set of frames from the recording session
  • Feed frame 0 into encoder 0
  • Feed frame 1 into encoder 1
  • Wait for the bitstream to be available from encoder 0 output
  • Wait for the bitstream to be available from encoder 1 output
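
To make the measurement concrete, here is a sketch of that loop with per-stream timestamps (again with hypothetical placeholder names rather than the real API):

```cpp
#include <chrono>
#include <cstdio>

// Hypothetical placeholders for the multi-sensor Argus session and the two NvEncoders.
struct FrameSet {};                                    // one frame per camera
static FrameSet* waitForFrameSet()                 { return nullptr; }
static void      feedFrame(int /*enc*/, FrameSet*) {}
static void      waitForBitstream(int /*enc*/)     {}

int main() {
    using clock = std::chrono::steady_clock;
    using us    = std::chrono::microseconds;
    for (;;) {
        FrameSet* frames = waitForFrameSet();
        auto t0 = clock::now();                        // both feeds start here
        feedFrame(0, frames);
        feedFrame(1, frames);
        waitForBitstream(0);
        auto t1 = clock::now();                        // ~15 ms in our measurements
        waitForBitstream(1);
        auto t2 = clock::now();                        // ~4+ ms after output 0
        std::printf("stream0: %lld us, stream1: %lld us\n",
                    (long long)std::chrono::duration_cast<us>(t1 - t0).count(),
                    (long long)std::chrono::duration_cast<us>(t2 - t0).count());
    }
}
```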

Now, when we measure the encoding latency for stream 0 and stream 1, we get about 15 ms between feed 0 and output 0, and some 4+ ms more before output 1 becomes available. To be honest, this is a serious problem for us, because we thought the h.265 video encoder of the Jetson AGX could operate in parallel, i.e., we expected both the stream 0 and stream 1 bitstream outputs to be available after 10 ms.

At first, we thought that we were doing something wrong (the single-process camera encoders reported 10 ms, after all). We did a lot of testing, and it turned out that the 10 ms latency is only possible if the hardware is dedicated solely to one process. When the single-process encoders do their work at the same time, we get a very similar latency pattern, with one of the encoders taking more time to encode.

We also performed throughput testing, and it seems that the AGX can perform some 230 frames/second for our stream configuration, i.e., roughly 1000/230 ≈ 4.3 ms/frame (obviously when working in parallel, in saturated fashion).

So our conclusions are:

  • At least part of the hardware encoder is “shared” between separate NvEncoder instances, even if they live in separate processes. That shared part causes the physical encoding to become “serialised”, and the encoder that starts encoding later has to wait for the encoder that started earlier.
  • Therefore, in order to get the lowest possible latency for a single frame, only a single encoder instance should be encoding at any given time (sketched after this list).
  • On the other hand, getting maximum throughput requires several encoders with enough frames supplied that the encoding engine is busy all the time.
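
If the second conclusion holds, one (untested) way to trade throughput for per-frame latency would be to serialise submissions ourselves, e.g. a lock held across feed + wait. A minimal sketch with placeholder encoder calls:

```cpp
#include <mutex>

// Only one frame in flight across all encoder instances in this process.
// feedFrame()/waitForBitstream() are placeholders for the real NvEncoder calls.
static std::mutex g_encodeSlot;

void encodeOneFrame(int encoderId /*, Frame* frame */) {
    std::lock_guard<std::mutex> lock(g_encodeSlot);
    (void)encoderId;                  // used by the placeholder calls below
    // feedFrame(encoderId, frame);      // submit the YUV frame to this encoder
    // waitForBitstream(encoderId);      // block until its bitstream is out
}
```

(Across separate processes the same idea would need an inter-process lock, e.g. a named semaphore.)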

I will be very grateful for any comments, hints or explanations related to the above, especially with respect to how to get the lowest frame latency possible for several h.265 streams at the same time.

Best
Michal

Hi,
Please run the script to enable the system in maximum performance mode:
VPI - Vision Programming Interface: Performance Benchmark

And also enable max perf mode for the encoder. There are two hardware encoder engines in Xavier. Please run sudo tegrastats to make sure both are enabled in your application.

For Argus + video encoding, you may refer to the sample:

/usr/src/jetson_multimedia_api/samples/10_camera_recording

For multi video encoding, please refer to

/usr/src/jetson_multimedia_api/samples/15_multivideo_encode

Thanks for the reply! I did the following:

  1. Tried the encoding in both --max and --restore modes. Performance seems unaffected.
  2. Made sure I was calling setMaxPerfMode(1) on all encoders. Yeah, I was. When I tried doing the same with setMaxPerfMode(0) it was much slower.
  3. Ran sudo tegrastats while the encoder was running and got the following:

```
RAM 4559/31930MB (lfb 38x4MB) SWAP 302/15965MB (cached 1MB) CPU [20%@1190,16%@1190,1%@1190,1%@1267,2%@1341,2%@1343,7%@1550,10%@1574] EMC_FREQ 4%@2133 GR3D_FREQ 0%@318 NVENC 1075 NVENC1 1075 VIC_FREQ 0%@1036 APE 150 MTS fg 0% bg 7% AO@38C GPU@37.5C iwlwifi@40C Tdiode@40.75C PMIC@50C AUX@38.5C CPU@39C thermal@38.35C Tboard@36C GPU 309/309 CPU 773/773 SOC 5566/5566 CV 0/0 VDDRQ 463/463 SYS5V 3295/3295
```

I assume that `NVENC 1075 NVENC1 1075` means that I have two encode engines running at 1075 MHz each.
Unfortunately the problem persists...

  4. I know and have studied both examples. I will instrument 15_multivideo_encode to test whether it suffers the same performance penalty (increased latency) when encoding more than one stream.

Thanks!
Michał

Thanks, @DaneLLL. A couple of follow-up questions:

  1. Is the VIC freq tied to the NVENC freq, i.e., will assigning maxfreq to vicfreqctrl also maximize the NVENC encoder freq on the Jetson AGX Xavier? Or are there different parts of the /sys/devices tree to write to in order to maximize performance of the NVENC encoder specifically?
  2. How many streams can be processed truly in parallel by the hardware encoder engines? Is that two, because of the two engines, or can each engine also encode multiple streams truly in parallel?
  3. We ran 15_multivideo_encode with the default ULTRAFAST encoder preset, followed by another run with the SLOW encoder setting, and measured execution time with the time tool, like this:
$ time multivideo_encode num_files 4 ~/file.yuv 1920 1080 H265 ~/file1.h265 ~/file.yuv 1920 1080 H265 ~/file2.h265 ~/file.yuv 1920 1080 H265 ~/file3.h265 ~/file.yuv 1920 1080 H265 ~/file4.h265

For different values of num_files, the results are below:

Ultrafast, num_files=1: real 0m12.918s
Ultrafast, num_files=2: real 0m10.291s
Ultrafast, num_files=3: real 0m12.071s
Ultrafast, num_files=4: real 0m12.295s

Code changed to preset “slow” (a one-line switch, sketched after the results below), and re-ran multivideo_encode as above:

Slow, num_files=1: real 0m25.962s
Slow, num_files=2: real 0m26.466s
Slow, num_files=3: real 0m35.490s
Slow, num_files=4: real 0m47.256s
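
For reference, the preset is selected via NvVideoEncoder's setHWPresetType (names from jetson_multimedia_api), so the switch is a one-liner along these lines, with the surrounding encoder setup assumed:

```cpp
#include "NvVideoEncoder.h"  // jetson_multimedia_api

// Select the hardware preset; enum values are from v4l2_nv_extensions.h.
void selectPreset(NvVideoEncoder* enc, bool slow) {
    enc->setHWPresetType(slow ? V4L2_ENC_HW_PRESET_SLOW
                              : V4L2_ENC_HW_PRESET_ULTRAFAST);
}
```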

It appears that the SLOW preset is not able to process more than 2 streams in parallel. Are there specific reasons (HW chokepoints) for the SLOW versus ULTRAFAST setting that would explain this behavior?

  4. For the ULTRAFAST, FAST, MEDIUM, and SLOW presets, are you able to share what these settings individually enable/disable for the encoder?
  5. Should the FAST & MEDIUM settings have better parallel performance? What do you lose with those compared to SLOW?

Thanks!

Hi,
The presets are for different performance and quality tradeoffs. Certain encoding filters that enhance video quality are disabled for better throughput in the ULTRAFAST and FAST modes. You need to select a mode that balances video quality and encoding performance.

And please also try CBR + setting the virtual buffer size:
Random blockiness in the picture RTSP server-client - Jetson TX2 - #5 by DaneLLL
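
For illustration, on NvVideoEncoder that suggestion maps to calls like the following (method names from jetson_multimedia_api; the bitrate and buffer-size values are placeholders, not recommendations):

```cpp
#include "NvVideoEncoder.h"   // jetson_multimedia_api
#include <linux/videodev2.h>

// Hedged sketch: CBR rate control plus an explicit virtual buffer size.
void configureCbr(NvVideoEncoder* enc) {
    enc->setRateControlMode(V4L2_MPEG_VIDEO_BITRATE_MODE_CBR);  // constant bitrate
    enc->setBitrate(8 * 1000 * 1000);                 // e.g. 8 Mbit/s (placeholder)
    enc->setVirtualBufferSize(8 * 1000 * 1000 / 30);  // ~one frame's worth of bits
}
```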

Thanks, @DaneLLL. I understand that the presets are different encoder tradeoffs. My questions 1-5 above were more specific than that and asked for additional detail or comment on our observations. Could you please help with answering those?

Hi,
For the queries:

  1. VIC frequency is not tied to the encoder. To run the VIC engine at maximum frequency, please run the script:
    VPI - Vision Programming Interface: Performance Benchmark
  2. The capability of encoding/decoding is stated in the module data sheet:
    Log in | NVIDIA Developer
  3. The presets are modes with different video quality and encoding throughput. Please find a mode fitting your use case. And we suggest running with CBR + setting the virtual buffer size.

Thanks. We still have some remaining questions, and require some more detail.

  1. For the Jetson AGX Xavier, the datasheet has the following line for NVENC:
4K60 (4) | 4K30 (8) | 1080p60 (16) | 1080p30 (32)

The number in brackets is “Maximum Number of Streams”.

  • What exactly does that “Maximum Number of Streams” mean? Are these the streams that can be processed truly in parallel without performance penalty?
  • Does this number in brackets assume any specific preset like ULTRAFAST?
  2. In my post above I showed with 15_multivideo_encode how the performance varies between 1x, 2x, 3x, and 4x parallel encodes of 1080p30 with the SLOW preset:

Slow, num_files=1: takes 0m25.962s to finish.
Slow, num_files=2: takes 0m26.466s to finish.
Slow, num_files=3: takes 0m35.490s to finish.
Slow, num_files=4: takes 0m47.256s to finish.

In light of this, my question remains: does the SLOW preset have some choke points where parallel processing between streams is not possible? For example, some additional CPU processing apart from the HW NVENC?

Hi,
The capability in the data sheet is measured under the ULTRAFAST preset. If you would like to achieve that capability, please set ULTRAFAST. In the SLOW preset, certain video-quality-enhancing filters are enabled, so it takes more time to encode a single frame, and throughput is worse in both single and simultaneous multiple encoding tasks.

Thanks, @DaneLLL. Is there more detailed documentation available on which additional filters or functions are enabled when going from ULTRAFAST to FAST to MEDIUM and to SLOW? Or is this proprietary knowledge that is not shared?

Secondly, do all the additional filters for enhancing video quality in SLOW mode run on the NVENC HW alone, or do they involve other HW elements of the Xavier as well (CPU, VIC, etc.)?

Hi,
The details of the hardware presets are private and we are not able to share them. And it is NVENC only; no other hardware engines are involved.

Hi DaneLLL

To be honest, I always thought these were documented somewhere, and I just recalled where:
https://docs.nvidia.com/jetson/archives/r35.4.1/DeveloperGuide/text/SD/Multimedia/AcceleratedGstreamer.html
(it is a longish document; please search for SlowPreset).

It breaks down the Fast/Medium/Slow presets into varying motion vector precision in the ME engine. To be more precise, Fast is said to support full-pel motion vector resolution, Medium half-pel, and Slow quarter-pel. It also lists varying levels of support for Intra vs Inter prediction and specific Intra modes.

From the Intra modes list as well as the available MV precisions, I would guess that the information applies to h.264, and an older (2003) version of said standard. Newer h.264 versions (FRExt) have 8x8 Intra modes not listed there, and h.265 has different modes as well as octapel MV precision. Still, it is a help.

Best
Michal

