Jetson H264 Hardware Encoder: Lower Latency with Higher Input FPS and Multi-Stream Encoding

I’ve been profiling the Jetson H264 hardware encoder on a 64GB AGX Orin.

video_encode_main.cpp.zip (17.6 KB)

It’s interesting that encoding latency significantly decreases with higher video input FPS.

Profiling setup:

  • Single 1080P video stream taken from a high-motion wolf documentary
  • Codec config: --stats --blocking-mode 1 --sp -br 20000000 -pbr 30000000 --max-perf -hpt 1 -nbf 0 -nrf 1 -ifi 30 -idri 30 -rc cbr -mem_type_oplane 3 -mem_type_cplane 2 --report-metadata

When video frames are available immediately after the previous frame is encoded (gstreamer is-live=false, and samples/01_video_encode), the encoder easily achieves 150 FPS at about 7 ms per-frame latency.

However, when video frames are fed to the encoder in real time, per-frame encoding latency becomes significantly higher. For instance, at 30 FPS, where a frame is fed to the encoder every 33 ms, we observed ~20 ms per-frame latency.

----------- Element = enc0 -----------
Total Profiling time = 9.5781
Average FPS = 31.3214
Total units processed = 301
Average latency(usec) = 19972
Minimum latency(usec) = 49
Maximum latency(usec) = 22186
-------------------------------------

To simulate a 30 FPS video feed, we modified samples/01_video_encode/video_encode_main.cpp; the modified file is attached in the zip. It can simulate real-time video feeds at different rates by changing the TARGET_FPS constant from 30 to 60, 120, or 200.
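The pacing idea can be sketched as follows. This is a minimal illustration, not the code from the attached zip: `FramePacer`, `frame_interval`, and `wait_for_next_frame` are hypothetical names, and the sketch assumes the producer loop calls `wait_for_next_frame()` once before queuing each frame to the encoder.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical helper, not taken from the attached sample.
class FramePacer {
public:
    using Clock = std::chrono::steady_clock;

    // Interval between frames for a given rate, e.g. 30 FPS -> ~33.3 ms.
    static std::chrono::nanoseconds frame_interval(int target_fps) {
        assert(target_fps > 0);
        return std::chrono::nanoseconds(1000000000LL / target_fps);
    }

    explicit FramePacer(int target_fps)
        : interval_(std::chrono::duration_cast<Clock::duration>(
              frame_interval(target_fps))),
          next_(Clock::now()) {}

    // Sleeps until the current frame's deadline, then schedules the next one.
    // Deadlines are absolute, so one slow iteration does not shift later frames.
    void wait_for_next_frame() {
        std::this_thread::sleep_until(next_);
        next_ += interval_;
    }

private:
    Clock::duration interval_;
    Clock::time_point next_;
};
```

In a feed loop this would look like `FramePacer pacer(TARGET_FPS);` once, then `pacer.wait_for_next_frame();` before each frame is queued. The first call returns immediately, and each later call waits for the next 1/TARGET_FPS deadline.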

Interestingly, per-frame encoding latency keeps dropping as the video input FPS increases. It seems that encoding latency is significantly lower when the HW encoder is busier.

60FPS → 16.5ms

----------- Element = enc0 -----------
Total Profiling time = 4.64771
Average FPS = 64.5479
Total units processed = 301
Average latency(usec) = 16537
Minimum latency(usec) = 46
Maximum latency(usec) = 19399
-------------------------------------

120FPS → 10.8ms

----------- Element = enc0 -----------
Total Profiling time = 2.33117
Average FPS = 128.691
Total units processed = 301
Average latency(usec) = 10882
Minimum latency(usec) = 3186
Maximum latency(usec) = 12735
-------------------------------------

200FPS (essentially same as original video_encode_main) → 7.1ms

----------- Element = enc0 -----------
Total Profiling time = 1.45988
Average FPS = 205.496
Total units processed = 301
Average latency(usec) = 7144
Minimum latency(usec) = 2022
Maximum latency(usec) = 12750
-------------------------------------

I made sure the system clocks are maxed out, but encoding latency is still ~20 ms at 30 FPS. **Is there anything else I can try to get 7 ms encoding latency for a 30 FPS real-time feed?**
Thank you!

*** Please note that this reply is generated by LLM automatically ***

To address the issue of high encoding latency for 30FPS real-time feed on the Jetson AGX Orin, several factors need to be considered. The provided profiling data indicates that the encoder can achieve 150fps with 7ms per frame latency when video frames are immediately available after the previous frame is encoded. However, when video frames are provided to the encoder in real-time, the per-frame encoding latency becomes significantly higher, reaching 20ms for 30FPS.

Several potential reasons for this discrepancy can be identified:

  1. Input buffer size and management: The encoder’s input buffer size and management might be causing the increased latency. When video frames are immediately available, the encoder can process them quickly, but when frames are provided in real-time, the encoder might be waiting for the input buffer to fill up before processing, leading to increased latency.

  2. Encoder configuration: The provided codec configuration statistics show that the encoder is configured for constant bitrate (CBR) mode with a bitrate of 20,000,000 and a peak bitrate of 30,000,000. This configuration might not be optimal for low-latency encoding. Consider adjusting the encoder configuration to prioritize low latency over bitrate.

  3. System clock and resource utilization: Although you mentioned that you maxed out the system clock, it’s essential to verify that the system resources (e.g., CPU, memory, and GPU) are not being bottlenecked. Ensure that the system is not experiencing any resource contention that could be contributing to the increased latency.

  4. Encoder implementation and optimization: The encoder implementation and optimization might be a limiting factor. Consider using a more optimized encoder implementation or exploring alternative encoding algorithms that are designed for low-latency applications.

To achieve 7ms encoding latency for 30FPS real-time feed, you can try the following:

  1. Adjust the encoder configuration: Experiment with different encoder configurations, prioritizing low latency over bitrate. This might involve adjusting the bitrate, GOP size, or other encoding parameters.

  2. Optimize input buffer management: Investigate the input buffer size and management to ensure that it’s optimized for low-latency encoding. Consider using a smaller input buffer size or implementing a more efficient buffer management strategy.

  3. Profile and optimize system resources: Use profiling tools to identify any system resource bottlenecks and optimize the system configuration to minimize resource contention.

  4. Explore alternative encoder implementations: Research and evaluate alternative encoder implementations that are designed for low-latency applications. This might involve using a different encoding algorithm or a more optimized encoder implementation.

  5. Consider using a hardware-accelerated encoder: If the Jetson AGX Orin has a hardware-accelerated encoder, consider using it instead of the software-based encoder. Hardware-accelerated encoders are often optimized for low-latency applications and can provide better performance.

In comparison to other Jetson modules, the Jetson AGX Orin has a more powerful GPU and CPU, which should provide better performance for encoding tasks. However, the actual performance will depend on the specific use case, encoder configuration, and system resource utilization.

To simulate different real-time video feed FPS, you can modify the TARGET_FPS constant in the provided code. However, keep in mind that this might not accurately reflect the actual performance of the encoder in a real-world scenario.

In summary, to achieve 7ms encoding latency for 30FPS real-time feed on the Jetson AGX Orin, you should investigate and optimize the encoder configuration, input buffer management, system resources, and encoder implementation. Additionally, consider exploring alternative encoder implementations and hardware-accelerated encoders to achieve better performance.

*** Please give the thumbs up if you get answers you like. Or provide feedback to help us improve the answer. ***

Thanks LLM. I’ve tried all options above, but still not seeing 7ms encoding latency with 30FPS video feed.

Hi,
It looks like encoder_capture_plane_dq_callback gets impacted by source frame rate. Please refer to the sample for implementation:

> /usr/src/jetson_multimedia_api/samples/unittest_samples/encoder_unit_sample

Use it as a reference for running a dq_thread() instead of registering a dq_callback.

Above comment is not correct.

Hi @DaneLLL , thanks for the reply!

I used V4L2_BUF_FLAG_TIMESTAMP_COPY to measure encoding latency inside encoder_unit_sample, but it shows an average of 50 ms. I realized that encoder_unit_sample lacks the NvElement APIs needed to turn on all the optimizations.
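A minimal sketch of that kind of measurement, assuming the submit time is written into the timestamp of the buffer queued on the output plane and that the encoder copies it unchanged to the matching capture plane buffer (which is what V4L2_BUF_FLAG_TIMESTAMP_COPY indicates). The helper names are hypothetical, and the V4L2 queue and dequeue calls themselves are left out.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <sys/time.h>

// Hypothetical helpers; none of these names come from the Jetson samples.

// Current monotonic time in microseconds.
inline int64_t now_usec() {
    using namespace std::chrono;
    return duration_cast<microseconds>(
               steady_clock::now().time_since_epoch()).count();
}

// Pack a microsecond count into a timeval (the type of v4l2_buffer::timestamp).
inline timeval to_timeval(int64_t usec) {
    assert(usec >= 0);
    timeval tv;
    tv.tv_sec = static_cast<time_t>(usec / 1000000);
    tv.tv_usec = static_cast<suseconds_t>(usec % 1000000);
    return tv;
}

// Unpack a timeval back into microseconds.
inline int64_t to_usec(const timeval &tv) {
    return static_cast<int64_t>(tv.tv_sec) * 1000000 + tv.tv_usec;
}

// Latency of one frame: dequeue time minus the submit time carried
// in the copied timestamp.
inline int64_t latency_usec(const timeval &copied_timestamp,
                            int64_t dequeue_usec) {
    return dequeue_usec - to_usec(copied_timestamp);
}
```

Under these assumptions, the producer would set the buffer timestamp to `to_timeval(now_usec())` before queuing, and the consumer would call `latency_usec(timestamp, now_usec())` on each dequeued capture buffer.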

Quick questions:

  1. Do you have any working script proving <10ms encoding latency for 30FPS real-time feed?

  2. Instead of using encoder_unit_sample, should I try non-blocking mode or 01_video_encode sample?

Hi,
Sorry, the previous comment is not correct. In the 01 sample, the dq thread is actually created in NvV4L2ElementPlane.cpp, so both samples use the same implementation.

This is due to a low-level mechanism. The hardware encoder keeps one reference frame, so only after the next frame is queued to the output plane can you dequeue the encoded data from the capture plane. There is a one-frame latency.

Hi @DaneLLL , thanks for the reply.

To clarify my understanding: do you mean the HW encoder releases encoded frames when the output plane containing the reference frame is extracted? This is a bit counterintuitive, because there are multiple output planes, so the extracted output plane doesn’t necessarily contain the reference frame. Are there any lower-level ioctl() operations that force the capture plane to deliver a frame ASAP?

Another interesting behavior is that, when running multiple encoding streams, the first stream’s encoding latency becomes lower.

Is it possible that encoding becomes faster when the HW encoder is busier?

I’m testing on Orin, and I wonder whether the HW encoder is improved on Thor.

Appreciate you.

Hi,
Please apply this and see if it helps:

~$ sudo su
/home/user# cd /sys/devices/platform/bus@0/13e00000.host1x/154c0000.nvenc/devfreq/154c0000.nvenc
/sys/devices/platform/bus@0/13e00000.host1x/154c0000.nvenc/devfreq/154c0000.nvenc# cat available_frequencies    # note the highest value listed
/sys/devices/platform/bus@0/13e00000.host1x/154c0000.nvenc/devfreq/154c0000.nvenc# echo <maximum_frequency> > min_freq    # substitute the highest value from the previous step
/sys/devices/platform/bus@0/13e00000.host1x/154c0000.nvenc/devfreq/154c0000.nvenc# exit
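
The same steps could also be done from a program. The sketch below is an illustration only: the function names are hypothetical, it assumes `available_frequencies` holds whitespace-separated integers, and writing `min_freq` requires root. The devfreq directory is the one used in the commands above and may differ on other boards or releases.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical helpers, not part of any NVIDIA tool.

// Returns the largest frequency in a whitespace-separated list,
// or 0 if no value could be parsed.
inline uint64_t max_frequency(const std::string &available) {
    std::istringstream in(available);
    uint64_t best = 0, value = 0;
    while (in >> value) {
        if (value > best) best = value;
    }
    return best;
}

// Reads <dir>/available_frequencies and writes the maximum to <dir>/min_freq.
// Returns false if a file cannot be opened or no frequency was found.
inline bool pin_min_freq_to_max(const std::string &devfreq_dir) {
    assert(!devfreq_dir.empty());
    std::ifstream in(devfreq_dir + "/available_frequencies");
    if (!in) return false;
    std::stringstream contents;
    contents << in.rdbuf();
    const uint64_t best = max_frequency(contents.str());
    if (best == 0) return false;
    std::ofstream out(devfreq_dir + "/min_freq");
    if (!out) return false;
    out << best << std::endl;
    return static_cast<bool>(out);
}
```

Calling `pin_min_freq_to_max("/sys/devices/platform/bus@0/13e00000.host1x/154c0000.nvenc/devfreq/154c0000.nvenc")` as root would mirror the manual steps. Like the shell commands, the setting is a plain sysfs write, so it would presumably need to be reapplied after a reboot.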

@DaneLLL , oh wow, that max frequency command worked. Now 1080P has ~7ms latency!

Could you share some context behind it? Was the HW encoder applying some power-saving mechanism? Are there any tradeoffs to keeping it at max frequency besides power consumption? Where can I learn more about such low-level tricks?

Appreciate you so much!

Hi,
It looks like --max-perf does not take effect. It should be a bug and we will check further. For now, please execute the manual steps to keep the NVENC frequency at maximum.

@DaneLLL , thanks for the update.

I’m available to help debug on my end if needed.

Cheers.

Hi,
Our teams are checking it. For now, please adopt the quick solution.

Hi,
We have checked and confirmed that maximum performance mode is deprecated on JetPack 6. Please execute the manual steps to enable it.
We will remove the option from the samples in the next release.

Hi, thank you for the update. Is it possible to add back the --max-perf option? From benchmarking, this is a critical setting for minimizing HW encoder latency.

Hi,
Please enable maximum performance mode by configuring the device nodes.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.