Does Jetson AGX Orin NVENC Support Sub-Frame (Line-by-Line) Encoding Without Full Frame Buffering? (FPGA via PCIe Input)

Hello NVIDIA Jetson Community,

We are evaluating Jetson AGX Orin for a low-latency video encoding (H.264 / H.265) application and have a specific question about the NVENC hardware encoder’s capabilities.

Use Case

We have an FPGA connected to Jetson AGX Orin via PCIe that captures and streams video data. The FPGA delivers pixel data line-by-line (scanline-by-scanline) over PCIe — meaning the complete frame is not immediately available on the Orin side but arrives progressively as the sensor reads out.

Minimizing end-to-end encode latency is critical. Ideally, we would like the NVENC encoder to begin encoding as soon as enough scanlines (e.g., a CTU row) have been DMA’d from the FPGA into Orin’s memory, rather than waiting for the entire frame to be fully transferred and assembled.

Architecture Overview

[Sensor] → [FPGA] ------ PCIe DMA (line-by-line) -------> [Jetson AGX Orin memory] → [NVENC] → [encoded bitstream]

The FPGA transfers scanlines (or groups of scanlines) into Orin’s DRAM via PCIe DMA as they become available. Currently, we must wait for the full frame to be assembled in a buffer before submitting it to the encoder, which adds up to one full frame period of latency.

Questions

  1. Sub-frame input support: Does the NVENC hardware block on Jetson AGX Orin support any mode where encoding can begin on partial frame data (e.g., a set of scanlines, CTU rows, or slices) before the entire frame is available in the buffer?

  2. API-level support: Is there any mechanism — via V4L2, NvMedia, or another low-level API — for feeding scanlines or groups of lines incrementally to the encoder, so it can start processing CTU/macroblock rows in parallel with the ongoing PCIe DMA transfer from the FPGA?

  3. Frame-based confirmation: Or is the encoder strictly frame-based, requiring a fully populated image buffer to be queued (e.g., via V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE) before encoding begins?

  4. PCIe DMA considerations: Are there any recommended DMA buffer layouts or memory allocation strategies (e.g., NvBuf, GBM, dma-buf) that could help minimize latency when ingesting video data from an FPGA over PCIe and passing it to NVENC?

  5. Alternative approaches: If sub-frame encoding is not supported, what is the recommended approach to minimize encode latency in this pipeline? For example:

    • Encoding smaller “pseudo-frames” (horizontal strips) independently?

    • Using slice-based encoding with one slice per CTU row?

    • Any NVIDIA-recommended low-latency pipeline architecture for progressive PCIe input?

  6. Future support: Is sub-frame or line-level encoding planned for any future JetPack / L4T release?

What We Have Already Reviewed

  • Jetson Linux Multimedia API documentation

  • V4L2 video encoder samples (01_video_encode)

  • NVIDIA Video Codec SDK documentation

  • GStreamer encoder plugin documentation (nvv4l2h264enc / nvv4l2h265enc)

  • PCIe endpoint and DMA documentation for Jetson AGX Orin

We did not find any references to sub-frame or line-level encoding in these resources, but wanted to confirm whether this capability exists at the hardware level or is accessible through a lower-level interface.

Any insights, documentation references, or suggested workarounds would be greatly appreciated. We are open to using low-level APIs if they enable this capability. Thank you in advance!

Motti

*** Please note that this reply is generated by LLM automatically ***

Based on the provided context, it appears that the NVENC hardware encoder on Jetson AGX Orin does not support sub-frame or line-level encoding. The encoder seems to be strictly frame-based, requiring a fully populated image buffer to be queued before encoding begins.

The documentation and resources provided, including the Jetson Linux Multimedia API documentation, V4L2 video encoder samples, NVIDIA Video Codec SDK documentation, GStreamer encoder plugin documentation, and PCIe endpoint and DMA documentation for Jetson AGX Orin, do not mention sub-frame or line-level encoding capabilities.

However, there are a few potential workarounds to minimize encode latency in this pipeline:

  1. Encoding smaller pseudo-frames: Divide the frame into smaller horizontal strips and encode them independently. This approach can help reduce the latency, but it may also affect the encoding efficiency and quality.
  2. Slice-based encoding: Use slice-based encoding with one slice per CTU row. This approach can help parallelize the encoding process, but it may also require additional processing and memory allocation.
  3. Using NvBuf or GBM: Use NvBuf or GBM (Graphics Buffer Manager) to manage the DMA buffer layouts and memory allocation. These APIs can help optimize the memory allocation and reduce the latency.
  4. Optimizing the DMA transfer: Optimize the DMA transfer from the FPGA to the Jetson AGX Orin by using techniques such as DMA scatter-gather, DMA burst mode, or PCIe peer-to-peer transfer.

It’s worth noting that the VIC-based nvvidconv on Jetson supports various RAW-YUV input formats for scaling, including I420, UYVY, YUY2, YVYU, NV12, NV16, NV24, P010_10LE, GRAY8, BGRx, RGBA, and Y42B. However, this does not imply that the NVENC hardware encoder supports sub-frame or line-level encoding.

In summary, while the NVENC hardware encoder on Jetson AGX Orin does not seem to support sub-frame or line-level encoding, there are potential workarounds to minimize encode latency in this pipeline. Further investigation and experimentation with the provided APIs and techniques may be necessary to achieve the desired latency and encoding efficiency.

Comparison of the VIC-based nvvidconv and CUDA-based nvvidconv:

  • VIC-based nvvidconv supports more RAW-YUV input formats for scaling, including I420, UYVY, YUY2, YVYU, NV12, NV16, NV24, P010_10LE, GRAY8, BGRx, RGBA, and Y42B.
  • CUDA-based nvvidconv supports fewer RAW-YUV input formats for scaling, including I420, NV12, P010_10LE, GRAY8, BGRx, and RGBA.
  • VIC-based nvvidconv is used for VIC-based scaling, while CUDA-based nvvidconv is used for CUDA-based scaling.

It’s not clear which approach is more suitable for the specific use case, as it depends on the specific requirements and constraints of the application. However, the VIC-based nvvidconv seems to support more RAW-YUV input formats, which may be beneficial for certain use cases.

*** Please give the thumbs up if you get answers you like. Or provide feedback to help us improve the answer. ***

Hi,
You can enable slice output by configuring these options in the 01_video_encode sample:

        -slt <type>           Slice length type (1 = Number of MBs, 2 = Bytes) [Default = 1]
        -slen <length>        Slice length [Default = 0]
        --sle                 Slice level encode output [Default = disabled]

Examples:
How to set v4l2_enc_hw_tuning_info_type in the xavier encoder? - #9 by DaneLLL

Input frame data has to be full frames, so please follow the approach in:
Streaming NAL Unit of a slice as soon as it is generated by the h264 encoder - #10 by DaneLLL
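
With -slt 1 the slice length is counted in macroblocks, per the help text above. As a rough sketch of how a value for -slen might be chosen, the snippet below computes the number of macroblocks in one row of an H.264 frame (16x16 macroblocks). It assumes, without confirmation in this thread, that a slice length equal to one macroblock row yields one slice per row; the helper names are hypothetical and not part of the sample.

```python
MB_SIZE = 16  # H.264 macroblocks cover 16x16 luma samples


def mbs_per_row(width: int) -> int:
    """Macroblocks needed to cover one row of the frame (rounded up)."""
    return (width + MB_SIZE - 1) // MB_SIZE


def mb_rows(height: int) -> int:
    """Macroblock rows needed to cover the frame height (rounded up)."""
    return (height + MB_SIZE - 1) // MB_SIZE


width, height = 1920, 1080
# Assumption: passing this value as -slen (with -slt 1) gives one slice per MB row.
print("candidate -slen value:", mbs_per_row(width))
print("slices per frame under that assumption:", mb_rows(height))
```

The input buffer must still hold a complete frame before it is queued; this only changes how the output bitstream is divided.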

Thank you for the response. I noted the disclaimer:

Please note that this reply is generated by LLM automatically

I appreciate the transparency. However, since my question concerns a specific hardware capability of the NVENC block on Jetson AGX Orin, an automatically generated response cannot provide the level of certainty required for an architectural decision. I am therefore requesting review and confirmation from an NVIDIA engineer or knowledgeable community member.


1. The Core Question Remains Unanswered

My question is whether the NVENC hardware is architecturally frame-based, or whether it internally processes at CTU-row (or slice) granularity but this capability is simply not exposed via current APIs.

An LLM can summarize public documentation, but it cannot confirm undocumented or hardware-level behavior. I came to the forum specifically to get a definitive answer beyond what the public docs provide.


2. Our Specific Setup

In our system:

  • A sensor feeds video into an FPGA

  • The FPGA injects the video into Jetson AGX Orin over PCIe

  • The FPGA performs line-by-line PCIe DMA transfers into Orin DRAM

  • Lines are delivered progressively in real time as the sensor scans the frame

Critically, the FPGA does not buffer a complete frame before transferring it. Each scanline is DMA’d to Orin as soon as it is available from the sensor. This means the accumulation of a full frame in Orin memory takes approximately the entire frame period.


3. Why This Matters: The Real Latency Problem

At 50 FPS, the frame period is:

1 / 50 = 20 ms per frame

Because scanlines are injected line-by-line over PCIe in real time, a full frame takes approximately the entire 20 ms frame period to accumulate in Orin’s memory.

The dominant latency contributor is frame accumulation, not encoding. This is not a burst DMA scenario — the data arrives progressively over the full frame interval.


4. What Would Solve the Problem

The only architectural change that would significantly reduce latency is sub-frame encoding, where NVENC begins processing CTU rows as soon as enough scanlines have been injected by the FPGA.

Ideal sub-frame pipeline:

FPGA injects lines 0–63 → CTU row 0 ready → NVENC begins encoding row 0

FPGA injects lines 64–127 → CTU row 1 ready → NVENC encodes row 1 (while DMA continues)

FPGA injects last lines → last CTU row → NVENC finishes almost immediately after last scanline arrives

For a 1080p frame with 64-line CTU rows, accumulating one CTU row takes 64 / 1080 × 20 ms ≈ 1.2 ms.

If NVENC could start encoding after the first CTU row:

Latency ≈ 1–2 ms + encode tail time

instead of ≈ 20 ms + full encode time

That is a dramatic, order-of-magnitude reduction — and a fundamental architectural difference for our system.
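
The arithmetic above can be reproduced with a short calculation. This is a minimal sketch, assuming the sensor delivers lines uniformly over the whole frame period (blanking intervals are ignored); the function names are hypothetical and not part of any NVIDIA API.

```python
def frame_period_ms(fps: float) -> float:
    """Duration of one frame in milliseconds."""
    return 1000.0 / fps


def lines_accumulation_ms(lines: int, frame_height: int, fps: float) -> float:
    """Time for `lines` scanlines to arrive, assuming uniform line arrival."""
    return lines / frame_height * frame_period_ms(fps)


fps = 50
frame_height = 1080
ctu_row_lines = 64  # CTU row height used in the example above

full_frame = frame_period_ms(fps)
one_ctu_row = lines_accumulation_ms(ctu_row_lines, frame_height, fps)

print(f"full-frame accumulation: {full_frame:.1f} ms")
print(f"one CTU row accumulation: {one_ctu_row:.2f} ms")
```

Changing `fps`, `frame_height`, or `ctu_row_lines` shows how the gap between full-frame and row-level accumulation scales for other sensor modes.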


5. Clarification Needed on Suggested Workarounds

Slice-Based Encoding

If configured with one slice per CTU row, a critical point needs confirmation: does the V4L2 encoder still require the entire frame buffer to be fully populated before QBUF submission?

If yes, then slices only affect bitstream structure and decoder-side latency — they do not reduce encoder-input latency, which is the metric we need to optimize.

Pseudo-Frames (Horizontal Strips)

Encoding smaller strips independently would reduce accumulation delay, but at the cost of:

  • Broken inter-frame prediction across strip boundaries

  • Significantly reduced compression efficiency

  • Custom bitstream handling on the decoder side

Is this a supported and tested approach? Are there benchmarks or official guidance on the quality/efficiency tradeoff?
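
To make the accumulation side of this tradeoff concrete, here is a rough sketch that estimates how long each strip takes to arrive when a frame is split into equal-height strips. It assumes uniform line arrival over the frame period and does not model compression loss, per-strip encoder overhead, or any alignment of strip heights to the encoder's block size; `strip_accumulation_ms` is a hypothetical helper.

```python
def strip_accumulation_ms(n_strips: int, fps: float) -> float:
    """Arrival time of one strip when a frame is split into n equal strips."""
    if n_strips < 1:
        raise ValueError("n_strips must be at least 1")
    return 1000.0 / fps / n_strips


fps = 50
for n_strips in (1, 2, 4, 8):
    delay = strip_accumulation_ms(n_strips, fps)
    print(f"{n_strips} strip(s): {delay:.2f} ms accumulation per strip")
```

The accumulation delay falls in proportion to the number of strips, while each strip becomes an independently coded picture, which is where the efficiency and decoder-side costs listed above come from.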

NvBuf / GBM and DMA Optimization

These are buffer allocation and management mechanisms. They do not address the fundamental question of whether encoding can begin before a full frame is assembled. Could you clarify how these specifically reduce encoder-input latency when the FPGA is injecting scanlines progressively over PCIe?

VIC-Based vs. CUDA-Based nvvidconv

The comparison of pixel format support does not appear related to sub-frame encoding latency. Could you clarify its relevance to this discussion?


6. What We Specifically Need Confirmed

We respectfully request confirmation from a human NVIDIA engineer:

| Option | Implication |
| --- | --- |
| ✅ NVENC supports sub-frame encoding: it can begin processing CTU rows before the full frame is available, via some API or low-level mechanism. | We can pursue a true low-latency pipeline where encoding overlaps with line-by-line DMA from our FPGA. |
| ❌ NVENC strictly requires a complete frame buffer before encoding begins (hardware limitation). | We must accept ~20 ms accumulation latency at 50 FPS or fundamentally redesign the system architecture. |

Additionally:

  1. If relevant details exist in the Technical Reference Manual (TRM) under NDA, please advise how we can obtain access.

  2. If NVIDIA has a recommended low-latency architecture for setups like ours — where an FPGA injects video line-by-line over PCIe into Orin’s memory — a pointer to reference designs or application notes would be invaluable.


Summary

In our setup, an FPGA injects video into Orin’s DRAM line-by-line via PCIe DMA, spread across the full 20 ms frame period at 50 FPS. The frame accumulation time dominates our latency budget. The only meaningful way to reduce this is if NVENC can begin encoding before the full frame has been injected by the FPGA.

The LLM-generated response, while well-intentioned, cannot provide the authoritative hardware-level confirmation this question requires.

Can NVENC on Jetson AGX Orin begin encoding before a complete frame buffer is available — specifically, can it start processing CTU rows as they are progressively DMA’d from our FPGA over PCIe?

A definitive yes or no will allow us to finalize our system design.

Thank you for your time and support.

Regards,

Motti

Thank you for the clarification.

Enabling --sle will help reduce output-side latency — understood and appreciated.

What is the equivalent of the --sle option in a GStreamer NVENC-based pipeline?

However, the core issue remains: “Input frame data has to be full frames” means our main bottleneck — the ~20 ms frame accumulation delay (FPGA injecting lines over PCIe into Orin DRAM at 50 FPS) — is unchanged.

Summary of our understanding:

  • ✅ Slice-level output reduces encoder->decoder latency

  • ❌ Sub-frame (partial-frame) input is not supported

One key question remains: Is there any planned capability or recommended architecture for NVENC to accept progressive/partial-frame input to reduce end-to-end latency in FPGA-to-encoder pipelines?

We’ll proceed assuming full-frame input is required and architect accordingly unless advised otherwise.

Regards,

Motti

Hi,

This is correct. Please implement your use-case based on this. Encoding sub-frame input is not supported.

Thank you DaneLLL.