TX2 CSI to memory latency

A number of topics, listed below, have discussed glass-to-glass latency for the TX1 and TX2. At this point, it seems that the observed latency for the TX1 is less than for the TX2. Several discussions noted that decreasing the buffer queue size decreases the latency. No discussion fully explains why the observed latencies exceed the expected 2 or 3 frames on the TX2, except possibly the post in …1026587… by Nvidia.
Are there currently any full explanations for the observed latencies?

A current industrial application under development uses multiple CSI sensors with a raw 10-bit monochrome output format at various frame sizes and frame rates of at least 60 fps. The application requires single-frame CSI-to-computational-buffer latency.
The question is: Can the TX2 and V4L2 be configured to support single-frame CSI-to-buffer latency?

In attempting to measure latency, one would expect to be able to use the SOF and EOF event timers described in the Video Input chapter (27) of the Parker TRM. As post …1046381… discussed, it is not clear how these timestamps align with the real-time clock. Nor is it clear which of the SOF timestamps queued by the NOTIFY block is the one embedded in the captured buffer, nor is there any evident access to the other queued timestamps.
Following this approach, taking the difference between the SOF timestamp embedded in the buffer and the TSC value read when the buffer becomes available in user space, yields an unrealistically low latency.
Is there a means of accessing all of the timestamps captured by NOTIFY?
What alignment do these timestamps have to any other clock?
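
For concreteness, here is the measurement described above as a minimal C sketch. It assumes a capture fd that is already streaming with MMAP buffers queued; whether the driver's SOF stamp on the TX2 is truly aligned with CLOCK_MONOTONIC is exactly the open question, so the computed delta should be treated with suspicion.

    /* Minimal sketch: estimate SOF-to-user-space latency for one frame.
     * Assumes "fd" is a /dev/video* descriptor already streaming with
     * V4L2_MEMORY_MMAP buffers queued. Whether the driver's SOF stamp is
     * really on CLOCK_MONOTONIC is the open question above. */
    #include <linux/videodev2.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <time.h>

    static void measure_one_frame(int fd)
    {
        struct v4l2_buffer buf;
        struct timespec now;

        memset(&buf, 0, sizeof(buf));
        buf.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
        buf.memory = V4L2_MEMORY_MMAP;

        if (ioctl(fd, VIDIOC_DQBUF, &buf) < 0) {  /* blocks until a frame lands */
            perror("VIDIOC_DQBUF");
            return;
        }
        clock_gettime(CLOCK_MONOTONIC, &now);

        /* buf.timestamp is the driver-provided SOF stamp (a struct timeval). */
        double sof_us = buf.timestamp.tv_sec * 1e6 + buf.timestamp.tv_usec;
        double now_us = now.tv_sec * 1e6 + now.tv_nsec / 1e3;
        printf("SOF -> user space: %.1f us (buffer flags 0x%x)\n",
               now_us - sof_us, buf.flags);

        ioctl(fd, VIDIOC_QBUF, &buf);             /* requeue for the next frame */
    }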

A related question:
Should CHANSEL_LOAD_FRAMED appear in an RTCPU trace of a normal frame, or does it indicate an error?
Specifically, does this flag indicate a premature (early) load command, executed before a frame ends?

https://devtalk.nvidia.com/default/topic/1026587/jetson-tx2/csi-latency-is-over-80-milliseconds-/1

https://devtalk.nvidia.com/default/topic/1046381/jetson-tx2/v4l2_buf_flag_tstamp_src_soe-clock_realtime-timestamping-for-v4l2-frames-/1

https://devtalk.nvidia.com/default/topic/934387/jetson-tx1/one-frame-latency-delay-in-tx1-v4l-stack/post/5060671/#5060671

https://devtalk.nvidia.com/default/topic/1023668/jetson-tx1/nvcamerasrc-queue-size-property-limitation/1

https://devtalk.nvidia.com/default/topic/1029472/jetson-tx2/low-latency-camera-for-jetson-tx2/1

Sorry, I don't really understand the first question.
CHANSEL_LOAD_FRAMED should be an informational message; it should not be an error.

Hello ShaneCCC

It is not clear which question you found unclear, so I will restate both.

Many threads have asked about the minimum achievable latency with the TX2. Most of these ask about latency from the sensor to a display. This question asks about the latency from the CSI input to the availability of the image frame in a buffer. Does the TX2 support single-frame latency? That is, from the arrival of the first pixel at the VI, what is the delay until a buffer containing the full frame is available for processing?

Quoting Parker TRM section 27.7.1 Notifications: “LOAD_FRAMED (many-channel) is emitted when a LOAD command is received for a channel while that channel is currently in a frame.” It seems that this means that LOAD_FRAMED is an error signal that does not occur during normal frame processing, and that it occurs only if the channel receives a LOAD command while the channel is still receiving a frame, that is, before the channel generates the end-of-frame signal.
Is this the correct interpretation of LOAD_FRAMED?
Is CHANSEL_LOAD_FRAMED the v4l2 copy of LOAD_FRAMED?
Assuming that CHANSEL_LOAD_FRAMED is an error signal, what improper configurations could produce this signal?

The full frame period (i.e., from the arrival of the first pixel at the VI until Frame End arrives at the VI) is at the millisecond level; in the 60 fps case, the frame period is about 16 ms.

If the CHANNEL_COMMAND register programs LOAD during VBLANK (i.e., after FE), then LOAD_FRAME is not expected and should be reported as an error.
If instead LOAD is programmed in the middle of a frame (i.e., before FE), then LOAD_FRAME is expected and should not be treated as an error.

By LOAD_FRAME do you mean LOAD_FRAMED?

Assuming so, please explain what you mean by ‘reported as an error’ and ‘treated as an error’.
Do you mean that if a LOAD command occurs during VBLANK and LOAD_FRAMED is emitted, then the VI itself has behaved incorrectly? And that if a LOAD command occurs before FE, then the VI should emit LOAD_FRAMED?

To ask this directly, the question that needs a clear answer: we are seeing LOAD_FRAMED. Does this indicate that a LOAD command occurred before FE? Are there circumstances in which the VI would emit LOAD_FRAMED when the LOAD command occurred after FE?

Also, the main question is:
Can the TX2 and V4L2 be configured to support single-frame CSI-to-buffer latency?

If the LOAD_FRAMED event is reported before the FE when LOAD was programmed, that means VI NOTIFY is reporting an error.

If you mean the sensor outputs only a single frame and then stops, that is not supported; we currently support only streaming mode.

This is what we see:

  1. ATOMP_FS
  2. CHANSEL_PXL_SOF
  3. CHANSEL_LOAD_FRAMED
  4. CHANSEL_PXL_EOF
  5. ATOMP_FE
Does this indicate an error?

Your answer raised several questions:
What sets LOAD? Does a Tegra driver set LOAD? Can you specify which part of the driver sets LOAD?
We use only streaming video, but to understand how the VI interacts with the drivers, I need to ask: what restricts video to streaming mode, the VI hardware or the driver?

LOAD is set by the VI driver, and the driver is also what restricts operation to streaming mode. Refer to vi4_fops.c.

Please answer the other questions in the last message.
Also, the most important questions are the two questions about latency in the first message. Please provide answers to those.

Please state the remaining questions specifically again.

Hi ShaneCCC,

RidgeRun is also interested in the lowest capture latency possible on the TX2 (1 frame), that is, from t0=sensor_puts_image_on_CSI_inputs to t1=frame_is_available_in_user_space. Do you know the lowest latency possible using V4L2 (bypassing the ISP)?

Some questions related to this and to the way to measure this latency based on timestamps of the events from the capture subsystem:
  1. What elements could be adding latency in the TX2 capture subsystem if we use V4L2?
  2. When using V4L2, there is some buffering done by the main V4L2 driver; do you know if it can be configured on the TX2 to achieve single-frame latency? Is it possible even when using multiple sensors?
  3. Is there a way to access the timestamps posted by the NOTIFY block of the video input beyond end of frame?
  4. What alignment do the NOTIFY stamps have to clocks accessible from Linux? Is there a way to correlate both clocks, the one in the RTCPU and the system clock?
  5. Please explain the significance of CHANSEL_LOAD_FRAMED in the RTCPU trace of a captured frame.
  6. Evidently, some process must write to the LOAD bit of VI_CH_CHANNEL_COMMAND_0 to initiate the capture of every frame, at least for normal capture in which the AUTOLOAD bit is not set. Is this correct? If so, which process running on which processor? In particular, is this a frame-by-frame process of V4L2?
  7. Is it possible to configure the TX2 so that data from each frame flows from video inputs to memory, then gets processed by GPUs without having any interaction with any Linux processes until the result of the processing is available from the GPUs?

Thanks,
-David

  1. For VI mode, the latency to memory is at the millisecond level; however, the buffer queueing to user space may depend on the buffer handling of the V4L2 framework and the VI kernel driver.

  2. --set-ctrl low_latency_mode=1 helps with the buffer return, and it supports multiple sensors (see the sketch after this list).

  3. The timestamp should be the NOTIFY stamp of the SOF; we have not made this source public currently.

  4. Refer to this topic: https://devtalk.nvidia.com/default/topic/1056202

  5. You should be able to find this in the TRM.

  6. It's a frame-by-frame process.

  7. It could be a good suggestion; however, the current implementation can't support it. Currently you may consider using userptr for this kind of case.
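
As a concrete illustration of answer 2, the sketch below sets the control from C by looking it up by name, mirroring v4l2-ctl --set-ctrl low_latency_mode=1. The control name is taken from the answer above; whether a given driver build exposes it is an assumption.

    /* Hedged sketch: enable the "low_latency_mode" control from C by
     * looking it up by name, mirroring the v4l2-ctl invocation above.
     * The control name and its availability are assumptions. */
    #include <linux/videodev2.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>

    static int set_ctrl_by_name(int fd, const char *name, int value)
    {
        struct v4l2_queryctrl qc;
        memset(&qc, 0, sizeof(qc));
        qc.id = V4L2_CTRL_FLAG_NEXT_CTRL;

        /* Walk every control the driver exposes until the name matches. */
        while (ioctl(fd, VIDIOC_QUERYCTRL, &qc) == 0) {
            if (!(qc.flags & V4L2_CTRL_FLAG_DISABLED) &&
                strcmp((const char *)qc.name, name) == 0) {
                struct v4l2_control ctrl = { .id = qc.id, .value = value };
                return ioctl(fd, VIDIOC_S_CTRL, &ctrl);
            }
            qc.id |= V4L2_CTRL_FLAG_NEXT_CTRL;
        }
        fprintf(stderr, "control \"%s\" not found\n", name);
        return -1;
    }

    /* usage: set_ctrl_by_name(fd, "low_latency_mode", 1); */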

Follow-up questions, referencing the numbered responses:

  1. Thank you for your answer. Can you elaborate: is the latency in the V4L2 framework or in the VI kernel driver?
  2. Does this support single-frame latency for multiple sensors?
  3. OK, the source is unavailable publicly. But is there any way to get timestamp data beyond SOF from NOTIFY?
  4. OK, so is this problem solved in Jetpack 4.2.x?
  5. The TRM explains the hardware meaning of CHANSEL_LOAD_FRAMED, but it does not explain its significance in the trace of a captured frame. Specifically, does CHANSEL_LOAD_FRAMED in a captured frame indicate an error?
  6. By which processor / process, the RTCPU or e.g. the VI kernel driver?
  7. Why couldn't the current implementation support data flow from memory directly to the GPU?

  1. Thank you for your answer. Can you elaborate: is the latency in the V4L2 framework or in the VI kernel driver?

What I said earlier is the VI kernel driver.

  2. Does this support single-frame latency for multiple sensors?

I really don't understand the question. A real use case would help.

  3. OK, the source is unavailable publicly. But is there any way to get timestamp data beyond SOF from NOTIFY?

The RTCPU will send the SOF timestamp to the VI driver, and the VI driver will convert it to kernel time.

  4. OK, so is this problem solved in Jetpack 4.2.x?

The latest release, r32.3.1, includes it.

  5. The TRM explains the hardware meaning of CHANSEL_LOAD_FRAMED, but it does not explain its significance in the trace of a captured frame. Specifically, does CHANSEL_LOAD_FRAMED in a captured frame indicate an error?

Given how the current VI driver programs the LOAD sequence, it could be an error.

  6. By which processor / process, the RTCPU or e.g. the VI kernel driver?

It's firmware running on a separate companion CPU; you can't see it in the main system.

  7. Why couldn't the current implementation support data flow from memory directly to the GPU?

Refer to the multimedia_api sample 12_camera_v4l2_cuda for the buffer transfer to the GPU.
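
For orientation, here is a minimal sketch of the userptr path suggested earlier. This is not the code of the 12_camera_v4l2_cuda sample (which may use a different buffer mechanism); it only illustrates how application-owned, page-aligned buffers can be queued to V4L2 so the same pointers can later be pinned for the GPU (e.g., with cudaHostRegister). NBUF and image_size are illustrative.

    /* Hedged sketch of a V4L2 USERPTR setup: the application owns the
     * frame memory, so the same pointer handed to V4L2 can later be
     * handed to CUDA. Not the sample's actual code. */
    #include <linux/videodev2.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/ioctl.h>

    #define NBUF 3

    static int setup_userptr(int fd, size_t image_size, void *bufs[NBUF])
    {
        struct v4l2_requestbuffers req;
        memset(&req, 0, sizeof(req));
        req.count = NBUF;
        req.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
        req.memory = V4L2_MEMORY_USERPTR;
        if (ioctl(fd, VIDIOC_REQBUFS, &req) < 0)
            return -1;

        for (unsigned i = 0; i < NBUF; i++) {
            /* Page-aligned allocation so the buffer could also be pinned
             * for the GPU with cudaHostRegister() later. */
            if (posix_memalign(&bufs[i], 4096, image_size) != 0)
                return -1;

            struct v4l2_buffer buf;
            memset(&buf, 0, sizeof(buf));
            buf.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
            buf.memory = V4L2_MEMORY_USERPTR;
            buf.index = i;
            buf.m.userptr = (unsigned long)bufs[i];
            buf.length = image_size;
            if (ioctl(fd, VIDIOC_QBUF, &buf) < 0)
                return -1;
        }
        return 0;
    }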

Hello ShaneCCC,

Thank you for your answers.
Following up:

  1. Since the VI kernel driver causes the latency in the TX2 image capture system under V4L2, would using the Linux PREEMPT-RT patch and/or other real-time extensions reduce latency and/or make latency more predictable?

  2. Please consider a system with 3 image sensors, each using 4 MIPI CSI lanes. This will use all 12 available CSI lanes. All sensors will capture images continuously, although not necessarily simultaneously. The goal is single frame latency for all 3 image streams; every frame buffer for all 3 sensors should be available for processing (well) before the sensor finishes sending the next frame. Can the VI support this operation?
    2.1 A related question: what is the maximum aggregate data transfer rate that the VI will support from the 12 CSI lanes to memory, particularly for this case of 3 sensors of 4 lanes each?

  3. From our conversation, I understand that although NOTIFY collects many timestamps through the VI NOTIFY data path, only the SOF timestamp is available through the VI driver. Does the RTCPU send the other timestamps, which the VI driver then ignores, or does the RTCPU send only the SOF timestamp?
    3.1 According to Chapter 27 of the Parker TRM, NOTIFY collects many other timestamps, event signals and fault signals. Is there any way to access these?
    3.2 According to Chapter 27 of the Parker TRM, CSIMUX, CHANSEL and ATOMP all produce some version of SOF. Which one of these is the SOF that the VI kernel reports?

  4. Thank you.

  5. Please explain: under what circumstances would CHANSEL_LOAD_FRAMED occur but not indicate an error?

  6. As I understand it, you explained that the LOAD command is a frame-by-frame process run by a processor that is not accessible from the Linux CPU. Is this correct?
    6.1 If so, can you say which processor?
    6.2 And, more importantly, what interaction does the VI kernel driver have with the process or processor that runs the LOAD command?

  7. Thank you. To clarify, do you mean that the sample directory in https://developer.nvidia.com/embedded/L4T/r28_Release_v3.2/t186ref_release_aarch64/Tegra_Multimedia_API_R28.3.2_aarch64.tbz2 contains 12_camera_v4l2_cuda, which illustrates buffer transfer to the GPU?

  1. Not sure, but I think the improvement would be small.
  2. Those three sensors run on different threads; it should be the same as the one-sensor case.
    2.1 I suppose you are concerned about the MIPI rate rather than VI to memory.
  3. The RTCPU sends a lot of events, but currently only the SOF is taken as the timestamp.
    3.1 Currently only the trace log.
    3.2 It could be the time at which CSIMUX receives the SOF packet from the sensor.
  4. I can't give any example now.
  5. No, the VI driver runs on the Linux CPU and can be accessed by the Linux CPU too. Check vi4_fops.c.
  6. Yes.

Hello ShaneCCC,

2.1 Parker TRM table 366 lists a peak NVCSI bandwidth of 30 Gbps. Is this for MIPI to VI? What is the maximum aggregate transfer rate from VI to memory?
3.1 What is the source of the data for the trace log? Can you point to some documentation?
3.2 I have some evidence that the SOF is incorrect. I will present that evidence later.

Thank you,
M.Reich

2.1 Why are you concerned about VI to memory? That is a question of memory bandwidth, and the VI transfers to memory line by line, not frame by frame, so VI to memory won't be the bottleneck here.

3.1 Check …/kernel/nvidia/drivers/platform/tegra/rtcpu/vi-notify.c

Hi ShaneCCC,

I think that the ultimate question is:

a) Is it possible to get 1-frame latency (a single frame of latency) from t0=sensor_puts_image_on_CSI_inputs to t1=frame_is_available_in_user_space on the TX2?

If so, b) how can we measure that it is only one frame? This is why it matters how the RTCPU puts the timestamp in SOF and whether there are other timestamps that can be used to measure this. Moreover, if this is not possible, c) is it because of a bandwidth limitation on the VI?

From ShaneCCC's answers, my understanding is that the VI hardware module has enough bandwidth to move the data fast; however, the VI driver/V4L2 uses a queue (a FIFO, I believe) to pass the captured buffers to user space, and this FIFO adds latency. The task is then to reduce the FIFO to its minimum size (more threads don't help), ideally use a user pointer at the V4L2 level, and then pass that buffer directly to the GPU (see the sketch below). d) Is there some way to tell the VI to pass lines as they are filled rather than waiting for a whole buffer?
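
A minimal sketch of the first step, shrinking the V4L2 queue: the driver may silently raise the requested count to its internal minimum, and whatever it grants is part of the latency floor described above.

    /* Hedged sketch: request the smallest V4L2 queue the driver will
     * accept. The driver may raise req.count to its internal minimum,
     * so the granted value must be checked after the ioctl. */
    #include <linux/videodev2.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>

    static int request_min_queue(int fd)
    {
        struct v4l2_requestbuffers req;
        memset(&req, 0, sizeof(req));
        req.count = 2;                           /* ask for the bare minimum */
        req.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
        req.memory = V4L2_MEMORY_MMAP;
        if (ioctl(fd, VIDIOC_REQBUFS, &req) < 0) {
            perror("VIDIOC_REQBUFS");
            return -1;
        }
        printf("driver granted %u buffers\n", req.count);
        return (int)req.count;
    }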

-David

Currently, VI mode with low_latency_mode=1 gives 9x ms at 30 fps output. That could be the best achievable with the current VI FIFO buffering. However, you can try to modify the VI driver to get one-frame latency.