Tegra-capture-vi corr_err occurring in JetPack 5.1.2

I am using an OV9281 grayscale MIPI camera (CSI 2 lane, global shutter) with an AverMedia DL131L carrier board, which is very similar to the Orin Nano dev. kit. The camera works for a while (minutes to hours) but then stops producing new frames. The dmesg log shows tegra-capture-vi errors when this happens, such as:

[ 2318.292444] tegra-camrtc-capture-vi tegra-capture-vi: corr_err: discarding frame 10538, flags: 0, err_data 4194402
[ 2386.939657] tegra-camrtc-capture-vi tegra-capture-vi: corr_err: discarding frame 18819, flags: 0, err_data 4194402

If I restart my application then the camera starts working again. I am using JetPack 5.1.2 on a Jetson Orin Nano. I’ve used the same sensor in the past on a Xavier NX with JetPack 4.6.3 with no issues. The application uses Video For Linux 2 rather than libargus because it is a grayscale camera (Y10 format), not a Bayer color camera and libargus does not support grayscale cameras.

I need the camera to work reliably for long periods of time. Is there a typical reason this error occurs and is there something that can be done about it? (e.g. adjusting device tree parameters, recovery from the error, etc.) I tried increasing the number of queued buffers in V4L2 but the problem still occurs. I also tried boosting the vi, isp and nvcsi clocks to no avail. I have attached a dmesg log and trace log.

trace.log (4.5 MB)
dmesg.log (82.9 KB)

It looks like there is a CRC error (err_intr_stat_pd_crc_err_vc0), after which the FE (frame end) packets start getting lost, and that causes the failure.

[001] ....  1194.236796: rtcpu_nvcsi_intr: tstamp:37820517966 class:GLOBAL type:STREAM_VC phy:0 cil:0 st:2 vc:0 status:0x00000004
     kworker/1:1-45      [001] ....  1194.236796: rtcpu_nvcsi_intr: tstamp:37820517966 class:CORRECTABLE_ERR type:STREAM_VC phy:0 cil:0 st:2 vc:0 status:0x00000004
     kworker/1:1-45      [001] ....  1194.236797: rtcpu_vinotify_error: tstamp:37820693991 cch:0 vi:1 tag:CSIMUX_FRAME channel:0x00 frame:1905 vi_tstamp:1210261941856 data:0x0000077100400062
     kworker/1:1-45      [001] ....  1194.236797: rtcpu_nvcsi_intr: tstamp:37820776462 class:GLOBAL type:STREAM_VC phy:0 cil:0 st:2 vc:0 status:0x00000004
     kworker/1:1-45      [001] ....  1194.236797: rtcpu_nvcsi_intr: tstamp:37820776462 class:CORRECTABLE_ERR type:STREAM_VC phy:0 cil:0 st:2 vc:0 status:0x00000004
     kworker/1:1-45      [001] ....  1194.236798: rtcpu_nvcsi_intr: tstamp:37820777441 class:GLOBAL type:STREAM_VC phy:0 cil:0 st:2 vc:0 status:0x00000004
     kworker/1:1-45      [001] ....  1194.236798: rtcpu_vinotify_error: tstamp:37820954815 cch:0 vi:1 tag:CSIMUX_FRAME channel:0x00 frame:1906 vi_tstamp:1210270231776 data:0x0000077200400062
     kworker/1:1-45      

err_intr_stat_pd_crc_err_vc0

I see. So if I understand correctly, the status:0x00000004 on the first line refers to the contents of the NVCSI_STREAM_0_CORRECTABLE_ERR_INTR_STATUS_VC0_0 register from the TRM. Since bit 2 is set, that means CRC error err_intr_stat_pd_crc_err_vc0. Thank you for the information. A CRC error could be caused by noise on the CSI lines, I presume, but unfortunately I don’t think my scope is fast enough to detect this. Are there camera parameters (or multimedia complex parameters) in the device tree that could cause this kind of error if they are not tuned correctly? (or that could be tuned to minimize the likelihood of errors?)
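
For anyone reading the logs the same way, here is a minimal C sketch of the decoding described above. The bit position of the CRC flag comes from this thread (bit 2 of the status word). The split of the 64-bit vinotify `data` field into a frame number (upper 32 bits) and an error word (lower 32 bits) is only an inference from the logs posted here, where `data:0x0000077100400062` lines up with `frame:1905` and with the `err_data 4194402` format printed in dmesg; it is not taken from the TRM. All function and macro names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 2 of the correctable-error status word, which this thread identifies
 * as err_intr_stat_pd_crc_err_vc0. The macro name is hypothetical. */
#define NVCSI_PD_CRC_ERR_VC0_BIT 2u

/* True when the CRC error bit is set in an rtcpu_nvcsi_intr status value. */
static bool nvcsi_crc_err(uint32_t status)
{
    return (status >> NVCSI_PD_CRC_ERR_VC0_BIT) & 1u;
}

/* ASSUMPTION (inferred from the logs above, not from documentation):
 * the upper 32 bits of the vinotify data field hold the frame number. */
static uint32_t vinotify_frame(uint64_t data)
{
    return (uint32_t)(data >> 32);
}

/* ASSUMPTION (same inference): the lower 32 bits hold the value that
 * dmesg prints in decimal as err_data. */
static uint32_t vinotify_err_data(uint64_t data)
{
    return (uint32_t)(data & 0xFFFFFFFFu);
}
```

With the values from the trace, `nvcsi_crc_err(0x00000004)` is true, `vinotify_frame(0x0000077100400062ULL)` is 1905, and `vinotify_err_data(0x0000077100400062ULL)` is 4194402 (0x400062).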

Also, the second line seems to suggest that this is a correctable error. Are there ways in V4L2 to detect and recover from a CRC error detected in NVCSI? I can afford to lose a frame or two, but not to have the camera freeze until the application is restarted, so recovering from a frame lost to a CRC error would be an acceptable solution.

You could monitor the register NVCSI_STREAM_0_CORRECTABLE_ERR_INTR_STATUS_VC0_0 and re-initialize the sensor when the error appears, so that VI can wait and recover. Note that this could lose more than just two frames.

Thank you. I will think about how to monitor that register but in the meantime, I noticed from section 2.5 of the Orin TRM that it says:

2.5 Line/Frame CRC Check
CSI protocol natively supports the line CRC check for all types included in the embedded line packets and pixel line packets. NVCSI follows the CSI specification v2.0 to implement the standard packet level CRC check. NVCSI only processes the packet level (line level) information. It does not have any view of the frame. VI or software handles the Frame CRC. When line CRC error is detected, a fault interrupt line asserts to the hardware safety manager (HSM). The line CRC error can be set to correctable fault or uncorrectable fault by safety software. The line CRC error means that there is one or more pixels incorrect in a line. Hardware reset is not required for this error if it is just a transient error in NVCSI and the next frame is correct from sensor, it can recover automatically. Software can set a threshold of this CRC error. If the CRC error continues to occur on multiple lines, this may indicate a problem in the receiver or in the transmission lines and a hardware reset of NVCSI may be required. This error status is also sent to VI, and the safety software can get error status from either the NVCSI fault or VI notifies.

Is there a threshold that can be set somewhere in the device tree to allow a certain number of these “correctable errors” without stopping the camera? I also notice in section 2.1.1 that end-of-frame signals can be automatically generated by the hardware if necessary according to a timeout, or forced manually. Are there device tree entries for these kinds of options? Sorry to keep asking about the device tree but the camera interface at the application level is V4L2 while the error you pointed out is at the very low hardware level. I thought the NVIDIA drivers for the multimedia complex might have some options to help with the handling of “correctable” errors automatically. I still find it puzzling that this camera worked fine on a Xavier NX (with JetPack 4.6.3) but I’m getting these errors on an Orin Nano (with JetPack 5.1.2).

Sorry, but there is currently no device tree configuration for this kind of case.

While this is disappointing, thank you for the update. I appreciate you taking the time to look into it.
