AGX Orin camera issues

Hi,

I’m debugging camera issues on an AGX Orin 32GB with JetPack 5.1.3 on a custom mainboard. On a few Orin modules out of dozens, the camera regularly stops working and cannot recover. Depending on the module in use, the kernel log usually shows one of the following traces:

[ 4332.591788] tegra-camrtc-capture-vi tegra-capture-vi: corr_err: discarding frame 1, flags: 0, err_data 131072
[ 4332.616797] tegra-camrtc-capture-vi tegra-capture-vi: corr_err: discarding frame 2, flags: 0, err_data 131072
[ 4332.641986] tegra-camrtc-capture-vi tegra-capture-vi: corr_err: discarding frame 1, flags: 0, err_data 131072
[ 4332.666860] tegra-camrtc-capture-vi tegra-capture-vi: corr_err: discarding frame 2, flags: 0, err_data 131072
with the ftrace log showing:
kworker/2:2-1458 [002] … 2419.364993: rtcpu_nvcsi_intr: tstamp:76386848248 class:GLOBAL type:STREAM_NOVC phy:0 cil:0 st:2 vc:0 status:0x00000001
kworker/2:2-1458 [002] … 2419.364993: rtcpu_nvcsi_intr: tstamp:76386848248 class:GLOBAL type:STREAM_VC phy:0 cil:0 st:2 vc:0 status:0x00000006
kworker/2:2-1458 [002] … 2419.364994: rtcpu_nvcsi_intr: tstamp:76386848248 class:CORRECTABLE_ERR type:STREAM_NOVC phy:0 cil:0 st:2 vc:0 status:0x00000001
kworker/2:2-1458 [002] … 2419.364994: rtcpu_nvcsi_intr: tstamp:76386848248 class:CORRECTABLE_ERR type:STREAM_VC phy:0 cil:0 st:2 vc:0 status:0x00000006

or

[ 290.338933] [RCE] ERROR: camera-ip/vi5/vi5.c:745 [vi5_handle_eof] "General error queue is out of sync with frame queue. ts=314796051360 sof_ts=314796213952 gerror_code=2 gerror_data=600064 notify_bits=30000"
[ 292.956415] tegra-camrtc-capture-vi tegra-capture-vi: uncorr_err: request timed out after 2500 ms
[ 292.957735] tegra-camrtc-capture-vi tegra-capture-vi: err_rec: attempting to reset the capture channel
[ 292.962227] (NULL device *): vi_capture_control_message: NULL VI channel received
[ 292.962451] t194-nvcsi 13e40000.host1x:nvcsi@15a00000: csi5_stream_close: Error in closing stream_id=4, csi_port=4
[ 292.962757] (NULL device *): vi_capture_control_message: NULL VI channel received
[ 292.963268] tegra-camrtc-capture-vi tegra-capture-vi: err_rec: successfully reset the capture channel

or

[ 344.371620] [RCE] VM0 deactivating.VM0 activating.VM0 deactivating.VM0 activating.BUG: core/watchdog/heartbeat-task.c:162 [heartbeat_halt_execution] "*** RCE WATCHDOG FAILURE: HALTING ***"
[ 344.381599] tegra186-cam-rtcpu bc00000.rtcpu: Alert: Camera RTCPU gone bad! restoring it immediately!!
[ 346.949656] tegra-camrtc-capture-vi tegra-capture-vi: uncorr_err: request timed out after 2500 ms
[ 346.950744] tegra-camrtc-capture-vi tegra-capture-vi: err_rec: attempting to reset the capture channel
[ 346.955222] (NULL device *): vi_capture_control_message: NULL VI channel received
[ 346.955445] t194-nvcsi 13e40000.host1x:nvcsi@15a00000: csi5_stream_close: Error in closing stream_id=2, csi_port=2
[ 346.955737] (NULL device *): vi_capture_control_message: NULL VI channel received

or

[ 350.339953] tegra194-vi5 13e40000.host1x:vi1@14c00000: capture control message timed out
[ 351.363445] tegra194-vi5 13e40000.host1x:vi1@14c00000: capture control message timed out
[ 351.363712] tegra194-vi5 13e40000.host1x:vi1@14c00000: csi_stream_release: failed to disable nvcsi tpg on stream 2 virtual channel 0
[ 352.386938] tegra194-vi5 13e40000.host1x:vi1@14c00000: capture control message timed out
[ 352.387198] tegra194-vi5 13e40000.host1x:vi1@14c00000: vi_capture_release: release channel IVC failed

We have tried some of the tricks mentioned in other related threads, e.g. boosting the Jetson and locking the VI and NVCSI clocks (we are not using the Jetson ISP), increasing pix_clk (our sensor setup doesn’t have a serializer-deserializer), and changing timings and delays in our sensor driver. The issues seem to be tied to specific Orin modules: replacing a module on the mainboard with a known working one fixes them, which suggests that the mainboard is OK. The faults can be reproduced systematically.

Any ideas or suggestions on how to proceed would be appreciated.

hello jetson-developer,

according to below…

may I know the differences between, or the SKUs of, the working and non-working modules?

Hello JerryChang,

At least two of the faulty modules report product part number 699-13701-0004-500 P.0 in EEPROM; the order sheet shows part number 900-13701-0040-000. We also have working units with the same SKU and revision, as well as working units from revisions G.0 and A.0.

hello jetson-developer,

may I have more details about the camera module you’re using?
is it a YUV camera sensor, since you’re neither using the Jetson ISP nor working with a SerDes chip?

We are using a custom camera board outputting 10-bit RAW from a ~12MP grayscale sensor at ~40 fps over D-PHY 1.1. Multiple cameras are connected to a single module with a custom FPC. The camera boards have also been cross-tested with known working Orins, and the cameras work as expected with those.

hello jetson-developer,

so, it is always the same specific Orin modules that cannot enable camera streaming.
is the issue related to multi-cam as well?

It doesn’t seem to be related to having multiple cameras on our system. Our application regularly uses two simultaneously streaming cameras, and I cannot recall these kinds of issues affecting more than one camera per module. Moreover, the errors don’t follow a problematic camera when it is swapped with one of the device’s other cameras.

Typically, the cameras can be initialized and enabled as usual, and streaming works fine for some time. The problems start a bit later; in particular, the NULL VI channel received errors seem to appear more or less randomly after a while, somewhere in our application’s streaming sequence. Depending on the module, the NULL VI channel errors may end in a kernel panic that crashes the device, although this cannot be reproduced consistently on all of the modules. The tegra-camrtc-capture-vi tegra-capture-vi: corr_err log can be reproduced reliably on one of the modules. At least on that module, the issue has been isolated to a failure in starting a stream on one of the cameras, from which the device cannot recover: stopping the stream and retrying, or even reinitializing the sensor, doesn’t seem to help.

Any ideas on how to proceed?

hello jetson-developer,

is it a DPHY or CPHY sensor?
may I know what data rate it’s running at? please also evaluate whether it’s approaching the ISP throughput.

Hi JerryChang,

Our sensor boards output D-PHY over four lanes on separate bricks with a data rate of ~1.3 Gbps per lane (well below the 2.5 Gbps maximum, and even below 1.5 Gbps per lane, so it shouldn’t require deskew calibration). The maximum concurrent throughput directly via VI (we are not using the ISP) would be around 10 Gbps for the two-camera use case and won’t exceed 20 Gbps at any point. Failures have been recorded with only one camera streaming at ~5 Gbps total throughput.
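
The figures above can be cross-checked with a little arithmetic. The sketch below is only a back-of-the-envelope estimate: it uses the approximate values quoted in this thread (four lanes at ~1.3 Gbps, ~12 MP, 10-bit RAW, ~40 fps), ignores blanking and CSI-2 protocol overhead, and the function names are made up for illustration.

```python
def link_rate_gbps(lanes: int, gbps_per_lane: float) -> float:
    """Raw D-PHY link capacity for one camera, all lanes combined."""
    return lanes * gbps_per_lane


def pixel_rate_gbps(megapixels: float, bits_per_pixel: int, fps: float) -> float:
    """Payload rate of the pixel data alone (no blanking, no packet overhead)."""
    return megapixels * 1e6 * bits_per_pixel * fps / 1e9


# Approximate values quoted in the thread; treat them as assumptions.
per_camera_link = link_rate_gbps(lanes=4, gbps_per_lane=1.3)
per_camera_payload = pixel_rate_gbps(megapixels=12, bits_per_pixel=10, fps=40)

print(f"link capacity per camera : {per_camera_link:.1f} Gbps")
print(f"pixel payload per camera : {per_camera_payload:.1f} Gbps")
print(f"two cameras, link rate   : {2 * per_camera_link:.1f} Gbps")
```

With these inputs the link capacity comes to about 5.2 Gbps per camera and the pixel payload to about 4.8 Gbps, in line with the ~5 Gbps single-camera and ~10 Gbps two-camera figures above.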

please try applying the pre-built update from Topic 284939 to enable the infinite timeout property, and let’s check whether it helps with your issues.
you may also see the developer guide on enabling Infinite Timeout Support.

Our application uses VI directly via V4L2. We have tried increasing CAPTURE_TIMEOUT_MS in vi5_fops and also setting the timeout to infinite, but that does not seem to help.

hello jetson-developer,

let’s dig into the low-level driver for more details.
please follow the steps below to enable VI tracing logs.

echo 1 > /sys/kernel/debug/tracing/tracing_on
echo 30720 > /sys/kernel/debug/tracing/buffer_size_kb
echo 1 > /sys/kernel/debug/tracing/events/tegra_rtcpu/enable
echo 1 > /sys/kernel/debug/tracing/events/freertos/enable
echo 2 > /sys/kernel/debug/camrtc/log-level
echo > /sys/kernel/debug/tracing/trace
cat /sys/kernel/debug/tracing/trace
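
Once a trace has been captured with the steps above, the rtcpu_nvcsi_intr events can be tallied to see which error classes dominate. The sketch below is a rough helper, assuming the events look like the rtcpu_nvcsi_intr lines quoted in the first post (key:value pairs after the event name). The function name and the choice of grouping keys are illustrative and not part of any NVIDIA tool.

```python
import re
from collections import Counter

# Matches key:value pairs such as "class:CORRECTABLE_ERR" or "status:0x00000006".
FIELD_RE = re.compile(r"(\w+):(\S+)")


def tally_nvcsi_intr(lines):
    """Count rtcpu_nvcsi_intr events grouped by (class, type, stream, status)."""
    counts = Counter()
    for line in lines:
        _, marker, payload = line.partition("rtcpu_nvcsi_intr:")
        if not marker:
            continue  # not an NVCSI interrupt event
        fields = dict(FIELD_RE.findall(payload))
        key = (fields.get("class"), fields.get("type"),
               fields.get("st"), fields.get("status"))
        counts[key] += 1
    return counts


# Two sample lines in the format quoted earlier in the thread.
sample = [
    "kworker/2:2-1458 [002] … 2419.364993: rtcpu_nvcsi_intr: tstamp:76386848248 "
    "class:GLOBAL type:STREAM_NOVC phy:0 cil:0 st:2 vc:0 status:0x00000001",
    "kworker/2:2-1458 [002] … 2419.364994: rtcpu_nvcsi_intr: tstamp:76386848248 "
    "class:CORRECTABLE_ERR type:STREAM_VC phy:0 cil:0 st:2 vc:0 status:0x00000006",
]
for key, n in tally_nvcsi_intr(sample).items():
    print(n, key)
```

For a saved trace, the same function can be fed an open file object, since it only iterates over lines.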

Hi,

Traces from the module showing RCE WATCHDOG FAILURE are attached. I also tested setting the timeout to infinite, since dmesg shows a request timeout with this module; in this case the device was left hanging, as expected based on our earlier tests.

rce_trace.log (1.0 MB)
rce_trace_full.log (9.4 MB)

hello jetson-developer,

please try the kernel patch from Topic 258971, which adds a semaphore.

The semaphore patch doesn’t seem to fix our problem, as I’m still getting

[ 298.849263] tegra-camrtc-capture-vi tegra-capture-vi: corr_err: discarding frame 2, flags: 0, err_data 131072
[ 298.999499] tegra-camrtc-capture-vi tegra-capture-vi: corr_err: discarding frame 1, flags: 0, err_data 131072
[ 299.049497] tegra-camrtc-capture-vi tegra-capture-vi: corr_err: discarding frame 1, flags: 0, err_data 131072
[ 299.124488] tegra-camrtc-capture-vi tegra-capture-vi: corr_err: discarding frame 2, flags: 0, err_data 131072
[ 299.149592] tegra-camrtc-capture-vi tegra-capture-vi: corr_err: discarding frame 2, flags: 0, err_data 131072
[ 299.249743] tegra-camrtc-capture-vi tegra-capture-vi: corr_err: discarding frame 2, flags: 0, err_data 131072
[ 299.274764] tegra-camrtc-capture-vi tegra-capture-vi: corr_err: discarding frame 1, flags: 0, err_data 131072
[ 299.299883] tegra-camrtc-capture-vi tegra-capture-vi: corr_err: discarding frame 2, flags: 0, err_data 131072
[ 299.324793] tegra-camrtc-capture-vi tegra-capture-vi: corr_err: discarding frame 1, flags: 0, err_data 131072

constantly after streaming for a while
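
The kernel timestamps in these corr_err lines also show how often frames are dropped. As a rough sketch, assuming the ~40 fps frame rate mentioned earlier (a 25 ms frame period), the gaps between consecutive discards can be expressed in frame periods. The regular expression and the function name are illustrative.

```python
import re

# Captures the kernel timestamp (seconds) of a corr_err "discarding frame" line.
DISCARD_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s.*corr_err: discarding frame")


def discard_gaps_in_frames(lines, fps=40.0):
    """Return gaps between consecutive discards, rounded to whole frame periods."""
    stamps = [float(m.group(1)) for m in map(DISCARD_RE.search, lines) if m]
    return [round((b - a) * fps) for a, b in zip(stamps, stamps[1:])]


# First four lines of the log quoted above.
log = """\
[ 298.849263] tegra-camrtc-capture-vi tegra-capture-vi: corr_err: discarding frame 2, flags: 0, err_data 131072
[ 298.999499] tegra-camrtc-capture-vi tegra-capture-vi: corr_err: discarding frame 1, flags: 0, err_data 131072
[ 299.049497] tegra-camrtc-capture-vi tegra-capture-vi: corr_err: discarding frame 1, flags: 0, err_data 131072
[ 299.124488] tegra-camrtc-capture-vi tegra-capture-vi: corr_err: discarding frame 2, flags: 0, err_data 131072
""".splitlines()

print(discard_gaps_in_frames(log))
```

For the four lines sampled here the gaps come out as 6, 2 and 3 frame periods, so discards are frequent but not on every frame. In the first trace of the thread the discards are about 25 ms apart, i.e. one per frame at ~40 fps.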

I tested the semaphore patch on another faulty device; no luck with that either.

[  345.421109] tegra186-cam-rtcpu bc00000.rtcpu: Alert: Camera RTCPU gone bad! restoring it immediately!!
[  345.423320] [RCE] VM0 deactivating.VM0 activating.VM0 deactivating.VM0 activating.BUG: core/watchdog/heartbeat-task.c:162 [heartbeat_halt_execution] "*** RCE WATCHDOG FAILURE: HALTING ***"

may I also confirm which CSI brick you’re using?

We have seen issues on SCILs 0 to 2 (CSI 0 to 5 / AB, CD and EF). The SCIL 1 problem can be easily replicated on one of the modules; the issues do not seem to affect one brick more than the others.

I tested the semaphore fix on a third faulty module and got

[ 1883.968405] tegra-camrtc-capture-vi tegra-capture-vi: corr_err: discarding frame 1, flags: 0, err_data 4194404
[ 1883.993429] tegra-camrtc-capture-vi tegra-capture-vi: corr_err: discarding frame 2, flags: 0, err_data 4194404
[ 1884.018468] tegra-camrtc-capture-vi tegra-capture-vi: corr_err: discarding frame 1, flags: 0, err_data 4194404
[ 1884.043486] tegra-camrtc-capture-vi tegra-capture-vi: corr_err: discarding frame 2, flags: 0, err_data 4194404

ViNotifyErrorTag in camrtc-capture.h would give

CAPTURE_STATUS_NOTIFY_BIT_CSIMUX_STREAM_FIFO_OVERFLOW
CAPTURE_STATUS_NOTIFY_BIT_CSIMUX_FRAME_RESERVED_1
CAPTURE_STATUS_NOTIFY_BIT_CSIMUX_FRAME_PXL_ENABLE_FAULT
CAPTURE_STATUS_NOTIFY_BIT_CSIMUX_FRAME_FS_FAULT

whereas err_data 131072 gives

CAPTURE_STATUS_NOTIFY_BIT_CSIMUX_FRAME_CSI_FAULT_PD_CRC_ERR

on two other modules.

Am I reading the mask correctly?

hello jetson-developer,

the content of err_data depends on the value of CaptureStatusCodes.
please see also $public_sources/kernel_src/kernel/nvidia/include/soc/tegra/camrtc-capture.h for capture_status.
so, err_data=4194404 represents the flags below…
CAPTURE_STATUS_NOTIFY_BIT_CSIMUX_FRAME_PXL_ENABLE_FAULT
CAPTURE_STATUS_NOTIFY_BIT_CSIMUX_FRAME_RESERVED_1
CAPTURE_STATUS_NOTIFY_BIT_CSIMUX_STREAM_FIFO_OVERFLOW
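
The mask arithmetic in this exchange can be double-checked independently of the header. The sketch below only lists which bit positions are set in an err_data value. It deliberately does not map positions to CAPTURE_STATUS_NOTIFY_BIT_* names, since those must be taken from the camrtc-capture.h that matches the installed release. The helper name set_bits is made up for illustration.

```python
def set_bits(value: int) -> list:
    """Return the positions (0 = least significant) of all bits set in value."""
    return [pos for pos in range(value.bit_length()) if (value >> pos) & 1]


# The two err_data values reported in this thread.
for err_data in (131072, 4194404):
    print(f"err_data {err_data} = {err_data:#x} -> bits {set_bits(err_data)}")
```

This reports bit 17 for 131072 (0x20000) and bits 2, 5, 6 and 22 for 4194404 (0x400064), so the second value has four bits set, while the reply above names three flags.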