Camera RTCPU crash: task "isp5-irq" has been detected as OOB - halting execution

I’m working on improving the error resiliency of our camera stack, and have constantly run into the RTCPU crashing.

It seems to happen when the log level is increased to 2 (echo 2 > /sys/kernel/debug/camrtc/log-level) and some sort of MIPI error occurs.

[  199.164546] [RCE] VM0 deactivating.VM0 activating.ERROR: core/watchdog/heartbeat-task.c:195 [WatchdogCallbackTaskOutOfBounds] "task "isp5-irq" has been detected as OOB - halting execution"
[  199.164559] [RCE] BUG: core/watchdog/heartbeat-task.c:162 [heartbeat_halt_execution] "*** RCE WATCHDOG FAILURE: HALTING ***"
[  199.210299] tegra186-cam-rtcpu bc00000.rtcpu: Alert: Camera RTCPU gone bad! restoring it immediately!!
[  201.749753] imx530 36-001a: v4l2sd_stream++ enable 0
[  201.829712] imx530 36-001a: camera_common_dpd_enable: csi 6
[  201.829723] imx530 36-001a: camera_common_dpd_enable: csi 7
[  201.829727] imx530 36-001a: camera_common_mclk_disable: disable MCLK

When the same kind of error happens with the log level at 0, the firmware does not crash like this.

I’m attempting to augment the VI driver so it can report counts of various errors (CRC, single bit rec, correctable), and found that increasing the log level is the only way these are exposed to the host CPU. I’m then intercepting the CSIMUX_FRAME and CORRECTABLE_ERR messages and storing counters.
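
To make the counting idea concrete, here is a rough user-space sketch of the same approach: scan kernel log lines and tally the RCE messages by tag. It assumes the messages show up on lines carrying the "[RCE]" prefix and contain the literal tokens CSIMUX_FRAME or CORRECTABLE_ERR; the exact message layout is not shown in this thread, so the matching is deliberately loose. The function name count_rce_tags is hypothetical.

```python
import re
from collections import Counter

# Only these two tags are named in this thread; extend the tuple as needed.
TAGS = ("CSIMUX_FRAME", "CORRECTABLE_ERR")
TAG_RE = re.compile(r"\b(" + "|".join(TAGS) + r")\b")


def count_rce_tags(lines):
    """Return a Counter of tag -> occurrences across RCE firmware log lines.

    Assumption: RCE messages carry the "[RCE]" prefix seen in the log
    excerpt above; lines from other drivers are ignored.
    """
    counts = Counter()
    for line in lines:
        if "[RCE]" not in line:
            continue  # skip lines that did not come from the RTCPU firmware
        for tag in TAG_RE.findall(line):
            counts[tag] += 1
    return counts
```

Calling count_rce_tags on the lines of a captured dmesg dump gives per-tag totals. This still depends on the raised log level, so it does not avoid the crash described above.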

Is there a better way to do this?

hello russell9,

May I know which JetPack public release version you’re using now?
What’s the actual use case for testing error resiliency?

Yeah, sorry, I should have included that.

We’re essentially using 35.6.2 for everything relevant here (it’s technically 35.5.0 with the VI firmware from 35.6.2 and all of the VI-related fixes cherry-picked into the kernel).

The use case is we’ve been seeing occasional failed frames and are trying to characterize the system so we can improve our cabling/SI/EMI etc.

hello russell9,

I would like to have more details.
For instance, please refer to Camera Architecture Stack.
May I know what the capture pipeline is? Are you using the standard V4L2 IOCTLs, or are you using libargus?

We are using Argus through libargus.

hello russell9,

It’s a callback invoked by the WDT (watchdog timeout) framework when a task is detected as OOB (out-of-bounds), and it halts execution.
A task executing too slowly will trigger this. It may be because you’ve increased the log level, which outputs lots of debug messages and slows the task down.
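
As a rough illustration of the timing check described above, here is a minimal sketch of the general heartbeat-watchdog pattern. It is not the RCE firmware’s implementation (that code is not shown in this thread); the class name, method names, and the 0.5 s bound in the usage note are all hypothetical.

```python
import time


class HeartbeatWatchdog:
    """Flags tasks that have not checked in within their allowed gap."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock      # injectable so the check can be tested
        self._max_gap = {}       # task name -> allowed seconds between beats
        self._last_seen = {}     # task name -> time of the last heartbeat

    def register(self, task, max_gap_s):
        self._max_gap[task] = max_gap_s
        self._last_seen[task] = self._clock()

    def heartbeat(self, task):
        # Called by the task each time it completes a unit of work.
        self._last_seen[task] = self._clock()

    def out_of_bounds(self):
        # A task is "out of bounds" when its last heartbeat is too old.
        now = self._clock()
        return [task for task, gap in self._max_gap.items()
                if now - self._last_seen[task] > gap]
```

For example, after wd.register("isp5-irq", 0.5), a task that spends more than 0.5 s between wd.heartbeat("isp5-irq") calls (say, because it is busy emitting log messages) would appear in wd.out_of_bounds(). In that sense "out of bounds" would refer to a time bound rather than a memory or queue bound.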

Ok, so are you saying that log-level is not really a usable mechanism? Is log level 1 safe?

Out of bounds doesn’t really sound like a timing thing (maybe a queue getting full tho…), and it seems strange that it would bring down the whole firmware.

hello russell9,

The log level is meant for enabling logs to debug an issue, not for actual use cases or camera functionality tests.