Hi,
my company has a carrier board that connects 8 GMSL cameras via a MAX9295-MAX9296 SerDes pipeline, and we are encountering kernel panics that seem somewhat random. Originally I was using the 5.1.2 kernel, and after finding and trying out these patches I eventually moved to the 5.1.3 kernel, which has proved to be more stable.
It is hard to narrow down exactly what is going wrong, but the panics always seem to stem from somewhere in csi5/vi5_fops. An example panic:
[ 479.780563] Unable to handle kernel paging request at virtual address fffef249efbe51b0
[ 479.788738] Mem abort info:
[ 479.791613] ESR = 0x96000044
[ 479.794782] EC = 0x25: DABT (current EL), IL = 32 bits
[ 479.800286] SET = 0, FnV = 0
[ 479.803431] EA = 0, S1PTW = 0
[ 479.806703] Data abort info:
[ 479.809693] ISV = 0, ISS = 0x00000044
[ 479.813666] CM = 0, WnR = 1
[ 479.816721] [fffef249efbe51b0] address between user and kernel address ranges
[ 479.824092] Internal error: Oops: 96000044 [#1] PREEMPT SMP
[ 479.829819] Modules linked in: fuse xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrags
[ 479.829916] nct1008 ina3221 typec max96712 tegra_bpmp_thermal snd_soc_spdif_tx tegra210_adma snd_soc_rl6231 fzcam snd_hda_core userspace_alert fzcama]
[ 479.942957] CPU: 3 PID: 16162 Comm: vi-output, DS5 Not tainted 5.10.192a1 #20
[ 479.950387] Hardware name: Jetson AGX Orin Developer Kit (DT)
[ 479.956295] pstate: 80c00009 (Nzcv daif +PAN +UAO -TCO BTYPE=--)
[ 479.962476] pc : queued_spin_lock_slowpath+0x214/0x3c0
[ 479.967759] lr : queued_spin_lock_slowpath+0x20/0x3c0
[ 479.972950] sp : ffff8000329db940
[ 479.976360] x29: ffff8000329db940 x28: 0000000000000000
[ 479.981824] x27: ffff8000304eb998 x26: ffff1aaa71d90880
[ 479.987284] x25: ffff1aaa6f232530 x24: ffff1aab3cb6a400
[ 479.992745] x23: ffff1aaa6f232538 x22: ffff8000329db9d8
[ 479.998210] x21: ffff1aaa6e1a2c40 x20: 000000000fee65d4
[ 480.003673] x19: ffff1aaa6f232538 x18: 0000000000000010
[ 480.009133] x17: 0000000000000000 x16: ffffd798c8d250f0
[ 480.014592] x15: ffff1aaa6e1a31b0 x14: 653331203569762d
[ 480.020051] x13: 3439316172676574 x12: 3069763a78317473
[ 480.025512] x11: 6f682e3030303034 x10: 0000000000000000
[ 480.030979] x9 : 0000000000100000 x8 : 0000000000000000
[ 480.032157] tegra194-vi5 13e40000.host1x:vi0@15c00000: capture control message timed out
[ 480.036438] x7 : ffff1ab16cf31b00 x6 : ffff1aaa6e1a2c40
[ 480.036440] x5 : ffffd798ca747b00 x4 : ffff1ab16cf31b00
[ 480.036443] x3 : ffff1ab12549d690 x2 : ffffd798ca747b20
[ 480.036445] x1 : ffffd798ca747b20 x0 : ffff1ab16cf31b08
[ 480.036449] Call trace:
[ 480.036456] queued_spin_lock_slowpath+0x214/0x3c0
[ 480.036467] _raw_spin_lock+0x8c/0xa0
[ 480.036473] __mutex_unlock_slowpath.isra.0+0x154/0x1b0
[ 480.044838] Unable to handle kernel paging request at virtual address ffffdc7aca747b30
[ 480.050251] mutex_unlock+0x6c/0x80
[ 480.050260] vi_capture_ivc_send_control.isra.0+0xf4/0x130
[ 480.050262] vi_capture_control_send_message+0x7c/0x230
[ 480.050264] vi_capture_control_message+0xa0/0x140
[ 480.050267] csi5_send_control_message+0x54/0xb0
[ 480.050268] csi5_stream_open+0x98/0x110
[ 480.050271] csi5_start_streaming+0xac/0xe0
[ 480.050273] csi5_error_recover+0x44/0x90
[ 480.050281] tegra_csi_error_recovery+0x58/0xc0
[ 480.050292] tegra_csi_sync_event+0x44/0x50
[ 480.055758] Mem abort info:
[ 480.061206] vi5_channel_error_recover+0x1ac/0x200
[ 480.061209] tegra_channel_error_recover+0x78/0xa0
[ 480.061211] tegra_channel_kthread_capture_dequeue+0x130/0x1f0
[ 480.061215] kthread+0x148/0x170
[ 480.061218] ret_from_fork+0x10/0x18
[ 480.061228] Code: 910020e0 8b0200a2 f861d863 aa0203e1 (f8236827)
[ 480.066687] ESR = 0x96000044
[ 480.069209] ---[ end trace 6ec6b86511b39b1e ]---
[ 480.074148] EC = 0x25: DABT (current EL), IL = 32 bits
[ 480.083671] Kernel panic - not syncing: Oops: Fatal exception
[ 480.091480] SET = 0, FnV = 0
[ 480.095061] SMP: stopping secondary CPUs
[ 480.095079] Kernel Offset: 0x5798b8d00000 from 0xffff800010000000
[ 480.095080] PHYS_OFFSET: 0xffffe556c0000000
[ 480.095083] CPU features: 0x08040006,4a80aa38
[ 480.095084] Memory Limit: none
[ 480.219431] ---[ end Kernel panic - not syncing: Oops: Fatal exception ]---
However, the faulting memory access commonly changes between panics when I resolve the instruction addresses, which makes me believe some sort of memory corruption is happening.
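(For anyone wanting to map the offsets back to source lines, the kernel's own scripts/faddr2line helper can resolve them against the matching vmlinux, e.g.
./scripts/faddr2line vmlinux queued_spin_lock_slowpath+0x214/0x3c0 _raw_spin_lock+0x8c/0xa0
assuming the build artefacts for the exact kernel are at hand.)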
While I can't say definitively that it is the cause, during testing (which usually involves switching streams) there are always a lot of corr_err warnings before a panic, which makes me think the error recovery/handling code cannot keep up with errors arriving at the rate we are seeing them.
Digging into the errors, I am now seeing messages such as
tegra-camrtc-capture-vi tegra-capture-vi: corr_err: discarding frame 0, flags: 0, err_data 0x200, status: 14
which, according to the status code, come from the Falcon. I get 3 kinds of errors but cannot get any more information out of the err_data. Ignoring the most common error in vi5_capture_dequeue seemed to increase stability, but the system is still very prone to panicking.
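For completeness, the workaround looks roughly like the following. This is only a sketch: the status value 14 and the way it plugs into vi5_capture_dequeue reflect my reading of the corr_err log and my kernel tree, so the names may not match other L4T releases.
#include <linux/types.h>
/* Workaround sketch: decide whether a capture descriptor status should be
 * treated as ignorable in vi5_capture_dequeue() instead of feeding into the
 * corr_err / error-recovery path. The value 14 is the status seen in the
 * corr_err log above, which I understand to be the Falcon error status;
 * adjust to the enum in your capture headers. */
#define VI_IGNORED_FALCON_STATUS	14U
static bool vi5_status_is_ignorable(u32 status)
{
	return status == VI_IGNORED_FALCON_STATUS;
}
/* In the descriptor-error branch of vi5_capture_dequeue(), before the corr_err
 * warning is printed, the buffer is handed back as an error frame and the rest
 * of the error handling is skipped when the helper returns true. */
This obviously just papers over whatever the Falcon is reporting rather than fixing it.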
I came across this thread, which also seems to fit, as the panics started when the firmware of our Intel cameras decreased the frame period, i.e. increased the data rate, which is why I believe the skew error is a likely culprit. My issue is that I boot the Jetson from a USB disk that does not contain a firmware partition, and I have had nothing but painful experiences with the flash scripts from NVIDIA, so I would very much appreciate the updated Falcon firmware in a standalone form that I can put into my rootfs to test.
On a side note, we also very commonly see the following errors
[ 3215.206790] tegra-camrtc-capture-vi tegra-capture-vi: uncorr_err: flags 2, err_data 0
[ 3217.839769] tegra-camrtc-capture-vi tegra-capture-vi: uncorr_err: request timed out after 2500 ms
[ 3217.849012] tegra-camrtc-capture-vi tegra-capture-vi: err_rec: attempting to reset the capture channel
[ 3217.863781] t194-nvcsi 13e40000.host1x:nvcsi@15a00000: csi5_stream_close: Error in closing stream_id=4, csi_port=4
[ 3217.875830] tegra-camrtc-capture-vi tegra-capture-vi: err_rec: successfully reset the capture channel
[ 3255.076773] tegra-camrtc-capture-vi tegra-capture-vi: uncorr_err: flags 2, err_data 0
[ 3257.518888] tegra-camrtc-capture-vi tegra-capture-vi: uncorr_err: request timed out after 2500 ms
[ 3257.528099] tegra-camrtc-capture-vi tegra-capture-vi: err_rec: attempting to reset the capture channel
[ 3257.538683] t194-nvcsi 13e40000.host1x:nvcsi@15a00000: csi5_stream_close: Error in closing stream_id=4, csi_port=4
[ 3257.551782] tegra-camrtc-capture-vi tegra-capture-vi: err_rec: successfully reset the capture channel
[ 3257.774994] tegra-camrtc-capture-vi tegra-capture-vi: uncorr_err: request timed out after 2500 ms
[ 3257.784164] tegra-camrtc-capture-vi tegra-capture-vi: err_rec: attempting to reset the capture channel
[ 3257.794198] t194-nvcsi 13e40000.host1x:nvcsi@15a00000: csi5_stream_close: Error in closing stream_id=2, csi_port=2
[ 3257.805406] t194-nvcsi 13e40000.host1x:nvcsi@15a00000: csi5_stream_open: VI channel not found for stream- 2 vc- 0
[ 3257.816808] tegra-camrtc-capture-vi tegra-capture-vi: err_rec: successfully reset the capture channel
in case this helps give more context or is related.
Cheers,
Alex