Xavier AGX sometimes reboots after a camera frame capture error

Hello.

We have devices based on AGX Xavier with 4 video cameras (JetPack 4.3, L4T 32.3.1, with these patches applied: 0001-vi5-Fix-error-path-in-start-streaming-API.patch (4.6 KB), 0004-tegra-capture-ivc-WAR-add-check-for-msg_id.patch (1.9 KB)).
The Xavier sometimes reboots, and we see this behaviour on multiple devices. Reboots are rare: they happen after 2-3 days of normal operation.

The system logs of the devices look very similar and contain the following messages:

[339442.094271] tegra194-vi5 15c10000.vi: corr_err: discarding frame 32, flags: 0, err_data 512
[339442.095479] tegra194-vi5 15c10000.vi: corr_err: discarding frame 0, flags: 32, err_data 163
[339442.133275] tegra194-vi5 15c10000.vi: corr_err: discarding frame 2, flags: 0, err_data 512
[339442.162638] tegra194-vi5 15c10000.vi: corr_err: discarding frame 3, flags: 0, err_data 512
[339442.228435] tegra194-vi5 15c10000.vi: corr_err: discarding frame 4, flags: 0, err_data 131072
[339442.228866] tegra194-vi5 15c10000.vi: corr_err: discarding frame 5, flags: 0, err_data 131072
[339442.638671] tegra194-vi5 15c10000.vi: corr_err: discarding frame 17, flags: 0, err_data 512
[339442.695063] tegra194-vi5 15c10000.vi: corr_err: discarding frame 18, flags: 0, err_data 131072
[339442.728305] tegra194-vi5 15c10000.vi: corr_err: discarding frame 19, flags: 0, err_data 131072
[339442.794994] tegra194-vi5 15c10000.vi: corr_err: discarding frame 21, flags: 0, err_data 131072
[339442.826873] tegra194-vi5 15c10000.vi: corr_err: discarding frame 22, flags: 0, err_data 512
[339442.994962] tegra194-vi5 15c10000.vi: corr_err: discarding frame 27, flags: 0, err_data 131072
[339443.120704] tegra194-vi5 15c10000.vi: corr_err: discarding frame 31, flags: 0, err_data 512
[339443.167608] tegra194-vi5 15c10000.vi: corr_err: discarding frame 33, flags: 0, err_data 512
[339443.209726] tegra194-vi5 15c10000.vi: corr_err: discarding frame 34, flags: 0, err_data 512
[339443.246360] tegra194-vi5 15c10000.vi: corr_err: discarding frame 35, flags: 0, err_data 512
[339443.262141] tegra194-vi5 15c10000.vi: corr_err: discarding frame 36, flags: 0, err_data 512
[339443.403682] tegra194-vi5 15c10000.vi: corr_err: discarding frame 40, flags: 0, err_data 512
tegra194-vi5 15c10000.vi: corr_err: discarding frame 41, flags: 0, err_data 512
[ 0.000000] Booting Linux on physical CPU 0x0
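
To compare devices, the corr_err lines above can be tallied by their flags and err_data values. This is only a minimal sketch: the regular expression is written against the message format shown in this log, and the function name tally_corr_err is made up for illustration.

```python
import re
from collections import Counter

# Pattern for the vi5 "discarding frame" kernel messages quoted above.
CORR_ERR = re.compile(
    r"corr_err: discarding frame (?P<frame>\d+), "
    r"flags: (?P<flags>\d+), err_data (?P<err_data>\d+)"
)


def tally_corr_err(lines):
    """Count (flags, err_data) pairs over an iterable of log lines."""
    counts = Counter()
    for line in lines:
        m = CORR_ERR.search(line)
        if m:
            counts[(int(m["flags"]), int(m["err_data"]))] += 1
    return counts


# A few lines copied from the log in this post.
sample = [
    "[339442.094271] tegra194-vi5 15c10000.vi: corr_err: discarding frame 32, flags: 0, err_data 512",
    "[339442.095479] tegra194-vi5 15c10000.vi: corr_err: discarding frame 0, flags: 32, err_data 163",
    "[339442.228435] tegra194-vi5 15c10000.vi: corr_err: discarding frame 4, flags: 0, err_data 131072",
    "[ 0.000000] Booting Linux on physical CPU 0x0",
]

if __name__ == "__main__":
    for (flags, err_data), n in tally_corr_err(sample).most_common():
        print(f"flags={flags} err_data={err_data} (0x{err_data:x}): {n}")
```

In practice the input would be a saved dmesg or journal file opened with open(path), one tally per device.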

It seems that the reboots are caused by video frame capture errors (tegra194-vi5 15c10000.vi: corr_err: discarding frame …).
We also captured debug console output from one of the devices. It shows the following:

фев 09 11:10:43 xavier2 remote-console-log[5188]: [91745.253514] ub953 11-0033: div-m-val=0x01 hs-clk-div=0x02 div-n-val=0x28 gpio-rmten=0x00 gpio-out-src=0x08 i2c-voltage-sel=0x00
фев 09 11:10:43 xavier2 remote-console-log[5188]: [91745.353491] ub953 9-0033: div-m-val=0x01 hs-clk-div=0x02 div-n-val=0x28 gpio-rmten=0x00 gpio-out-src=0x08 i2c-voltage-sel=0x00
фев 09 11:10:43 xavier2 remote-console-log[5188]: [91745.377515] ub960 9-0038: RX 1: CSI_TX_ISR: IS_CSI_PASS_ERROR, CSI_TX_ISR: IS_CSI_PASS, RX_PORT_STS1: LOCK_STS_CHG, RX_PORT_STS1: LOCK_STS, RX_PORT_STS
фев 09 11:10:43 xavier2 remote-console-log[5188]: [91745.378065] ub960 9-0038: Pin isnt’ configured for output
фев 09 11:10:44 xavier2 remote-console-log[5188]: [91746.405498] ub953 9-0031: div-m-val=0x01 hs-clk-div=0x02 div-n-val=0x28 gpio-rmten=0x00 gpio-out-src=0x08 i2c-voltage-sel=0x00
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570313] CPU5: SError detected, daif=1c0, spsr=0x40c000c5, mpidr=80000201, esr=be000000
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570319] CPU3: SError detected, daif=1c0, spsr=0x40c000c5, mpidr=80000101, esr=be000000
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570327] CPU1: SError detected, daif=1c0, spsr=0x40c000c5, mpidr=80000001, esr=be000000
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570333] CPU4: SError detected, daif=1c0, spsr=0x40c000c5, mpidr=80000200, esr=be000000
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570340] CPU7: SError detected, daif=1c0, spsr=0x40c000c5, mpidr=80000301, esr=be000000
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570394] **************************************
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570396] * For more Internal Decode Help
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570397] * http://nv/cbberr
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570398] * NVIDIA userID is required to access
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570400] **************************************
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570402] CPU:3, Error:RCE-NOC
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570405] Error Logger : 1
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570414] ErrLog0 : 0x80030600
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570416] Transaction Type : RD - Read, Incrementing
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570417] Error Code : TMO
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570419] Error Source : Target NIU
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570421] Error Description : Target time-out error
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570422] Packet header Lock : 0
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570424] Packet header Len1 : 3
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570425] NOC protocol version : version >= 2.7
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570427] ErrLog1 : 0x157600
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570428] ErrLog2 : 0x0
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570429] RouteId : 0x157600
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570431] InitFlow : cpu_p_i/I/0
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570433] Targflow : cbb_t/T/0
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570435] TargSubRange : 27
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570436] SeqId : 0
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570437] ErrLog3 : 0x5c00414
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570439] ErrLog4 : 0x0
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570473] Address : 0x15c00414 (unknown device)
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570475] ErrLog5 : 0x387e31
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570477] Master ID : RCE
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570478] Security Group(GRPSEC): 0x3f
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570479] Cache : 0x1 – Device
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570482] Protection : 0x3 – Privileged, Non-Secure, Data Access
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570483] FALCONSEC : 0x0
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570485] Virtual Queuing Channel(VQC): 0x0
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570488] **************************************
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570500] **************************************
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570502] * For more Internal Decode Help
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570503] * http://nv/cbberr
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570504] * NVIDIA userID is required to access
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570505] **************************************
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570507] CPU:3, Error:CBB-NOC
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570508] Error Logger : 1
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570514] ErrLog0 : 0x80030600
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570516] Transaction Type : RD - Read, Incrementing
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570517] Error Code : TMO
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570519] Error Source : Target NIU
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570520] Error Description : Target time-out error
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570522] Packet header Lock : 0
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570523] Packet header Len1 : 3
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570525] NOC protocol version : version >= 2.7
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570526] ErrLog1 : 0x9528aa
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570532] CPU6: SError detected, daif=140, spsr=0x80400145, mpidr=80000300, esr=be000000
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570534] ErrLog2 : 0x0
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570536] RouteId : 0x9528aa
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570538] InitFlow : rce_p2ps/I/rce_p2ps
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570540] Targflow : host1x_p2pm/T/host1x_p2pm
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570541] TargSubRange : 20
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570542] SeqId : 0
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570544] ErrLog3 : 0x414
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570545] ErrLog4 : 0x0
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570549] Address : 0x15c00414 (unknown device)
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570550] ErrLog5 : 0x2af0fc71
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570551] Non-Modify : 0x1
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570553] AXI ID : 0x55
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570554] Master ID : RCE
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570556] Security Group(GRPSEC): 0x3f
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570557] Cache : 0x1 – Device
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570559] Protection : 0x3 – Privileged, Non-Secure, Data Access
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570560] FALCONSEC : 0x0
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570562] Virtual Queuing Channel(VQC): 0x0
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570564] **************************************
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570567] ras_ccplex_serr_callback: Scanning CCPLEX Error Records for Uncorrectable Errors
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570585] CPU:0, Error:CBB-NOC@0x2300000,irq=476
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570587] **************************************
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570588] **************************************
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570590] RAS Error in SCF:IOB, ERRSELR_EL1=1025:
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570592] * For more Internal Decode Help
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570594] Status = 0xf4009604
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570596] * http://nv/cbberr
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570598] IERR = CBB Interface Error: 0x96
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570599] * NVIDIA userID is required to access
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570601] SERR = Assertion Failure: 0x4
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570602] **************************************
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570604] Uncorrectable (this is fatal)
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570605] CPU:0, Error:CBB-NOC
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570607] Error Logger : 1
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570610] MISC0 = 0x40
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570611] MISC1 = 0x264e444561
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570614] ErrLog0 : 0x80030600
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570616] ADDR = 0x8000000013e16464
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570620] Transaction Type : RD - Read, Incrementing
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570622] **************************************
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570623] Error Code : TMO
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570628] Error Source : Target NIU
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570630] ras_corecluster_serr_callback:Scanning CoreCluster Error Records for Uncorrectable Errors
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570631] Error Description : Target time-out error
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570633] Packet header Lock : 0
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570634] Packet header Len1 : 3
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570636] NOC protocol version : version >= 2.7
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570637] ErrLog1 : 0x351a2a
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570638] ErrLog2 : 0x0
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570640] RouteId : 0x351a2a
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570641] InitFlow : ccroc_p2ps/I/ccroc_p2ps
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570643] Targflow : host1x_p2pm/T/host1x_p2pm
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570647] TargSubRange : 13
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570648] SeqId : 0
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570650] ErrLog3 : 0x16464
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570651] ErrLog4 : 0x0
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570667] Address : 0x13e16464 – guest + 0x6464
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570668] ErrLog5 : 0xa89f851

However, we couldn’t reproduce the reboots by starting/stopping or plugging/unplugging the cameras.

After changing the source code of vi5_fops.c to simulate “discarding frame” errors, we managed to reproduce the issue. It takes about one hour to reproduce. The modified vi5_fops.c is attached: vi5_fops.c (24.3 KB)

We also found that if we uncomment “buf->vb2_state = VB2_BUF_STATE_ACTIVE;”, the issue happens almost immediately (after 1-2 minutes).

  1. Do you have any ideas how to fix it?
  2. Could you please clarify how to decode the err_data field?

You can check the …/kernel/nvidia/include/soc/tegra/camrtc-capture.h for the flags

Hi, ShaneCCC.
Could you explain: when the flags field is 0x0, does the err_data field contain flags from CAPTURE_CHANNEL_ERROR_*?
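
For reference while reading the header, here is how the err_data values from the log break down into individual bits. This assumes err_data is a bitmask, which is exactly the open question here; the helper name set_bits is made up, and the bit positions are not mapped to any flag names.

```python
def set_bits(value):
    """Return the positions of the bits set in a non-negative integer."""
    return [i for i in range(value.bit_length()) if value >> i & 1]


# err_data values that appear in the kernel log quoted in this thread.
# Assumption: err_data is a bitmask; the thread has not confirmed this.
if __name__ == "__main__":
    for err_data in (512, 131072, 163):
        print(f"err_data {err_data} = 0x{err_data:x}, bits set: {set_bits(err_data)}")
```

So 512 is bit 9 alone, 131072 is bit 17 alone, and 163 (logged with flags: 32) is bits 0, 1, 5 and 7.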

Hi ShaneCCC,

We found one more way to crash the system, using mmap.
The memory is mapped successfully, but the system crashes when the memory is accessed. This may be related to the first case.
Here is mmap_crash.c (644 Bytes) to reproduce the issue, and here is the log: mmap_crash.log (28.9 KB)

ShaneCCC (Moderator), Feb 17:

You can check the …/kernel/nvidia/include/soc/tegra/camrtc-capture.h for the flags

Which set of flags should I check in camrtc-capture.h for the err_data field?

In the new error log and app code, it seems the application is accessing the SYSRAM address range through address ‘0x40200000’.
The address is in aperture: SYSRAM_0 [0x40000000 - 0x4fffffff] with locality: {SYSTEM}
That is why the RAS error below is reported with that address.
[ 73.273101] CPU7: SError detected, daif=1c0, spsr=0x80c000c5, mpidr=80000301, esr=be000000
[ 73.273821] **************************************
[ 73.273853] RAS Error in L2, ERRSELR_EL1=528:
[ 73.273947] Status = 0xfc00640d
[ 73.273985] IERR = SCF to L2 Decode Error Read: 0x64
[ 73.274017] SERR = Illegal address (software fault): 0xd
[ 73.274044] Overflow (there may be more errors) - Uncorrectable
[ 73.274086] Uncorrectable (this is fatal)
[ 73.274200] MISC0 = 0x80000000100000
[ 73.274230] MISC1 = 0x20240000000
[ 73.274286] ADDR = 0x8000000040200000
[ 73.274425] **************************************
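
The aperture check described above can be sketched as follows. The SYSRAM_0 bounds are the ones quoted in this reply. The mask is an assumption: it treats the low 56 bits of the logged ADDR value as the physical address and the upper bits as status/attribute bits, which matches the 0x40200000 address named above but is not taken from NVIDIA documentation.

```python
# Aperture bounds quoted in the reply above.
SYSRAM_0 = (0x40000000, 0x4FFFFFFF)

# Assumption: the low 56 bits of the RAS ADDR value carry the physical
# address and the upper bits are status/attribute bits.
PADDR_MASK = (1 << 56) - 1


def in_aperture(addr, aperture):
    """Return True if addr lies inside the inclusive (low, high) range."""
    low, high = aperture
    return low <= addr <= high


if __name__ == "__main__":
    ras_addr = 0x8000000040200000  # ADDR value from the RAS error above
    paddr = ras_addr & PADDR_MASK
    print(hex(paddr), in_aperture(paddr, SYSRAM_0))
    # The address from the original CBB timeout, for comparison.
    print(hex(0x15C00414), in_aperture(0x15C00414, SYSRAM_0))
```

The mmap case falls inside SYSRAM_0 while the address from the original report does not.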

This RAS error is different from the original error reported in the forum post.
The original error is a CBB timeout error that occurs when the RCE master reads the register VI_FW_CFG_INT_STATUS_1_0 at address “0x15c00414”.

фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570340] CPU7: SError detected, daif=1c0, spsr=0x40c000c5, mpidr=80000301, esr=be000000
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570394] **************************************
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570396] * For more Internal Decode Help
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570397] * http://nv/cbberr
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570398] * NVIDIA userID is required to access
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570400] **************************************
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570402] CPU:3, Error:RCE-NOC
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570405] Error Logger : 1
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570414] ErrLog0 : 0x80030600
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570416] Transaction Type : RD - Read, Incrementing
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570417] Error Code : TMO
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570419] Error Source : Target NIU
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570421] Error Description : Target time-out error
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570422] Packet header Lock : 0
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570424] Packet header Len1 : 3
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570425] NOC protocol version : version >= 2.7
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570427] ErrLog1 : 0x157600
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570428] ErrLog2 : 0x0
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570429] RouteId : 0x157600
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570431] InitFlow : cpu_p_i/I/0
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570433] Targflow : cbb_t/T/0
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570435] TargSubRange : 27
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570436] SeqId : 0
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570437] ErrLog3 : 0x5c00414
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570439] ErrLog4 : 0x0
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570473] Address : 0x15c00414 (unknown device)
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570475] ErrLog5 : 0x387e31
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570477] Master ID : RCE
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570478] Security Group(GRPSEC): 0x3f
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570479] Cache : 0x1 – Device
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570482] Protection : 0x3 – Privileged, Non-Secure, Data Access
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570483] FALCONSEC : 0x0
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570485] Virtual Queuing Channel(VQC): 0x0
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570488] **************************************
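
To compare error records like the one above across devices, the key fields can be pulled out of a console log. This is a rough sketch written against the line format in this thread; the field list WANTED and the function name parse_record are made up for illustration, and the sample lines are abridged copies of the log above (date prefix dropped).

```python
import re

# Matches "<key> : <value>" after the kernel timestamp, e.g.
# "[91956.570417] Error Code : TMO". Banner lines starting with "*"
# and lines such as "CPU:3, Error:RCE-NOC" do not match.
FIELD = re.compile(r"\[\s*\d+\.\d+\]\s+(?P<key>[^:*]+?)\s*:\s+(?P<value>.+)$")

# Fields chosen for this illustration only.
WANTED = ("Error Code", "Address", "Master ID", "InitFlow", "Targflow")


def parse_record(lines):
    """Return the first value seen for each wanted field."""
    fields = {}
    for line in lines:
        m = FIELD.search(line)
        if m and m["key"] in WANTED:
            fields.setdefault(m["key"], m["value"].strip())
    return fields


sample = [
    "xavier2 remote-console-log[5188]: [91956.570402] CPU:3, Error:RCE-NOC",
    "xavier2 remote-console-log[5188]: [91956.570417] Error Code : TMO",
    "xavier2 remote-console-log[5188]: [91956.570473] Address : 0x15c00414 (unknown device)",
    "xavier2 remote-console-log[5188]: [91956.570477] Master ID : RCE",
    "xavier2 remote-console-log[5188]: [91956.570397] * http://nv/cbberr",
]

if __name__ == "__main__":
    print(parse_record(sample))
```

Only the first record in the input is kept per field, so a log containing several records should be split before parsing.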

Hi, ShaneCCC.

I created a new topic for the second case: Xavier AGX kernel panic after mmap.

Do you have any updates regarding the first case?

Could you check whether the issue also happens in the single-camera case?