Hello.
We have devices based on AGX Xavier with 4 video cameras (JetPack 4.3, L4T 32.3.1), with the following patches applied: 0001-vi5-Fix-error-path-in-start-streaming-API.patch (4.6 KB) and 0004-tegra-capture-ivc-WAR-add-check-for-msg_id.patch (1.9 KB).
The Xavier occasionally reboots, and we see this behaviour on multiple devices. The reboots are rare (one after 2-3 days of normal operation).
The system logs of the affected devices look very similar and contain the following messages just before the reboot:
[339442.094271] tegra194-vi5 15c10000.vi: corr_err: discarding frame 32, flags: 0, err_data 512
[339442.095479] tegra194-vi5 15c10000.vi: corr_err: discarding frame 0, flags: 32, err_data 163
[339442.133275] tegra194-vi5 15c10000.vi: corr_err: discarding frame 2, flags: 0, err_data 512
[339442.162638] tegra194-vi5 15c10000.vi: corr_err: discarding frame 3, flags: 0, err_data 512
[339442.228435] tegra194-vi5 15c10000.vi: corr_err: discarding frame 4, flags: 0, err_data 131072
[339442.228866] tegra194-vi5 15c10000.vi: corr_err: discarding frame 5, flags: 0, err_data 131072
[339442.638671] tegra194-vi5 15c10000.vi: corr_err: discarding frame 17, flags: 0, err_data 512
[339442.695063] tegra194-vi5 15c10000.vi: corr_err: discarding frame 18, flags: 0, err_data 131072
[339442.728305] tegra194-vi5 15c10000.vi: corr_err: discarding frame 19, flags: 0, err_data 131072
[339442.794994] tegra194-vi5 15c10000.vi: corr_err: discarding frame 21, flags: 0, err_data 131072
[339442.826873] tegra194-vi5 15c10000.vi: corr_err: discarding frame 22, flags: 0, err_data 512
[339442.994962] tegra194-vi5 15c10000.vi: corr_err: discarding frame 27, flags: 0, err_data 131072
[339443.120704] tegra194-vi5 15c10000.vi: corr_err: discarding frame 31, flags: 0, err_data 512
[339443.167608] tegra194-vi5 15c10000.vi: corr_err: discarding frame 33, flags: 0, err_data 512
[339443.209726] tegra194-vi5 15c10000.vi: corr_err: discarding frame 34, flags: 0, err_data 512
[339443.246360] tegra194-vi5 15c10000.vi: corr_err: discarding frame 35, flags: 0, err_data 512
[339443.262141] tegra194-vi5 15c10000.vi: corr_err: discarding frame 36, flags: 0, err_data 512
[339443.403682] tegra194-vi5 15c10000.vi: corr_err: discarding frame 40, flags: 0, err_data 512
tegra194-vi5 15c10000.vi: corr_err: discarding frame 41, flags: 0, err_data 512
[ 0.000000] Booting Linux on physical CPU 0x0
It seems that the reboots are caused by video frame capture errors (tegra194-vi5 15c10000.vi: corr_err: discarding frame …).
We also captured the debug console output from one of the devices. It shows the following:
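To compare logs across devices, the corr_err lines can be tallied by err_data value. The following is only a minimal sketch: the regular expression is written against the exact line format shown above, and the function name `summarize_corr_err` is our own, not part of any NVIDIA tool.

```python
import re
from collections import Counter

# Matches the vi5 "corr_err" lines as they appear in the log above.
# The kernel timestamp is optional because the last line before the
# reboot was logged without one.
CORR_ERR = re.compile(
    r"(?:\[\s*(?P<ts>\d+\.\d+)\]\s+)?tegra194-vi5 \S+: corr_err: "
    r"discarding frame (?P<frame>\d+), flags: (?P<flags>\d+), "
    r"err_data (?P<err>\d+)"
)


def summarize_corr_err(lines):
    """Count how often each err_data value occurs in the given log lines."""
    counts = Counter()
    for line in lines:
        match = CORR_ERR.search(line)
        if match:
            counts[int(match.group("err"))] += 1
    return counts


# A few lines copied from the log above.
sample = [
    "[339442.094271] tegra194-vi5 15c10000.vi: corr_err: discarding frame 32, flags: 0, err_data 512",
    "[339442.095479] tegra194-vi5 15c10000.vi: corr_err: discarding frame 0, flags: 32, err_data 163",
    "[339442.228435] tegra194-vi5 15c10000.vi: corr_err: discarding frame 4, flags: 0, err_data 131072",
    "tegra194-vi5 15c10000.vi: corr_err: discarding frame 41, flags: 0, err_data 512",
]

for value, count in summarize_corr_err(sample).most_common():
    print(f"err_data {value} (0x{value:x}): {count}")
```

Feeding it the full dmesg output from several devices should show whether the same err_data values dominate everywhere.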
фев 09 11:10:43 xavier2 remote-console-log[5188]: [91745.253514] ub953 11-0033: div-m-val=0x01 hs-clk-div=0x02 div-n-val=0x28 gpio-rmten=0x00 gpio-out-src=0x08 i2c-voltage-sel=0x00
фев 09 11:10:43 xavier2 remote-console-log[5188]: [91745.353491] ub953 9-0033: div-m-val=0x01 hs-clk-div=0x02 div-n-val=0x28 gpio-rmten=0x00 gpio-out-src=0x08 i2c-voltage-sel=0x00
фев 09 11:10:43 xavier2 remote-console-log[5188]: [91745.377515] ub960 9-0038: RX 1: CSI_TX_ISR: IS_CSI_PASS_ERROR, CSI_TX_ISR: IS_CSI_PASS, RX_PORT_STS1: LOCK_STS_CHG, RX_PORT_STS1: LOCK_STS, RX_PORT_STS
фев 09 11:10:43 xavier2 remote-console-log[5188]: [91745.378065] ub960 9-0038: Pin isnt’ configured for output
фев 09 11:10:44 xavier2 remote-console-log[5188]: [91746.405498] ub953 9-0031: div-m-val=0x01 hs-clk-div=0x02 div-n-val=0x28 gpio-rmten=0x00 gpio-out-src=0x08 i2c-voltage-sel=0x00
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570313] CPU5: SError detected, daif=1c0, spsr=0x40c000c5, mpidr=80000201, esr=be000000
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570319] CPU3: SError detected, daif=1c0, spsr=0x40c000c5, mpidr=80000101, esr=be000000
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570327] CPU1: SError detected, daif=1c0, spsr=0x40c000c5, mpidr=80000001, esr=be000000
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570333] CPU4: SError detected, daif=1c0, spsr=0x40c000c5, mpidr=80000200, esr=be000000
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570340] CPU7: SError detected, daif=1c0, spsr=0x40c000c5, mpidr=80000301, esr=be000000
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570394] **************************************
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570396] * For more Internal Decode Help
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570397] * http://nv/cbberr
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570398] * NVIDIA userID is required to access
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570400] **************************************
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570402] CPU:3, Error:RCE-NOC
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570405] Error Logger : 1
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570414] ErrLog0 : 0x80030600
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570416] Transaction Type : RD - Read, Incrementing
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570417] Error Code : TMO
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570419] Error Source : Target NIU
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570421] Error Description : Target time-out error
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570422] Packet header Lock : 0
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570424] Packet header Len1 : 3
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570425] NOC protocol version : version >= 2.7
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570427] ErrLog1 : 0x157600
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570428] ErrLog2 : 0x0
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570429] RouteId : 0x157600
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570431] InitFlow : cpu_p_i/I/0
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570433] Targflow : cbb_t/T/0
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570435] TargSubRange : 27
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570436] SeqId : 0
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570437] ErrLog3 : 0x5c00414
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570439] ErrLog4 : 0x0
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570473] Address : 0x15c00414 (unknown device)
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570475] ErrLog5 : 0x387e31
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570477] Master ID : RCE
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570478] Security Group(GRPSEC): 0x3f
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570479] Cache : 0x1 – Device
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570482] Protection : 0x3 – Privileged, Non-Secure, Data Access
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570483] FALCONSEC : 0x0
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570485] Virtual Queuing Channel(VQC): 0x0
фев 09 11:14:14 xavier2 remote-console-log[5188]: [91956.570488] **************************************
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570500] **************************************
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570502] * For more Internal Decode Help
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570503] * http://nv/cbberr
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570504] * NVIDIA userID is required to access
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570505] **************************************
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570507] CPU:3, Error:CBB-NOC
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570508] Error Logger : 1
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570514] ErrLog0 : 0x80030600
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570516] Transaction Type : RD - Read, Incrementing
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570517] Error Code : TMO
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570519] Error Source : Target NIU
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570520] Error Description : Target time-out error
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570522] Packet header Lock : 0
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570523] Packet header Len1 : 3
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570525] NOC protocol version : version >= 2.7
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570526] ErrLog1 : 0x9528aa
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570532] CPU6: SError detected, daif=140, spsr=0x80400145, mpidr=80000300, esr=be000000
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570534] ErrLog2 : 0x0
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570536] RouteId : 0x9528aa
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570538] InitFlow : rce_p2ps/I/rce_p2ps
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570540] Targflow : host1x_p2pm/T/host1x_p2pm
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570541] TargSubRange : 20
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570542] SeqId : 0
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570544] ErrLog3 : 0x414
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570545] ErrLog4 : 0x0
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570549] Address : 0x15c00414 (unknown device)
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570550] ErrLog5 : 0x2af0fc71
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570551] Non-Modify : 0x1
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570553] AXI ID : 0x55
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570554] Master ID : RCE
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570556] Security Group(GRPSEC): 0x3f
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570557] Cache : 0x1 – Device
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570559] Protection : 0x3 – Privileged, Non-Secure, Data Access
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570560] FALCONSEC : 0x0
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570562] Virtual Queuing Channel(VQC): 0x0
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570564] **************************************
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570567] ras_ccplex_serr_callback: Scanning CCPLEX Error Records for Uncorrectable Errors
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570585] CPU:0, Error:CBB-NOC@0x2300000,irq=476
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570587] **************************************
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570588] **************************************
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570590] RAS Error in SCF:IOB, ERRSELR_EL1=1025:
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570592] * For more Internal Decode Help
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570594] Status = 0xf4009604
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570596] * http://nv/cbberr
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570598] IERR = CBB Interface Error: 0x96
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570599] * NVIDIA userID is required to access
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570601] SERR = Assertion Failure: 0x4
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570602] **************************************
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570604] Uncorrectable (this is fatal)
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570605] CPU:0, Error:CBB-NOC
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570607] Error Logger : 1
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570610] MISC0 = 0x40
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570611] MISC1 = 0x264e444561
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570614] ErrLog0 : 0x80030600
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570616] ADDR = 0x8000000013e16464
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570620] Transaction Type : RD - Read, Incrementing
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570622] **************************************
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570623] Error Code : TMO
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570628] Error Source : Target NIU
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570630] ras_corecluster_serr_callback:Scanning CoreCluster Error Records for Uncorrectable Errors
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570631] Error Description : Target time-out error
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570633] Packet header Lock : 0
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570634] Packet header Len1 : 3
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570636] NOC protocol version : version >= 2.7
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570637] ErrLog1 : 0x351a2a
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570638] ErrLog2 : 0x0
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570640] RouteId : 0x351a2a
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570641] InitFlow : ccroc_p2ps/I/ccroc_p2ps
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570643] Targflow : host1x_p2pm/T/host1x_p2pm
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570647] TargSubRange : 13
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570648] SeqId : 0
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570650] ErrLog3 : 0x16464
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570651] ErrLog4 : 0x0
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570667] Address : 0x13e16464 – guest + 0x6464
фев 09 11:14:15 xavier2 remote-console-log[5188]: [91956.570668] ErrLog5 : 0xa89f851
However, we could not reproduce the reboots by starting/stopping the streams or by plugging/unplugging the cameras.
After changing the source code of vi5_fops.c to simulate the “discarding frame” errors, we managed to reproduce the issue. It takes about one hour to reproduce. The modified vi5_fops.c is attached: vi5_fops.c (24.3 KB)
We also found that if we uncomment “buf->vb2_state = VB2_BUF_STATE_ACTIVE;”, the issue happens almost immediately (after 1-2 minutes).
- Do you have any ideas how to fix it?
- Could you please clarify how to decode the err_data field?
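For reference, here is how the err_data values from our logs break down into individual bits. This is only a sketch and assumes that err_data is a bitmask; we do not know what the individual bits mean, which is what we are asking about. The helper name `set_bits` is our own.

```python
def set_bits(value):
    """Return the positions of the bits set in value, lowest bit first."""
    return [i for i in range(value.bit_length()) if (value >> i) & 1]


# The three distinct err_data values seen in the log above.
for err_data in (512, 131072, 163):
    print(f"err_data {err_data} = 0x{err_data:x}, bits set: {set_bits(err_data)}")
```

Under that assumption, 512 is bit 9 only, 131072 is bit 17 only, and 163 (0xa3) is bits 0, 1, 5 and 7.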