JetPack 4.6 Xavier boot crash with BPMP-NOC read (incrementing) timeout

We are seeing intermittent boot failures with a Jetson AGX Xavier on our custom carrier board.

Steps to reproduce:
1. Power off the board and leave it off for a while.
2. Power on and watch the debug port output. If it boots normally, power-cycle the device again.
3. After four or five power cycles, the problem reproduces.
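
For anyone repeating these power cycles, the check in step 2 can be automated by scanning each captured console log for the failure signatures seen in the log in this post. This is only a minimal sketch: the pattern names and the `classify_boot_log` helper are hypothetical, and how the serial output is captured is left out because it depends on the test setup.

```python
import re

# Failure signatures copied from the debug-port log in this post.
# The dictionary keys are illustrative names, not kernel identifiers.
FAILURE_PATTERNS = {
    "bpmp_timeout": re.compile(r"bpmp_wait_ack\(\) returned -110"),
    "serror": re.compile(r"CPU\d+: SError detected"),
    "noc_error": re.compile(r"Error:BPMP-NOC"),
    "panic": re.compile(r"Kernel panic - not syncing"),
}


def classify_boot_log(text):
    """Return the sorted names of the failure signatures found in a console capture."""
    return sorted(
        name for name, pattern in FAILURE_PATTERNS.items() if pattern.search(text)
    )


if __name__ == "__main__":
    sample = (
        "[    7.882195] bpmp_wait_ack() returned -110 (ch 22 mrq 3 data <0x25 0x01 0x00 0x06>)\n"
        "[    7.964978] CPU1: SError detected, daif=1c0, spsr=0x60c000c5\n"
    )
    print(classify_boot_log(sample))
```

An empty result means none of these signatures appeared in that boot, so the failure rate over many cycles can be counted without reading each log by hand.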

debug port log:

[    0.000000] Booting Linux on physical CPU 0x0
[    0.000000] Linux version 4.9.253+ (liyanhou@uisee-System-Product-Name) (gcc version 7.5.0 (Linaro GCC 7.5-2019.12) ) #1 SMP PREEMPT Tue Apr 26 11:17:20 CST 2022
[    0.000000] Boot CPU: AArch64 Processor [4e0f0040]
[    0.000000] OF: fdt:memory scan node memory, reg size 48,
[    0.000000] OF: fdt: - 80000000 ,  2c000000
[    0.000000] OF: fdt: - ac200000 ,  44800000
[    0.000000] OF: fdt: - 100000000 ,  780000000
[    0.000000] earlycon: tegra_comb_uart0 at MMIO32 0x000000000c168000 (options '')
[    0.000000] bootconsole [tegra_comb_uart0] enabled
[    1.640722] ucsi_ccg 1-0008: read version failed
[    1.640846] ucsi_ccg 1-0008: get_fw_info fail,, err=-121
[    2.191234] rt5659 7-001a: Device with ID register ffffff80 is not rt5659
[    7.882195] bpmp_wait_ack() returned -110 (ch 22 mrq 3 data <0x25 0x01 0x00 0x06>)
[    7.964978] CPU1: SError detected, daif=1c0, spsr=0x60c000c5, mpidr=80000001, esr=be000000
[    7.964986] CPU7: SError detected, daif=1c0, spsr=0x60c000c5, mpidr=80000301, esr=be000000
[    7.964990] CPU6: SError detected, daif=1c0, spsr=0x608000c5, mpidr=80000300, esr=be000000
[    7.964999] CPU4: SError detected, daif=1c0, spsr=0x608000c5, mpidr=80000200, esr=be000000
[    7.965003] CPU5: SError detected, daif=1c0, spsr=0x608000c5, mpidr=80000201, esr=be000000
[    7.965012] CPU3: SError detected, daif=1c0, spsr=0x608000c5, mpidr=80000101, esr=be000000
[    7.965017] CPU2: SError detected, daif=1c0, spsr=0x608000c5, mpidr=80000100, esr=be000000
[    8.129548] CPU:0, Error:BPMP-NOC@0xd600000,irq=483
[    8.129550] **************************************
[    8.129552] * For more Internal Decode Help
[    8.129553] *     http://nv/cbberr
[    8.129554] * NVIDIA userID is required to access
[    8.129555] **************************************
[    8.129557] CPU:0, Error:BPMP-NOC
[    8.129559]  Error Logger        : 1
[    8.129568]  ErrLog0         : 0x80030600
[    8.129570]    Transaction Type  : RD  - Read, Incrementing
[    8.129572]    Error Code        : TMO
[    8.129573]    Error Source      : Target NIU
[    8.129575]    Error Description : Target time-out error
[    8.129577]    Packet header Lock    : 0
[    8.129579]    Packet header Len1    : 3
[    8.129581]    NOC protocol version  : version >= 2.7
[    8.129582]  ErrLog1         : 0xbba00
[    8.129584]  ErrLog2         : 0x0
[    8.129586]    RouteId       : 0xbba00
[    8.129588]    InitFlow      : cpu_p_i/I/0
[    8.129589]    Targflow      : cbb_t/T/0
[    8.129591]    TargSubRange      : 13
[    8.129592]    SeqId         : 0
[    8.129594]  ErrLog3         : 0x700020c
[    8.129596]  ErrLog4         : 0x0
[    8.129623]    Address       : 0x1700020c (unknown device)
[    8.129625]  ErrLog5         : 0xcfa30
[    8.129627]    Master ID     : BPMP
[    8.129629]    Security Group(GRPSEC): 0x7d
[    8.129631]    Cache         : 0x0 -- Non-cacheable/Non-Bufferable)
[    8.129634]    Protection        : 0x3 -- Privileged, Non-Secure, Data Access
[    8.129635]    FALCONSEC     : 0x0
[    8.129637]    Virtual Queuing Channel(VQC): 0x0
[    8.129640]  **************************************
[    8.129716] CPU0: SError detected, daif=1c0, spsr=0x40400045, mpidr=80000000, esr=be000000
[   12.211802] **************************************
[   12.211921] * For more Internal Decode Help
[   12.211993] *     http://nv/cbberr
[   12.212052] * NVIDIA userID is required to access
[   12.212131] **************************************
[   12.212214] CPU:7, Error:BPMP-NOC
[   12.212274]  Error Logger        : 1
[   12.212334]  ErrLog0         : 0x80030608
[   12.212400]    Transaction Type  : WR  - Write, Incrementing
[   12.212496]    Error Code        : TMO
[   12.212554]    Error Source      : Target NIU
[   12.212624]    Error Description : Target time-out error
[   12.212718]    Packet header Lock    : 0
[   12.212784]    Packet header Len1    : 3
[   12.212847]    NOC protocol version  : version >= 2.7
[   12.212930]  ErrLog1         : 0xbaa01
[   12.212988]  ErrLog2         : 0x0
[   12.213066]    RouteId       : 0xbaa01
[   12.213363]    InitFlow      : cpu_p_i/I/0
[   12.213646]    Targflow      : cbb_t/T/0
[   12.213943]    TargSubRange      : 5
[   12.214186]    SeqId         : 0
[   12.214425]  ErrLog3         : 0x190000
[   12.214682]  ErrLog4         : 0x0
[   12.214921]    Address       : 0xc190000 (unknown device)
[   12.217904]  ErrLog5         : 0xcfa30
[   12.221144]    Master ID     : BPMP
[   12.224378]    Security Group(GRPSEC): 0x7d
[   12.228404]    Cache         : 0x0 -- Non-cacheable/Non-Bufferable)
[   12.234007]    Protection        : 0x3 -- Privileged, Non-Secure, Data Access
[   12.240655]    FALCONSEC     : 0x0
[   12.243804]    Virtual Queuing Channel(VQC): 0x0
[   12.248274]  **************************************
[   12.253108] **************************************
[   12.258329] RAS Error in SCF:IOB, ERRSELR_EL1=1025:
[   12.263054]  Status = 0xfc009604
[   12.266380]  IERR = CBB Interface Error: 0x96
[   12.270410]  SERR = Assertion Failure: 0x4
[   12.274690]  Overflow (there may be more errors) - Uncorrectable
[   12.280383]  Uncorrectable (this is fatal)
[   12.284847]  MISC0 = 0x40
[   12.287033]  MISC1 = 0x264a4445e1
[   12.290532]  ADDR = 0x800000001404000c
[   12.294469] **************************************
[   12.299210] **************************************
[   12.304442] RAS Error in L2, ERRSELR_EL1=512:
[   12.308729]  Status = 0xfc006612
[   12.312142]  IERR = SCF to L2 Slave Error Read: 0x66
[   12.317041]  SERR = Error response from slave: 0x12
[   12.321682]  Overflow (there may be more errors) - Uncorrectable
[   12.327804]  Uncorrectable (this is fatal)
[   12.331662]  MISC0 = 0x80000000400000
[   12.335418]  MISC1 = 0x20240000000
[   12.338743]  ADDR = 0x800000001404000c
[   12.342336] **************************************
[   12.347354] **************************************
[   12.352480] RAS Error in L2, ERRSELR_EL1=560:
[   12.356594]  Status = 0xfc006612
[   12.360267]  IERR = SCF to L2 Slave Error Read: 0x66
[   12.365079]  SERR = Error response from slave: 0x12
[   12.369892]  Overflow (there may be more errors) - Uncorrectable
[   12.375408]  Uncorrectable (this is fatal)
[   12.379694]  MISC0 = 0x100000000400000
[   12.383628]  MISC1 = 0x40240000000
[   12.386785]  ADDR = 0x800000001404000c
[   12.390893] **************************************
[   12.395715] Bad mode in Error handler detected on CPU7, code 0xbe000000 -- SError
[   12.403405] Kernel panic - not syncing: bad mode
[   12.407874] CPU: 7 PID: 0 Comm: swapper/7 Tainted: G        W       4.9.253+ #1
[   12.415127] Hardware name: Jetson-AGX (DT)
[   12.418896] Call trace:
[   12.421784] [<ffffff800808ba40>] dump_backtrace+0x0/0x198
[   12.426774] [<ffffff800808c004>] show_stack+0x24/0x30
[   12.432023] [<ffffff8008f660ac>] dump_stack+0xa0/0xc4
[   12.437444] [<ffffff8008f63150>] panic+0x12c/0x2a8
[   12.442431] [<ffffff800808c894>] bad_mode+0x7c/0x80
[   12.447331] [<ffffff800808ca5c>] handle_serr+0x124/0x128
[   12.452669] [<ffffff8008082d98>] el1_serr+0xb0/0x144
[   12.457482] [<ffffff80081114b4>] cpu_startup_entry+0xfc/0x150
[   12.462825] [<ffffff8008091cf8>] secondary_start_kernel+0x190/0x1f8
[   12.469035] [<0000000080f731a8>] 0x80f731a8
[   12.473247] SMP: stopping secondary CPUs
[   13.607853] SMP: failed to stop secondary CPUs 0-7
[   13.607965] Kernel Offset: disabled
[   13.608030] Memory Limit: none
[   13.608087] trusty-log panic notifier - trusty version Built: 12:20:34 Jul 26 2021
[   13.615012] Rebooting in 5 seconds..
[   16.642541] ras_ccplex_serr_callback: Scanning CCPLEX Error Records for Uncorrectable Errors
[   16.642726] **************************************
[   16.642811] RAS Error in SCF:IOB, ERRSELR_EL1=1025:
[   16.642896]  Status = 0xfc009604
[   16.642954]  IERR = CBB Interface Error: 0x96
[   16.643030]  SERR = Assertion Failure: 0x4
[   16.643100]  Overflow (there may be more errors) - Uncorrectable
[   16.643201]  Uncorrectable (this is fatal)
[   16.643276]  MISC0 = 0x40
[   16.643323]  MISC1 = 0x264e444421
[   16.643388]  ADDR = 0x800000001404000c
[   16.643458] **************************************
[   16.643543] ras_corecluster_serr_callback:Scanning CoreCluster Error Records for Uncorrectable Errors
[   16.643733] **************************************
[   16.643846] RAS Error in L2, ERRSELR_EL1=560:
[   16.644197]  Status = 0xfc006612
[   16.644456]  IERR = SCF to L2 Slave Error Read: 0x66
[   16.644833]  SERR = Error response from slave: 0x12
[   16.645215]  Overflow (there may be more errors) - Uncorrectable
[   16.645675]  Uncorrectable (this is fatal)
[   16.648111]  MISC0 = 0x80000000400000
[   16.651783]  MISC1 = 0x20240000000
[   16.655198]  ADDR = 0x800000001404000c
[   16.658790] **************************************
[   16.663694] ras_core_serr_callback: Scanning Core Error Records for Uncorrectable Errors
[   16.671734] Bad mode in Error handler detected on CPU6, code 0xbe000000 -- SError
[   18.615813] SMP: stopping secondary CPUs
[   19.749482] SMP: failed to stop secondary CPUs 0-7
[   20.918568] ras_ccplex_serr_callback: Scanning CCPLEX Error Records for Uncorrectable Errors
[   20.918764] **************************************
[   20.918854] RAS Error in SCF:IOB, ERRSELR_EL1=1025:
[   20.918937]  Status = 0xfc009604
[   20.918997]  IERR = CBB Interface Error: 0x96
[   20.919073]  SERR = Assertion Failure: 0x4
[   20.919144]  Overflow (there may be more errors) - Uncorrectable
[   20.919247]  Uncorrectable (this is fatal)
[   20.919322]  MISC0 = 0x40
[   20.919369]  MISC1 = 0x26424445a1
[   20.919429]  ADDR = 0x800000001404000c
[   20.919500] **************************************
[   20.919586] ras_corecluster_serr_callback:Scanning CoreCluster Error Records for Uncorrectable Errors
[   20.919769] **************************************
[   20.919898] RAS Error in L2, ERRSELR_EL1=544:
[   20.920247]  Status = 0xfc006612
[   20.920492]  IERR = SCF to L2 Slave Error Read: 0x66
[   20.920886]  SERR = Error response from slave: 0x12
[   20.921271]  Overflow (there may be more errors) - Uncorrectable
[   20.921724]  Uncorrectable (this is fatal)
[   20.924314]  MISC0 = 0x80000000400000
[   20.927985]  MISC1 = 0x20240000000
[   20.931399]  ADDR = 0x800000001404000c
[   20.935252] **************************************
[   20.940163] ras_core_serr_callback: Scanning Core Error Records for Uncorrectable Errors
[   20.948202] Bad mode in Error handler detected on CPU4, code 0xbe000000 -- SError
[   25.195030] ras_ccplex_serr_callback: Scanning CCPLEX Error Records for Uncorrectable Errors
[   25.195217] **************************************
[   25.195302] RAS Error in SCF:IOB, ERRSELR_EL1=1025:
[   25.195387]  Status = 0xfc009604
[   25.195447]  IERR = CBB Interface Error: 0x96
[   25.195522]  SERR = Assertion Failure: 0x4
[   25.195592]  Overflow (there may be more errors) - Uncorrectable
[   25.195692]  Uncorrectable (this is fatal)
[   25.195766]  MISC0 = 0x40
[   25.195813]  MISC1 = 0x26424444a3
[   25.195877]  ADDR = 0x800000001404000c
[   25.195946] **************************************
[   25.196031] ras_corecluster_serr_callback:Scanning CoreCluster Error Records for Uncorrectable Errors
[   25.196212] **************************************
[   25.196336] RAS Error in L2, ERRSELR_EL1=544:
[   25.196688]  Status = 0xfc006612
[   25.196944]  IERR = SCF to L2 Slave Error Read: 0x66
[   25.197322]  SERR = Error response from slave: 0x12
[   25.197690]  Overflow (there may be more errors) - Uncorrectable
[   25.198159]  Uncorrectable (this is fatal)
[   25.200604]  MISC0 = 0x100000000400000
[   25.204537]  MISC1 = 0x40240000000
[   25.207953]  ADDR = 0x800000001404000c
[   25.211801] **************************************
[   25.216716] ras_core_serr_callback: Scanning Core Error Records for Uncorrectable Errors
[   25.224489] Bad mode in Error handler detected on CPU5, code 0xbe000000 -- SError
[   29.471333] ras_ccplex_serr_callback: Scanning CCPLEX Error Records for Uncorrectable Errors
[   29.471525] **************************************
[   29.471610] RAS Error in SCF:IOB, ERRSELR_EL1=1025:
[   29.471694]  Status = 0xfc009604
[   29.471754]  IERR = CBB Interface Error: 0x96
[   29.471830]  SERR = Assertion Failure: 0x4
[   29.471901]  Overflow (there may be more errors) - Uncorrectable
[   29.472002]  Uncorrectable (this is fatal)
[   29.472078]  MISC0 = 0x40
[   29.472125]  MISC1 = 0x2646444423
[   29.472186]  ADDR = 0x800000001404000c
[   29.472257] **************************************
[   29.472343] ras_corecluster_serr_callback:Scanning CoreCluster Error Records for Uncorrectable Errors
[   29.472517] **************************************
[   29.472624] RAS Error in L2, ERRSELR_EL1=528:
[   29.472997]  Status = 0xfc006612
[   29.473239]  IERR = SCF to L2 Slave Error Read: 0x66
[   29.473629]  SERR = Error response from slave: 0x12
[   29.474000]  Overflow (there may be more errors) - Uncorrectable
[   29.474469]  Uncorrectable (this is fatal)
[   29.476908]  MISC0 = 0x100000000400000
[   29.480843]  MISC1 = 0x40240000000
[   29.484258]  ADDR = 0x800000001404000c
[   29.487851] **************************************
[   29.492773] ras_core_serr_callback: Scanning Core Error Records for Uncorrectable Errors
[   29.500796] Bad mode in Error handler detected on CPU3, code 0xbe000000 -- SError
[   33.747616] ras_ccplex_serr_callback: Scanning CCPLEX Error Records for Uncorrectable Errors
[   33.747801] **************************************
[   33.747886] RAS Error in SCF:IOB, ERRSELR_EL1=1025:
[   33.747971]  Status = 0xfc009604
[   33.748030]  IERR = CBB Interface Error: 0x96
[   33.748107]  SERR = Assertion Failure: 0x4
[   33.748178]  Overflow (there may be more errors) - Uncorrectable
[   33.748278]  Uncorrectable (this is fatal)
[   33.748357]  MISC0 = 0x40
[   33.748405]  MISC1 = 0x26464444e1
[   33.748467]  ADDR = 0x800000001404000c
[   33.748539] **************************************
[   33.748624] ras_corecluster_serr_callback:Scanning CoreCluster Error Records for Uncorrectable Errors
[   33.748800] **************************************
[   33.748919] RAS Error in L2, ERRSELR_EL1=528:
[   33.749273]  Status = 0xfc006612
[   33.749532]  IERR = SCF to L2 Slave Error Read: 0x66
[   33.749908]  SERR = Error response from slave: 0x12
[   33.750275]  Overflow (there may be more errors) - Uncorrectable
[   33.750749]  Uncorrectable (this is fatal)
[   33.753187]  MISC0 = 0x80000000400000
[   33.756602]  MISC1 = 0x20240000000
[   33.760274]  ADDR = 0x800000001404000c
[   33.764123] **************************************
[   33.769048] ras_core_serr_callback: Scanning Core Error Records for Uncorrectable Errors
[   33.777068] Bad mode in Error handler detected on CPU2, code 0xbe000000 -- SError
[   38.023901] ras_ccplex_serr_callback: Scanning CCPLEX Error Records for Uncorrectable Errors
[   38.024089] **************************************
[   38.024174] RAS Error in SCF:IOB, ERRSELR_EL1=1025:
[   38.024259]  Status = 0xfc009604
[   38.024318]  IERR = CBB Interface Error: 0x96
[   38.024394]  SERR = Assertion Failure: 0x4
[   38.024465]  Overflow (there may be more errors) - Uncorrectable
[   38.024564]  Uncorrectable (this is fatal)
[   38.024642]  MISC0 = 0x40
[   38.024689]  MISC1 = 0x264a4445a1
[   38.024750]  ADDR = 0x800000001404000c
[   38.024819] **************************************
[   38.024904] ras_corecluster_serr_callback:Scanning CoreCluster Error Records for Uncorrectable Errors
[   38.025064] **************************************
[   38.025208] RAS Error in L2, ERRSELR_EL1=512:
[   38.025548]  Status = 0xfc006612
[   38.025822]  IERR = SCF to L2 Slave Error Read: 0x66
[   38.026201]  SERR = Error response from slave: 0x12
[   38.026566]  Overflow (there may be more errors) - Uncorrectable
[   38.027034]  Uncorrectable (this is fatal)
[   38.029215]  MISC0 = 0x80000000400000
[   38.033147]  MISC1 = 0x20240000000
[   38.036559]  ADDR = 0x800000001404000c
[   38.040151] **************************************
[   38.045090] ras_core_serr_callback: Scanning Core Error Records for Uncorrectable Errors
[   38.053356] Bad mode in Error handler detected on CPU0, code 0xbe000000 -- SError
[   42.300170] ras_ccplex_serr_callback: Scanning CCPLEX Error Records for Uncorrectable Errors
[   42.300356] **************************************
[   42.300439] RAS Error in SCF:IOB, ERRSELR_EL1=1025:
[   42.300525]  Status = 0xfc009604
[   42.300584]  IERR = CBB Interface Error: 0x96
[   42.300660]  SERR = Assertion Failure: 0x4
[   42.300731]  Overflow (there may be more errors) - Uncorrectable
[   42.300831]  Uncorrectable (this is fatal)
[   42.300910]  MISC0 = 0x40
[   42.300958]  MISC1 = 0x264a444523
[   42.301019]  ADDR = 0x800000001404000c
[   42.301092] **************************************
[   42.301178] ras_corecluster_serr_callback:Scanning CoreCluster Error Records for Uncorrectable Errors
[   42.301343] **************************************
[   42.301476] RAS Error in L2, ERRSELR_EL1=512:
[   42.301829]  Status = 0xfc006612
[   42.302085]  IERR = SCF to L2 Slave Error Read: 0x66
[   42.302448]  SERR = Error response from slave: 0x12
[   42.302830]  Overflow (there may be more errors) - Uncorrectable
[   42.303287]  Uncorrectable (this is fatal)
[   42.305483]  MISC0 = 0x100000000400000
[   42.309419]  MISC1 = 0x40240000000
[   42.313090]  ADDR = 0x800000001404000c
[   42.316940] **************************************
[   42.321875] ras_core_serr_callback: Scanning Core Error Records for Uncorrectable Errors
[   42.329626] Bad mode in Error handler detected on CPU1, code 0xbe000000 -- SError
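
As a reading aid for the log above, the raw `ErrLog0` and `esr` values can be unpacked by hand. The sketch below assumes an `ErrLog0` bit layout (lock in bit 0, opcode in bits 4:1, error code in bits 10:8, Len1 in bits 27:16, format in bit 31) that was only checked against the two decoded entries in this log; it is not taken from NVIDIA documentation. The `esr` split into EC/IL/ISS follows the Arm ESR_ELx layout, where EC 0x2f is the SError interrupt class. The helper names are hypothetical.

```python
def bits(value, hi, lo):
    """Extract the inclusive bit range [hi:lo] from an integer."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


# Only the codes that actually appear in this log are named.
OPCODES = {0: "RD", 4: "WR"}
ERROR_CODES = {6: "TMO"}


def decode_errlog0(errlog0):
    """Split ErrLog0 using the assumed field layout described above."""
    opc = bits(errlog0, 4, 1)
    err = bits(errlog0, 10, 8)
    return {
        "lock": bits(errlog0, 0, 0),
        "opcode": OPCODES.get(opc, opc),
        "error_code": ERROR_CODES.get(err, err),
        "len1": bits(errlog0, 27, 16),
        "format": bits(errlog0, 31, 31),
    }


def decode_esr(esr):
    """Split an ESR_ELx value: EC in bits [31:26], IL in bit 25, ISS in bits [24:0]."""
    return {"ec": bits(esr, 31, 26), "il": bits(esr, 25, 25), "iss": bits(esr, 24, 0)}


if __name__ == "__main__":
    print(decode_errlog0(0x80030600))  # first BPMP-NOC record in the log
    print(decode_errlog0(0x80030608))  # second BPMP-NOC record in the log
    print(decode_esr(0xBE000000))      # esr value reported with every SError
```

Under these assumptions, both records decode to a target time-out with Len1 = 3, differing only in the opcode (read for the first, write for the second), which matches the text the kernel printed.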

The power-up circuit uses the non-MCU solution recommended by the OEM documentation. The circuit diagram is attached.
power up master.pdf (140.2 KB)

Hi,

Last time, another user hit this issue because of temperature. Could we rule that out first?

Our test environment is at room temperature, 20°C.

Is there any peripheral that exists on the devkit but was removed on your board?

Yanhou.LI and I are colleagues. We removed the following parts compared with the devkit:

1. EEPROM
9. PCIe x16 connector (replaced with a x8 PCIe interface connected to another Xavier)
12. M.2 Key E
2. CODEC and HD Audio header
17. AP Debug Connector
18. eSATA connector
20. microSD & UFS socket

We also replaced the USB Type-C ports with Type-A.

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one.
Thanks

Did you remember to make the corresponding changes in your device tree for the hardware changes above?