Orin NX reboots repeatedly but switches to slot B unexpectedly

Yes, and here are the boot logs of the problematic boots.

Crash in EL3 that occurs during the DXE phase of the UEFI:

[04:11:16.300] DtPlatformLoadDtb: Defaulting to UEFI DTB
[04:11:16.301] Processing "L4T Configuration Settings" DTB overlay
[04:11:16.303] Deleting fragment fragment@0
[04:11:16.303] Processing "P3767 Overlay Support" DTB overlay
[04:11:16.303] Deleting fragment fragment@0
[04:11:16.303] Deleting fragment fragment@1
[04:11:16.304] Deleting fragment fragment@2
[04:11:16.305] Deleting fragment fragment@3
[04:11:16.368] UpdateRamOopsMemory: RamOopsBase: 0x46EB70000, RamOopsSize: 0x200000
[04:11:16.369] FtpmProtocol Not Found - Not Found
[04:11:16.369] DisplayLocateChildGopHandle: failed to enumerate graphics output device handles: Not Found
[04:11:16.369] add-symbol-file /home/edk2/nvidia-uefi/Build/Jetson/DEBUG_GCC5/AARCH64/Silicon/NVIDIA/Drivers/UsbPadCtlDxe/UsbPadCtlDxe/DEBUG/UsbPadCtlDxe.dll 0x467C68000
[04:11:16.370] Loading driver at 0x00467C67000 EntryPoint=0x00467C6EC74 UsbPadCtlDxe.efi

[04:11:16.370] DeviceDiscoveryNotify: Couldn't get gNVIDIAPinMuxProtocolGuid Handle: Not Found
[04:11:16.417] add-symbol-file /home/edk2/nvidia-uefi/Build/Jetson/DEBUG_GCC5/AARCH64/Silicon/NVIDIA/Drivers/XusbControllerDxe/XusbControllerDxe/DEBUG/XusbControllerDxe.dll 0x467C5E000
[04:11:16.421] Loading driver at 0x00467C5D000 EntryPoint=0x00467C62320 XusbControllerDxe.efi

[04:11:16.421] add-symbol-file /home/edk2/nvidia-uefi/Build/Jetson/DEBUG_GCC5/AARCH64/Silicon/NVIDIA/Drivers/SdMmcControllerDxe/SdMmcControllerDxe/DEBUG/SdMmcControllerDxe.dll 0x467C53000
[04:11:16.422] Loading driver at 0x00467C52000 EntryPoint=0x00467C57F08 SdMmcControllerDxe.efi

[04:11:16.466] add-symbol-file /home/edk2/nvidia-uefi/Build/Jetson/DEBUG_GCC5/AARCH64/Silicon/NVIDIA/Drivers/AndroidBootDxe/AndroidBootDxe/DEBUG/AndroidBootDxe.dll 0x467C47000
[04:11:16.469] Loading driver at 0x00467C46000 EntryPoint=0x00467C4D3A8 AndroidBootDxe.efi

[04:11:16.470] add-symbol-file /home/edk2/nvidia-uefi/Build/Jetson/DEBUG_GCC5/AARCH64/Silicon/NVIDIA/Drivers/PcieControllerDxe/PcieControllerDxe/DEBUG/PcieControllerDxe.dll 0x467C34000
[04:11:16.470] Loading driver at 0x00467C33000 EntryPoint=0x00467C3F240 PcieControllerDxe.efi

[04:11:16.785] PCIe Controller-1 Link is DOWN
[04:11:17.089] PCIe Controller-4 Link is DOWN
[04:11:17.392] add-symbol-file /home/edk2/nvidia-uefi/Build/Jetson/DEBUG_GCC5/AARCH64/SecurityPkg/Hash2DxeCrypto/Hash2DxeCrypto/DEBUG/Hash2DxeCrypto.dll 0x467C2A000
[04:11:17.445] Loading driver at 0x00467C29000 EntryPoint=0x00467C2ED9C Hash2DxeCrypto.efi

[04:11:17.445] add-symbol-file /home/edk2/nvidia-uefi/Build/Jetson/DEBUG_GCC5/AARCH64/Silicon/NVIDIA/Drivers/EqosDeviceDxe/EqosDeviceDxe/DEBUG/EqosDeviceDxe.dll 0x467B90000
[04:11:17.446] Loading driver at 0x00467B8F000 EntryPoint=0x00467B964A8 EqosDeviceDxe.efi

[04:11:17.446] add-symbol-file /home/edk2/nvidia-uefi/Build/Jetson/DEBUG_GCC5/AARCH64/Silicon/NVIDIA/Drivers/NonDiscoverablePciDeviceDxe/NonDiscoverablePciDeviceDxe/DEBUG/NonDiscoverablePciDeviceDxe.dll 0x467B85000
[04:11:17.496] Loading driver at 0x00467B84000 EntryPoint=0x00467B8A8CC NonDiscoverablePciDeviceDxe.efi

[04:11:17.498] add-symbol-file /home/edk2/nvidia-uefi/Build/Jetson/DEBUG_GCC5/AARCH64/Silicon/NVIDIA/Tegra/T194/Drivers/T194GraphicsOutputDxe/T194GraphicsOutputDxe/DEBUG/T194GraphicsOutputDxe.dll 0x467B78000
[04:11:17.499] Loading driver at 0x00467B77000 EntryPoint=0x00467B7EFE4 T194GraphicsOutputDxe.efi

[04:11:17.499] add-symbol-file /home/edk2/nvidia-uefi/Build/Jetson/DEBUG_GCC5/AARCH64/FmpDevicePkg/FmpDxe/7C374309-1649-4682-8BEE-04F3A8399414/DEBUG/FmpDxe.dll 0x467A9B000
[04:11:17.538] Loading driver at 0x00467A9A000 EntryPoint=0x00467AA5648 FmpDxe.efi

[04:11:17.538] GetFuseSettings: Error getting TegraPlatformSpec size: Not Found
[04:11:17.539] GetTnSpec: Error getting TegraPlatformCompatSpec size: Not Found
[04:11:17.539] VerPartitionGetVersion: Crc mismatch expected=0x0, received=0xB13B1EC2
[04:11:17.540] GetVersionInfo: Failed to parse version info: Volume Corrupt
[04:11:17.540] VerPartitionGetVersion: Crc mismatch expected=0x0, received=0xB13B1EC2
[04:11:17.578] GetVersionInfo: Failed to parse version info: Volume Corrupt
[04:11:17.579] add-symbol-file /home/edk2/nvidia-uefi/Build/Jetson/DEBUG_GCC5/AARCH64/MdeModulePkg/Bus/Pci/PciHostBridgeDxe/PciHostBridgeDxe/DEBUG/PciHostBridgeDxe.dll 0x467A88000
[04:11:17.579] Loading driver at 0x00467A87000 EntryPoint=0x00467A9157C PciHostBridgeDxe.efi

[04:11:17.865] [ext4] Needs journal recovery, mounting read-only
[04:11:18.009] Installed Fat filesystem on 4647E8E18
[04:11:23.017] NvmExpressPassThru: Timeout occurs for an NVMe command.
[04:11:23.110] Unhandled Exception in EL3.
[04:11:23.111] x30            = 0x0000000050000c50
[04:11:23.111] x0             = 0x0000000000000000
[04:11:23.111] x1             = 0x00000000be000011
[04:11:23.111] x2             = 0x0000000000000000
[04:11:23.112] x3             = 0x0000000000000011
[04:11:23.113] x4             = 0x0000000000100000
[04:11:23.113] x5             = 0x000000046e9fdf98
[04:11:23.113] x6             = 0x0000000401000000
[04:11:23.114] x7             = 0x0000000401000000
[04:11:23.114] x8             = 0x0000000000000000
[04:11:23.114] x9             = 0x0000000041223020
[04:11:23.114] x10            = 0x000000000003073d
[04:11:23.114] x11            = 0x0000000000000000
[04:11:23.154] x12            = 0x0000000000000002
[04:11:23.155] x13            = 0x0000000000000002
[04:11:23.156] x14            = 0x0000000000000001
[04:11:23.157] x15            = 0x00000000000000ff
[04:11:23.158] x16            = 0x00000004686c0dbc
[04:11:23.158] x17            = 0x0000000020a983c7
[04:11:23.158] x18            = 0x00000004686cc280
[04:11:23.158] x19            = 0x0000000000000000
[04:11:23.159] x20            = 0x00000004648190a0
[04:11:23.159] x21            = 0x0000000000000001
[04:11:23.159] x22            = 0x0000000467c3f218
[04:11:23.159] x23            = 0x0000000000000001
[04:11:23.160] x24            = 0x0000000000000001
[04:11:23.197] x25            = 0x000000046e9fe136
[04:11:23.200] x26            = 0x0000000000000002
[04:11:23.203] x27            = 0xffff0000f0000003
[04:11:23.204] x28            = 0x0000000000000002
[04:11:23.205] x29            = 0x000000046e9fdfa0
[04:11:23.206] scr_el3        = 0x000000000003073d
[04:11:23.207] sctlr_el3      = 0x0000000030cd183f
[04:11:23.207] cptr_el3       = 0x0000000000000000
[04:11:23.208] tcr_el3        = 0x0000000080823518
[04:11:23.208] daif           = 0x00000000000002c0
[04:11:23.208] mair_el3       = 0x00000000004404ff
[04:11:23.208] spsr_el3       = 0x00000000600003c9
[04:11:23.208] elr_el3        = 0x00000004686c6280
[04:11:23.209] ttbr0_el3      = 0x0000000050026341
[04:11:23.241] esr_el3        = 0x00000000be000011
[04:11:23.242] far_el3        = 0x0000000000000000
[04:11:23.243] spsr_el1       = 0x0000000000000000
[04:11:23.244] elr_el1        = 0x0000000000000000
[04:11:23.244] spsr_abt       = 0x0000000000000000
[04:11:23.244] spsr_und       = 0x0000000000000000
[04:11:23.244] spsr_irq       = 0x0000000000000000
[04:11:23.245] spsr_fiq       = 0x0000000000000000
[04:11:23.245] sctlr_el1      = 0x0000000030d00800
[04:11:23.245] actlr_el1      = 0x0000000000000000
[04:11:23.245] cpacr_el1      = 0x0000000000300000
[04:11:23.245] csselr_el1     = 0x0000000000000000
[04:11:23.245] sp_el1         = 0x0000000000000000
[04:11:23.284] esr_el1        = 0x0000000000000000
[04:11:23.285] ttbr0_el1      = 0x0000000000000000
[04:11:23.286] ttbr1_el1      = 0x0000000000000000
[04:11:23.286] mair_el1       = 0x0000000000000000
[04:11:23.287] amair_el1      = 0x0000000000000000
[04:11:23.287] tcr_el1        = 0x0000000000000000
[04:11:23.287] tpidr_el1      = 0x0000000000000000
[04:11:23.287] tpidr_el0      = 0x0000000080000000
[04:11:23.288] tpidrro_el0    = 0x0000000000000000
[04:11:23.288] par_el1        = 0x0000000000000800
[04:11:23.288] mpidr_el1      = 0x0000000081000000
[04:11:23.288] afsr0_el1      = 0x0000000000000000
[04:11:23.288] afsr1_el1      = 0x0000000000000000
[04:11:23.288] contextidr_el1 = 0x0000000000000000
[04:11:23.328] vbar_el1       = 0x0000000000000000
[04:11:23.329] cntp_ctl_el0   = 0x0000000000000005
[04:11:23.329] cntp_cval_el0  = 0x000000001e531281
[04:11:23.330] cntv_ctl_el0   = 0x0000000000000000
[04:11:23.331] cntv_cval_el0  = 0x0000000000000000
[04:11:23.331] cntkctl_el1    = 0x0000000000000000
[04:11:23.331] sp_el0         = 0x00000004686cc280
[04:11:23.332] isr_el1        = 0x0000000000000040
[04:11:23.332] cpuectlr_el1   = 0xa000000b40543000
[04:11:23.332] gicd_ispendr regs (Offsets 0x200 - 0x278)
[04:11:23.332]  Offset:                      value
[04:11:23.332] 0000000000000200:             0x0000000000000000
[04:11:23.333] 0000000000000204:             0x0000000000000000
[04:11:23.371] 0000000000000208:             0x0000000000000000
[04:11:23.372] 000000000000020c:             0x0000000000000000
[04:11:23.372] 0000000000000210:             0x0000000000000000
[04:11:23.373] 0000000000000214:             0x0000000000000000
[04:11:23.373] 0000000000000218:             0x0000000000010000
[04:11:23.373] 000000000000021c:             0x0000000000020000
[04:11:23.373] 0000000000000220:             0x0000000000000000
[04:11:23.374] 0000000000000224:             0x0000000000000000
[04:11:23.374] 0000000000000228:             0x0000000000000000
[04:11:23.374] 000000000000022c:             0x0000000000000000
[04:11:23.375] 0000000000000230:             0x0000000000000000
[04:11:23.376] 0000000000000234:             0x0000000000000000
[04:11:23.376] 0000000000000238:             0x0000000000000000
[04:11:23.415] 000000000000023c:             0x0000000000000000
[04:11:23.415] 0000000000000240:             0x0000000000000000
[04:11:23.415] 0000000000000244:             0x0000000000000000
[04:11:23.415] 0000000000000248:             0x0000000000000000
[04:11:23.416] 000000000000024c:             0x0000000000000000
[04:11:23.416] 0000000000000250:             0x0000000000000000
[04:11:23.416] 0000000000000254:             0x0000000000000000
[04:11:23.416] 0000000000000258:             0x0000000000000000
[04:11:23.416] 000000000000025c:             0x0000000000000000
[04:11:23.417] 0000000000000260:             0x0000000000000000
[04:11:23.417] 0000000000000264:             0x0000000000000000
[04:11:23.417] 0000000000000268:             0x0000000000000000
[04:11:23.417] 000000000000026c:             0x0000000000000000
[04:11:23.441] 0000000000000270:             0x0000000000000000
[04:11:23.442] 0000000000000274:             0x0000000000000000
[04:11:23.443] 0000000000000278:             0x0000000000000000

Another very similar crash that occurred several days ago:

[15:13:20.206] add-symbol-file /sources/bootloader/nvidia-uefi/Build/Jetson/DEBUG_GCC5/AARCH64/Silicon/NVIDIA/Drivers/PcieControllerDxe/PcieControllerDxe/DEBUG/PcieControllerDxe.dll 0x467C0A000
[15:13:20.207] Loading driver at 0x00467C09000 EntryPoint=0x00467C150E8 PcieControllerDxe.efi

[15:13:20.528] PCIe Controller-1 Link is DOWN
[15:13:20.832] PCIe Controller-4 Link is DOWN
[15:13:21.182] add-symbol-file /sources/bootloader/nvidia-uefi/Build/Jetson/DEBUG_GCC5/AARCH64/SecurityPkg/Hash2DxeCrypto/Hash2DxeCrypto/DEBUG/Hash2DxeCrypto.dll 0x467C00000
[15:13:21.183] Loading driver at 0x00467BFF000 EntryPoint=0x00467C04D40 Hash2DxeCrypto.efi

[15:13:21.184] add-symbol-file /sources/bootloader/nvidia-uefi/Build/Jetson/DEBUG_GCC5/AARCH64/Silicon/NVIDIA/Drivers/EqosDeviceDxe/EqosDeviceDxe/DEBUG/EqosDeviceDxe.dll 0x467B66000
[15:13:21.186] Loading driver at 0x00467B65000 EntryPoint=0x00467B6C4E8 EqosDeviceDxe.efi

[15:13:21.187] add-symbol-file /sources/bootloader/nvidia-uefi/Build/Jetson/DEBUG_GCC5/AARCH64/Silicon/NVIDIA/Drivers/NonDiscoverablePciDeviceDxe/NonDiscoverablePciDeviceDxe/DEBUG/NonDiscoverablePciDeviceDxe.dll 0x467B5B000
[15:13:21.227] Loading driver at 0x00467B5A000 EntryPoint=0x00467B60890 NonDiscoverablePciDeviceDxe.efi

[15:13:21.228] add-symbol-file /sources/bootloader/nvidia-uefi/Build/Jetson/DEBUG_GCC5/AARCH64/Silicon/NVIDIA/Tegra/T194/Drivers/T194GraphicsOutputDxe/T194GraphicsOutputDxe/DEBUG/T194GraphicsOutputDxe.dll 0x467B4E000
[15:13:21.230] Loading driver at 0x00467B4D000 EntryPoint=0x00467B54F8C T194GraphicsOutputDxe.efi

[15:13:21.278] add-symbol-file /sources/bootloader/nvidia-uefi/Build/Jetson/DEBUG_GCC5/AARCH64/FmpDevicePkg/FmpDxe/7C374309-1649-4682-8BEE-04F3A8399414/DEBUG/FmpDxe.dll 0x467A73000
[15:13:21.279] Loading driver at 0x00467A72000 EntryPoint=0x00467A7C354 FmpDxe.efi

[15:13:21.280] add-symbol-file /sources/bootloader/nvidia-uefi/Build/Jetson/DEBUG_GCC5/AARCH64/MdeModulePkg/Bus/Pci/PciHostBridgeDxe/PciHostBridgeDxe/DEBUG/PciHostBridgeDxe.dll 0x467A61000
[15:13:21.280] Loading driver at 0x00467A60000 EntryPoint=0x00467A6A42C PciHostBridgeDxe.efi

[15:13:21.682] LocatePciExpressCapabilityRegBlock: [01|00|00] failed to access config space at offset 0x100
[15:13:22.404] Unhandled Exception in EL3.
[15:13:22.405] x30            = 0x0000000050000c50
[15:13:22.405] x0             = 0x0000000000000000
[15:13:22.406] x1             = 0x00000000be000011
[15:13:22.406] x2             = 0x0000000000000000
[15:13:22.407] x3             = 0x0000000000000011
[15:13:22.407] x4             = 0x0000000000100000
[15:13:22.408] x5             = 0x000000046e9fe568
[15:13:22.408] x6             = 0x0000002001000000
[15:13:22.408] x7             = 0x0000002001000000
[15:13:22.409] x8             = 0x0000000000000000
[15:13:22.409] x9             = 0x0000000041223020
[15:13:22.409] x10            = 0x000000000003073d
[15:13:22.410] x11            = 0x000c010200000000
[15:13:22.447] x12            = 0x000000000a0341d0
[15:13:22.447] x13            = 0xff7f000000060101
[15:13:22.447] x14            = 0x00001417ffff0000
[15:13:22.447] x15            = 0x41d0000c01020000
[15:13:22.448] x16            = 0x000000046869fdb0
[15:13:22.448] x17            = 0x000000009bd90d09
[15:13:22.448] x18            = 0x00000004686ab2f0
[15:13:22.448] x19            = 0x0000000000000000
[15:13:22.448] x20            = 0x00000004647d41a0
[15:13:22.448] x21            = 0x0000000000000002
[15:13:22.448] x22            = 0x0000000467c150c0
[15:13:22.448] x23            = 0x0000000000000001
[15:13:22.448] x24            = 0x0000000000000001
[15:13:22.490] x25            = 0x0000000467a6e439
[15:13:22.491] x26            = 0x0000000000000000
[15:13:22.491] x27            = 0x000000046e9fe6e0
[15:13:22.492] x28            = 0x0000000000000004
[15:13:22.492] x29            = 0x000000046e9fe570
[15:13:22.492] scr_el3        = 0x000000000003073d
[15:13:22.493] sctlr_el3      = 0x0000000030cd183f
[15:13:22.493] cptr_el3       = 0x0000000000000000
[15:13:22.493] tcr_el3        = 0x0000000080823518
[15:13:22.494] daif           = 0x00000000000002c0
[15:13:22.494] mair_el3       = 0x00000000004404ff
[15:13:22.494] spsr_el3       = 0x00000000600002c9
[15:13:22.495] elr_el3        = 0x0000000467a63c18
[15:13:22.495] ttbr0_el3      = 0x0000000050026341
[15:13:22.534] esr_el3        = 0x00000000be000011
[15:13:22.534] far_el3        = 0x0000000000000000
[15:13:22.535] spsr_el1       = 0x0000000000000000
[15:13:22.535] elr_el1        = 0x0000000000000000
[15:13:22.536] spsr_abt       = 0x0000000000000000
[15:13:22.536] spsr_und       = 0x0000000000000000
[15:13:22.536] spsr_irq       = 0x0000000000000000
[15:13:22.537] spsr_fiq       = 0x0000000000000000
[15:13:22.537] sctlr_el1      = 0x0000000030d00800
[15:13:22.537] actlr_el1      = 0x0000000000000000
[15:13:22.538] cpacr_el1      = 0x0000000000300000
[15:13:22.538] csselr_el1     = 0x0000000000000000
[15:13:22.538] sp_el1         = 0x0000000000000000
[15:13:22.577] esr_el1        = 0x0000000000000000
[15:13:22.578] ttbr0_el1      = 0x0000000000000000
[15:13:22.578] ttbr1_el1      = 0x0000000000000000
[15:13:22.579] mair_el1       = 0x0000000000000000
[15:13:22.579] amair_el1      = 0x0000000000000000
[15:13:22.581] tcr_el1        = 0x0000000000000000
[15:13:22.582] tpidr_el1      = 0x0000000000000000
[15:13:22.582] tpidr_el0      = 0x0000000080000000
[15:13:22.582] tpidrro_el0    = 0x0000000000000000
[15:13:22.582] par_el1        = 0x0000000000000800
[15:13:22.582] mpidr_el1      = 0x0000000081000000
[15:13:22.582] afsr0_el1      = 0x0000000000000000
[15:13:22.582] afsr1_el1      = 0x0000000000000000
[15:13:22.582] contextidr_el1 = 0x0000000000000000
[15:13:22.621] vbar_el1       = 0x0000000000000000
[15:13:22.621] cntp_ctl_el0   = 0x0000000000000005
[15:13:22.622] cntp_cval_el0  = 0x0000000016309e46
[15:13:22.622] cntv_ctl_el0   = 0x0000000000000000
[15:13:22.622] cntv_cval_el0  = 0x0000000000000000
[15:13:22.623] cntkctl_el1    = 0x0000000000000000
[15:13:22.623] sp_el0         = 0x00000004686ab2f0
[15:13:22.624] isr_el1        = 0x0000000000000040
[15:13:22.624] cpuectlr_el1   = 0xa000000b40543000
[15:13:22.624] gicd_ispendr regs (Offsets 0x200 - 0x278)
[15:13:22.625]  Offset:                      value
[15:13:22.625] 0000000000000200:             0x0000000000000000
[15:13:22.625] 0000000000000204:             0x0000000000000000
[15:13:22.664] 0000000000000208:             0x0000000000000000
[15:13:22.664] 000000000000020c:             0x0000000000000000
[15:13:22.665] 0000000000000210:             0x0000000000000000
[15:13:22.665] 0000000000000214:             0x0000000000000000
[15:13:22.666] 0000000000000218:             0x0000000000010000
[15:13:22.666] 000000000000021c:             0x0000000000020000
[15:13:22.666] 0000000000000220:             0x0000000000000000
[15:13:22.667] 0000000000000224:             0x0000000000000000
[15:13:22.667] 0000000000000228:             0x0000000000000000
[15:13:22.668] 000000000000022c:             0x0000000000000000
[15:13:22.668] 0000000000000230:             0x0000000000000000
[15:13:22.668] 0000000000000234:             0x0000000000000000
[15:13:22.669] 0000000000000238:             0x0000000000000000
[15:13:22.708] 000000000000023c:             0x0000000000000000
[15:13:22.708] 0000000000000240:             0x0000000000000000
[15:13:22.708] 0000000000000244:             0x0000000000000000
[15:13:22.709] 0000000000000248:             0x0000000000000000
[15:13:22.709] 000000000000024c:             0x0000000000000000
[15:13:22.710] 0000000000000250:             0x0000000000000000
[15:13:22.710] 0000000000000254:             0x0000000000000000
[15:13:22.710] 0000000000000258:             0x0000000000000000
[15:13:22.711] 000000000000025c:             0x0000000000000000
[15:13:22.711] 0000000000000260:             0x0000000000000000
[15:13:22.712] 0000000000000264:             0x0000000000000000
[15:13:22.712] 0000000000000268:             0x0000000000000000
[15:13:22.712] 000000000000026c:             0x0000000000000000
[15:13:22.734] 0000000000000270:             0x0000000000000000
[15:13:22.734] 0000000000000274:             0x0000000000000000
[15:13:22.735] 0000000000000278:             0x0000000000000000
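
Both dumps show the same two non-zero gicd_ispendr words, at offsets 0x218 and 0x21c. As a side note, here is a minimal Python sketch that turns such a dump into interrupt IDs, assuming the standard GIC distributor layout where GICD_ISPENDR<n> sits at offset 0x200 + 4*n and bit m of that register is INTID 32*n + m. The function name is ours, and which Tegra peripherals these IDs belong to is not something the logs tell us.

```python
def pending_intids(ispendr):
    """Map {register offset: value} pairs from a gicd_ispendr dump to interrupt IDs."""
    ids = []
    for offset, value in sorted(ispendr.items()):
        n = (offset - 0x200) // 4  # register index n in GICD_ISPENDR<n>
        for bit in range(32):
            if value & (1 << bit):
                ids.append(32 * n + bit)
    return ids

# The two non-zero words that appear in both crash dumps above.
dump = {0x218: 0x00010000, 0x21C: 0x00020000}
print(pending_intids(dump))  # [208, 241]
```

Since SPIs start at INTID 32, these would be SPI 176 and SPI 209. The same two interrupts are pending in both crashes, which may or may not be related to the SError.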

Here are our observations:

  • In both cases X30/LR is 0x0000000050000c50, so the routine that crashes is called from the same location.
  • The last log line before the crash differs: in one case it is "NvmExpressPassThru: Timeout occurs for an NVMe command." and in the other it is "LocatePciExpressCapabilityRegBlock: [01|00|00] failed to access config space at offset 0x100".
  • When the crash occurs, the message "PCIe Controller-4 Link is DOWN" is displayed. This is not the case during a normal/working boot.

To perform the reboot cycles without reaching the kernel or user space:

We modified UEFI to trigger a reboot by calling the RuntimeServiceResetSystem function in edk2/MdeModulePkg/Universal/ResetSystemRuntimeDxe/ResetSystem.c.

This in turn calls the ResetCold function in ./ArmPkg/Library/ArmSmcPsciResetSystemLib/ArmSmcPsciResetSystemLib.c, which invokes tegra_soc_prepare_system_reset via an SMC instruction.

In tegra_soc_prepare_system_reset we added a modification that clears the scratch register RSV109_lo; this forces the device to always use slot 0 no matter what. Then we let it run until we see a crash. :-)

I can send our patches to replicate the boot loop.

The "PCIe Controller-4 Link is DOWN" message appears 20-30 times in ~5000 boot cycles, and it is always present when there is a crash. We are going to dig deeper into this, but any help would be greatly appreciated.
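
For anyone who wants to check the same correlation on their own captures, here is a minimal Python sketch that tallies boots by whether the link-down message and the EL3 exception appear. It assumes the serial capture has already been split into one string per boot; how to split it depends on the capture setup and is not shown. The sample boots below are made up for illustration only.

```python
from collections import Counter

LINK_DOWN = "PCIe Controller-4 Link is DOWN"
CRASH = "Unhandled Exception in EL3"

def tally(boot_logs):
    """Count boots keyed by (link_down_seen, crash_seen)."""
    counts = Counter()
    for text in boot_logs:
        counts[(LINK_DOWN in text, CRASH in text)] += 1
    return counts

# Made-up sample: "boot continues" is a placeholder, not a real log line.
boots = [
    "PCIe Controller-1 Link is DOWN\nboot continues",
    "PCIe Controller-4 Link is DOWN\nUnhandled Exception in EL3.",
    "PCIe Controller-4 Link is DOWN\nboot continues",
]
counts = tally(boots)
print(counts[(True, True)], counts[(True, False)], counts[(False, True)])  # 1 1 0
```

A non-zero count for (False, True) would mean a crash without the Controller-4 link-down message, which would contradict what we have seen so far.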

I am also sharing a log that shows the unhandled exception at EL3 with the same behavior:

The PCIe controller link is down; please note that in my case it is Controller-1, though.
X30/LR is also 0x0000000050000c50, so this makes at least 3 occurrences.

From the crash dump
ELR_EL3 = 0x46a305280
Exception Class (EC): the top byte of ESR_EL3 is 0xBE, so EC (bits 31:26) is 0x2F, which corresponds to an SError interrupt
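
The dumps earlier in this thread show esr_el3 = 0x00000000be000011. A minimal Python sketch to split that value into fields, assuming the Armv8-A ESR_ELx layout (EC in bits [31:26], IL in bit 25, ISS in bits [24:0]); the function name is ours:

```python
def decode_esr(esr):
    """Split an ESR_ELx value into its EC, IL and ISS fields."""
    return {
        "ec": (esr >> 26) & 0x3F,   # exception class, bits [31:26]
        "il": (esr >> 25) & 0x1,    # instruction length bit, bit 25
        "iss": esr & 0x1FFFFFF,     # instruction specific syndrome, bits [24:0]
    }

fields = decode_esr(0xBE000011)
print({k: hex(v) for k, v in fields.items()})  # {'ec': '0x2f', 'il': '0x1', 'iss': '0x11'}
```

0xBE is the top byte of the register (EC plus IL plus ISS bit 24); the 6-bit EC itself is 0x2F, the SError class. For an SError, the low six ISS bits are DFSC, and 0x11 is the "asynchronous SError interrupt" encoding, so the syndrome does not identify the access that triggered it.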

Thanks

uefi_freeze_nvme_controller_dump-el3.log (73.3 KB)

BTW, please refer to the jetson-linux-r3562 page for an additional file, overlay_mb1bct_35.x.tbz2, which also resolves some boot failures.

Hi,

We do have those overlays integrated in our build. In fact, we first took the initial drop, https://developer.nvidia.com/downloads/embedded/L4T/r35_Release_v6.2/overlay_mb1bct_35.6.2.tbz2, which for some weird reason contained macOS and Safari files and was missing overlays; it was then replaced via a shadow drop by https://developer.nvidia.com/downloads/embedded/L4T/overlay_mb1bct_35.x.tbz2

This did not help in any way in this use case, though. Can you please investigate this matter more seriously on actual hardware? Can the internal team also check the problematic SKUs as provided by Martin? Since the ratio of affected modules is significant and you must have access to a large range of modules for your QA, you should be able to make an attempt to reproduce it. Did the internal team even try?

Thanks

Hi,

We are able to reproduce this on an official devkit (p3509) using the vanilla sample BSP 35.6.2 as provided by Nvidia, without any change from our end. So both the hardware and the software stack are unmodified by us.

Has the internal team made any progress in attempting to reproduce this issue?

Is there anything that can be shared? Can you extract some meaningful information from the provided logs?

Thanks

hello sebastien.schertenleib,

Hold on, P3509 is the Xavier NX carrier board.
As mentioned in the Jetson FAQ | NVIDIA Developer,
let me recap as below:

Jetson Orin NX & Jetson Orin Nano series modules are not pin-compatible with Jetson Xavier NX series modules, but you can design a carrier board for the I/Os they have in common, such that both modules are supported.

Yes, we are aware of it. We are currently trying to reproduce the issue on a Nano devkit, but for some reason A/B mode is half broken with the same set of commands (apart from the target).

At first we were using the profile jetson-orin-nano-devkit-nvme for the Orin NX on the Nano devkit, but we were told on these forums to use jetson-orin-nano-devkit instead. The behavior seems worse in the latter case.

Can you please confirm the exact profile to use for an Orin NX module with a Nano devkit?

hello sebastien.schertenleib,

According to the developer guide, Jetson Modules and Configurations,
please use jetson-orin-nano-devkit.conf.

May I know what the worse behavior is that you've seen?

Well, with this profile we have:

ansible@tegra-ubuntu:~$ sudo nvbootctrl -t bootloader dump-slots-info
Current version: 35.6.2
Capsule update status: 0
Current bootloader slot: A
Active bootloader slot: A
num_slots: 2
slot: 0,             status: normal
slot: 1,             status: normal
ansible@tegra-ubuntu:~$ sudo nvbootctrl -t rootfs dump-slots-info
RootFS A/B is not enabled.

However, rootfs B exists on the board. With the same commands on the Xavier devkit, the AGX Orin devkit, and our own carrier board, it works.

Hi,
The issue looks specific to certain Orin NX modules. Please help check whether you can reproduce the issue on the module + Orin Nano carrier board with Jetpack 6.2.1. We would like to clarify whether it is specific to the module (HW) rather than to the Jetpack version (SW).

Hello, just a quick status update concerning the crash reported earlier. We have been testing the boot loop over the last week, and so far we can tell the following:

  • We tried increasing and decreasing the delays in the HostPrepare function in PcieControllerDxe.c, to no avail.

  • We are sure the crash does not occur before reaching NvmeControllerInit in NvmeControllerHci.c.

We are now fully focused on instrumenting NvmeControllerHci.c.

So we are now able to flash an Orin NX on a Nano devkit with A/B for 35.6.2 and 36.4.4; an export ROOTFS_AB=1 was missing in our set of commands. Surprisingly, the same commands do work for other targets, just not for the Nano. We are now going to stress test it to see if the same issue arises.

Is internal able to reproduce it?

There is now another dev reporting the same issue on JP6: JP6.2 orin NX UEFI reported an exception during startup - #4 by KevinFFF

I think it has become obvious that the problem is not related to a custom board or even to a specific JP release, as we have the same problem using an official devkit and the stock Nvidia BSP.

Can you let us know what has been done internally, as this problem is now reported by many devs?

Hi,
We tested on r36.4.4 + Orin NX and did not observe the issue, so we suspect it is specific to certain modules.

Our test module is 699-13767-0000-300 J.1

Given the ratio of affected Orin NX modules, you should test with more than one module, and probably with one made recently. How many reboots have you done?

We do have plenty of modules that we could potentially lend. I am going to check with our management and our Nvidia representative.

Hi, is there a document that specifies the power-on requirements of the Orin NX module, such as voltage, current and so on? Thanks

I got the log when the crash happened.

ASSERT [DxeCore] /home/user/edk2/edk2-docker/nvidia-uefi/edk2/MdeModulePkg/Core/Dxe/DxeMain/DxeMain.c(562

How did that happen?

Is there any update on your end? Did you try to reproduce this phenomenon using more than one Orin NX module? Did you try to perform a stress test over thousands of reboots? Can you check where we could send the complete hardware and software stack so that Nvidia engineers get a reproducible environment?

Thanks

Hi @sebastien.schertenleib if you put the specific Orin NX module to Orin Nano carrier board and flash Jetpack 6.2.1, do you observe the issue?

We don’t observe the issue with the modules we have. But we don’t have modules with the new PCN, so we are not able to test them. It would be great if you could help test and share the result with us.

Hello,

At the moment, we are in the process of stress testing 2 problematic modules using the official Nano devkit on 35.6.2. We have not tried 6.2.1. In any case, we need to use 35.6.2, as our products embed both AGX Xavier and Orin NX; we want to keep them aligned, and Xavier is not supported in JP6.

We can share the result but this won’t help us to find a solution if you are unable to test afterward.

What would be the best approach so that Nvidia can seriously investigate this matter? Can you provide us with a location where we could send a full hardware and software stack to someone who will look at this issue in depth?

Maybe not on your end, but surely there must be someone in the R&D/QA departments who is able to check those units.