Boot hangs after MB2(TBoot-BPMP) done

We are having issues with machines restarting suddenly during normal operation. The following is the log after restart:

[0000.137] I> Welcome to MB2(TBoot-BPMP)(version: 01.00.160913-t186-M-00.00-mobile-e75fdd51)
[0000.145] I> Boot-device: eMMC
[0000.153] I> sdmmc bdev is already initialized
[0000.157] I> pmic: reset reason (nverc)        : 0x0
[0000.190] I> Found 19 partitions in SDMMC_BOOT (instance 3)
[0000.210] I> Found 33 partitions in SDMMC_USER (instance 3)
[0000.216] W> No valid slot number is found in scratch register
[0000.222] W> Return default slot: _a
[0000.225] I> A/B: bin_type (16) slot 0
[0000.229] I> Loading partition bpmp-fw at 0xd7800000
[0000.234] I> Reading two headers - addr:0xd7800000 blocks:1
[0000.239] I> Addr: 0xd7800000, start-block: 29417480, num_blocks: 1
[0000.264] I> Binary(16) of size 534416 is loaded @ 0xd7800000
[0000.270] W> No valid slot number is found in scratch register
[0000.275] W> Return default slot: _a
[0000.279] I> A/B: bin_type (17) slot 0
[0000.282] I> Loading partition bpmp-fw-dtb at 0xd79f0000
[0000.287] I> Reading two headers - addr:0xd79f0000 blocks:1
[0000.293] I> Addr: 0xd79f0000, start-block: 29419896, num_blocks: 1
[0000.312] I> Binary(17) of size 113040 is loaded @ 0xd79e4400
[0000.340] I> Loading SCE-FW ...
[0000.343] W> No valid slot number is found in scratch register
[0000.349] W> Return default slot: _a
[0000.352] I> A/B: bin_type (12) slot 0
[0000.356] I> Loading partition sce-fw at 0xd7300000
[0000.360] I> Reading two headers - addr:0xd7300000 blocks:1
[0000.366] I> Addr: 0xd7300000, start-block: 29423992, num_blocks: 1
[0000.385] I> Binary(12) of size 125632 is loaded @ 0xd7300000
[0000.391] I> Init SCE
[0000.393] I> Loading APE-FW ...
[0000.396] W> No valid slot number is found in scratch register
[0000.402] W> Return default slot: _a
[0000.405] I> A/B: bin_type (11) slot 0
[0000.409] I> Loading partition adsp-fw at 0xd7400000
[0000.414] I> Reading two headers - addr:0xd7400000 blocks:1
[0000.419] I> Addr: 0xd7400000, start-block: 29401096, num_blocks: 1
[0000.438] I> Binary(11) of size 77216 is loaded @ 0xd7400000
[0000.444] I> Copy BTCM section
[0000.447] W> No valid slot number is found in scratch register
[0000.452] W> Return default slot: _a
[0000.456] I> A/B: bin_type (13) slot 0
[0000.459] I> Loading partition cpu-bootloader at 0x96000000
[0000.465] I> Reading two headers - addr:0x96000000 blocks:1
[0000.470] I> Addr: 0x96000000, start-block: 29380616, num_blocks: 1
[0000.492] I> Binary(13) of size 308784 is loaded @ 0x96000000
[0000.497] W> No valid slot number is found in scratch register
[0000.503] W> Return default slot: _a
[0000.506] I> A/B: bin_type (20) slot 0
[0000.510] I> Loading partition bootloader-dtb at 0x85205400
[0000.515] I> Reading two headers - addr:0x85205400 blocks:1
[0000.521] I> Addr: 0x85205400, start-block: 29382664, num_blocks: 1
[0000.541] I> Binary(20) of size 192848 is loaded @ 0x85205400
[0000.546] W> No valid slot number is found in scratch register
[0000.552] W> Return default slot: _a
[0000.555] I> A/B: bin_type (14) slot 0
[0000.559] I> Loading partition secure-os at 0x85305600
[0000.564] I> Reading two headers - addr:0x85305600 blocks:1
[0000.569] I> Addr: 0x85305600, start-block: 29384712, num_blocks: 1
[0000.592] I> Binary(14) of size 402864 is loaded @ 0x85305600
[0000.599] I> TOS boot-params @ 0x85000000
[0000.603] I> TOS params prepared
[0000.606] I> Loading EKS ...
[0000.609] I> A/B: bin_type (15) slot 0
[0000.613] I> Loading partition eks at 0x85905800
[0000.617] I> Reading two headers - addr:0x85905800 blocks:1
[0000.622] I> Addr: 0x85905800, start-block: 29397000, num_blocks: 1
[0000.641] I> Binary(15) of size 1040 is loaded @ 0x85905800
[0000.647] I> EKB detected (length: 0x400) @ 0x85905800
[0000.652] I> Copied encrypted keys
[0000.655] I> boot profiler @ 0x175844000
[0000.659] I> boot profiler for TOS @ 0x175844000
[0000.664] I> Unhalting SCE
[0000.667] I> Primary Memory Start:80000000 Size:70000000
[0000.672] I> Extended Memory Start:f0110000 Size:856f0000
[0002.518] I> MB2(TBoot-BPMP) done

It can be seen that after printing "MB2(TBoot-BPMP) done", the system hangs and does not continue to boot.
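One way to see where the boot stalls is to compare the bracketed timestamps of consecutive lines. The sketch below is only an illustration: the function name largest_gap is made up here, and it assumes every line of interest starts with a [ssss.mmm] stamp like the MB2 log above. Applied to the failing log, it would point at the jump from 0000.672 to 0002.518 right before "MB2(TBoot-BPMP) done", whereas the same step takes only a few milliseconds in the successful boot log further down.

```python
import re

# Matches the "[0000.137]" style stamp at the start of MB2/Cboot log lines.
STAMP = re.compile(r"^\[(\d+)\.(\d{3})\]")


def largest_gap(lines):
    """Return (gap_ms, line_before, line_after) for the biggest timestamp jump.

    Lines without a leading stamp are skipped. Times are kept as integer
    milliseconds so the subtraction is exact.
    """
    best = (0, None, None)
    prev_ms = None
    prev_line = None
    for line in lines:
        match = STAMP.match(line)
        if not match:
            continue
        ms = int(match.group(1)) * 1000 + int(match.group(2))
        if prev_ms is not None and ms - prev_ms > best[0]:
            best = (ms - prev_ms, prev_line, line)
        prev_ms, prev_line = ms, line
    return best


if __name__ == "__main__":
    sample = [
        "[0000.667] I> Primary Memory Start:80000000 Size:70000000",
        "[0000.672] I> Extended Memory Start:f0110000 Size:856f0000",
        "[0002.518] I> MB2(TBoot-BPMP) done",
    ]
    gap_ms, before, after = largest_gap(sample)
    print(f"{gap_ms} ms between:\n  {before}\n  {after}")
```

Running the same function over a saved copy of each full log makes it easy to compare a failing boot against a good one.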

For comparison, we then powered off the board that failed to boot and powered it on again. The log from the boot phase is as follows:

[0000.208] I> Welcome to MB2(TBoot-BPMP)(version: 01.00.160913-t186-M-00.00-mobile-e75fdd51)
[0000.216] I> Boot-device: eMMC
[0000.224] I> sdmmc bdev is already initialized
[0000.228] I> pmic: reset reason (nverc)        : 0x50
[0000.261] I> Found 19 partitions in SDMMC_BOOT (instance 3)
[0000.281] I> Found 33 partitions in SDMMC_USER (instance 3)
[0000.287] W> No valid slot number is found in scratch register
[0000.293] W> Return default slot: _a
[0000.296] I> A/B: bin_type (16) slot 0
[0000.300] I> Loading partition bpmp-fw at 0xd7800000
[0000.305] I> Reading two headers - addr:0xd7800000 blocks:1
[0000.310] I> Addr: 0xd7800000, start-block: 29417480, num_blocks: 1
[0000.335] I> Binary(16) of size 534416 is loaded @ 0xd7800000
[0000.340] W> No valid slot number is found in scratch register
[0000.346] W> Return default slot: _a
[0000.349] I> A/B: bin_type (17) slot 0
[0000.353] I> Loading partition bpmp-fw-dtb at 0xd79f0000
[0000.358] I> Reading two headers - addr:0xd79f0000 blocks:1
[0000.364] I> Addr: 0xd79f0000, start-block: 29419896, num_blocks: 1
[0000.383] I> Binary(17) of size 113040 is loaded @ 0xd79e4400
[0000.411] I> Loading SCE-FW ...
[0000.414] W> No valid slot number is found in scratch register
[0000.420] W> Return default slot: _a
[0000.423] I> A/B: bin_type (12) slot 0
[0000.427] I> Loading partition sce-fw at 0xd7300000
[0000.431] I> Reading two headers - addr:0xd7300000 blocks:1
[0000.437] I> Addr: 0xd7300000, start-block: 29423992, num_blocks: 1
[0000.456] I> Binary(12) of size 125632 is loaded @ 0xd7300000
[0000.461] I> Init SCE
[0000.464] I> Loading APE-FW ...
[0000.467] W> No valid slot number is found in scratch register
[0000.473] W> Return default slot: _a
[0000.476] I> A/B: bin_type (11) slot 0
[0000.480] I> Loading partition adsp-fw at 0xd7400000
[0000.484] I> Reading two headers - addr:0xd7400000 blocks:1
[0000.490] I> Addr: 0xd7400000, start-block: 29401096, num_blocks: 1
[0000.509] I> Binary(11) of size 77216 is loaded @ 0xd7400000
[0000.515] I> Copy BTCM section
[0000.518] W> No valid slot number is found in scratch register
[0000.523] W> Return default slot: _a
[0000.527] I> A/B: bin_type (13) slot 0
[0000.530] I> Loading partition cpu-bootloader at 0x96000000
[0000.536] I> Reading two headers - addr:0x96000000 blocks:1
[0000.541] I> Addr: 0x96000000, start-block: 29380616, num_blocks: 1
[0000.562] I> Binary(13) of size 308784 is loaded @ 0x96000000
[0000.568] W> No valid slot number is found in scratch register
[0000.574] W> Return default slot: _a
[0000.577] I> A/B: bin_type (20) slot 0
[0000.581] I> Loading partition bootloader-dtb at 0x85205400
[0000.586] I> Reading two headers - addr:0x85205400 blocks:1
[0000.592] I> Addr: 0x85205400, start-block: 29382664, num_blocks: 1
[0000.611] I> Binary(20) of size 192848 is loaded @ 0x85205400
[0000.617] W> No valid slot number is found in scratch register
[0000.623] W> Return default slot: _a
[0000.626] I> A/B: bin_type (14) slot 0
[0000.630] I> Loading partition secure-os at 0x85305600
[0000.635] I> Reading two headers - addr:0x85305600 blocks:1
[0000.640] I> Addr: 0x85305600, start-block: 29384712, num_blocks: 1
[0000.662] I> Binary(14) of size 402864 is loaded @ 0x85305600
[0000.670] I> TOS boot-params @ 0x85000000
[0000.674] I> TOS params prepared
[0000.677] I> Loading EKS ...
[0000.679] I> A/B: bin_type (15) slot 0
[0000.683] I> Loading partition eks at 0x85905800
[0000.688] I> Reading two headers - addr:0x85905800 blocks:1
[0000.693] I> Addr: 0x85905800, start-block: 29397000, num_blocks: 1
[0000.712] I> Binary(15) of size 1040 is loaded @ 0x85905800
[0000.717] I> EKB detected (length: 0x400) @ 0x85905800
[0000.722] I> Copied encrypted keys
[0000.726] I> boot profiler @ 0x175844000
[0000.730] I> boot profiler for TOS @ 0x175844000
[0000.735] I> Unhalting SCE
[0000.737] I> Primary Memory Start:80000000 Size:70000000
[0000.743] I> Extended Memory Start:f0110000 Size:856f0000
[0000.749] I> MB2(TBoot-BPMP) done

NOTICE:  BL31: v1.3(release):b5eeb33f7
NOTICE:  BL31: Built : 08:55:30, Feb 19 2022
ipc-unittest-main: 1519: Welcome to IPC unittest!!!
ipc-unittest-main: 1531: waiting forever
ipc-unittest-srv: 329: Init unittest services!!!
hwkey-agent: 41: hwkey-agent is running!!
hwkey-agent: 347: key_mgnt_processing .......
hwkey-agent: 255: Setting EKB key 0 to slot 14
hwkey-agent: 178: Init hweky-agent services!!
luks-srv: 40: luks-srv is running!!
luks-srv: 157: Init luks-srv IPC services!!
platform_bootstrap_epilog: trusty bootstrap complete
[0000.923] I> Welcome to Cboot
[0000.925] I> Cboot Version: t186-704e62f2
[0000.929] I> CPU-BL Params @ 0x175800000
[0000.933] I>  0) Base:0x00000000 Size:0x00000000
[0000.937] I>  1) Base:0x177f00000 Size:0x00100000
[0000.942] I>  2) Base:0x177e00000 Size:0x00100000
[0000.946] I>  3) Base:0x177d00000 Size:0x00100000
[0000.951] I>  4) Base:0x177c00000 Size:0x00100000
[0000.955] I>  5) Base:0x177b00000 Size:0x00100000
[0000.960] I>  6) Base:0x177800000 Size:0x00200000
[0000.964] I>  7) Base:0x177400000 Size:0x00400000
[0000.969] I>  8) Base:0x177a00000 Size:0x00100000
[0000.973] I>  9) Base:0x177300000 Size:0x00100000
[0000.978] I> 10) Base:0x176800000 Size:0x00800000
[0000.983] I> 11) Base:0x30000000 Size:0x00040000
[0000.987] I> 12) Base:0xf0000000 Size:0x00100000
[0000.991] I> 13) Base:0x30040000 Size:0x00001000
[0000.996] I> 14) Base:0x30048000 Size:0x00001000
[0001.000] I> 15) Base:0x30049000 Size:0x00001000
[0001.005] I> 16) Base:0x3004a000 Size:0x00001000
[0001.009] I> 17) Base:0x3004b000 Size:0x00001000
[0001.014] I> 18) Base:0x3004c000 Size:0x00001000
[0001.018] I> 19) Base:0x3004d000 Size:0x00001000
[0001.022] I> 20) Base:0x3004e000 Size:0x00001000
[0001.027] I> 21) Base:0x3004f000 Size:0x00001000
[0001.031] I> 22) Base:0x00000000 Size:0x00000000
[0001.036] I> 23) Base:0xf0100000 Size:0x00010000
[0001.040] I> 24) Base:0x00000000 Size:0x00000000
[0001.045] I> 25) Base:0x00000000 Size:0x00000000
[0001.049] I> 26) Base:0x00000000 Size:0x00000000
[0001.053] I> 27) Base:0x00000000 Size:0x00000000
[0001.058] I> 28) Base:0x84400000 Size:0x00400000
[0001.062] I> 29) Base:0x30000000 Size:0x00010000
[0001.067] I> 30) Base:0x178000000 Size:0x08000000
[0001.071] I> 31) Base:0x00000000 Size:0x00000000
[0001.076] I> 32) Base:0x176000000 Size:0x00600000
[0001.080] I> 33) Base:0x80000000 Size:0x70000000
[0001.085] I> 34) Base:0xf0110000 Size:0x856f0000
[0001.089] I> 35) Base:0x00000000 Size:0x00000000
[0001.093] I> 36) Base:0x00000000 Size:0x00000000
[0001.098] I> 37) Base:0x1772e0000 Size:0x00020000
[0001.102] I> 38) Base:0x84000000 Size:0x00400000
[0001.107] I> 39) Base:0x96000000 Size:0x02000000
[0001.111] I> 40) Base:0x85000000 Size:0x01200000
[0001.116] I> 41) Base:0x175800000 Size:0x00500000
[0001.120] I> 42) Base:0x00000000 Size:0x00000000
[0001.125] I> 43) Base:0x00000000 Size:0x00000000
[0001.129] GIC-SPI Target CPU: 4
[0001.132] Interrupts Init done
[0001.136] calling constructors
[0001.139] initializing heap
[0001.142] initializing threads
[0001.145] initializing timers
[0001.148] creating bootstrap completion thread
[0001.153] top of bootstrap2()
[0001.156] CPU: ARM Cortex A57
[0001.159] CPU: MIDR: 0x411FD073, MPIDR: 0x80000100
[0001.164] initializing platform
[0001.168] I> Bl_dtb @0x85205400
[0001.170] I> gpio framework initialized
[0001.176] I> tegrabl_gpio_driver_register: register 'nvidia,tegra186-gpio' driver
[0001.184] I> tegrabl_gpio_driver_register: register 'nvidia,tegra186-gpio-aon' driver
[0001.192] I> GPIO framework and drivers are initialized.
[0001.197] I> Boot-device: eMMC
[0001.204] I> sdmmc bdev is already initialized
[0001.235] I> Found 19 partitions in SDMMC_BOOT (instance 3)
[0001.253] I> Found 33 partitions in SDMMC_USER (instance 3)
[0001.258] W> opt-in fuse is not set, skip fuse_burning
[0001.263] I> Reserved memory at 0xfbe00000 for U-Boot relocation
[0001.269] W> No valid slot number is found in scratch register
[0001.274] W> Return default slot: _a
[0001.284] I> A/B: bin_type (21) slot 0
[0001.287] I> Loading kernel-dtb from partition
[0001.292] I> Loading partition kernel-dtb at 0x80000000 from device(0x1)
[0001.309] I> Kernel_dtb @0x80000000
[0001.312] I> tegrabl_tca9539_init: i2c bus: 0, slave addr: 0xee
[0001.320] W> fetch_driver_phandle_from_dt: failed to get node with compatible ti,tca9539
[0001.330] W> fetch_driver_phandle_from_dt: failed to get node with compatible nxp,tca9539
[0001.338] W> tegrabl_tca9539_init: failed to fetch phandle from dt
[0001.344] I> tegrabl_tca9539_init: i2c bus: 0, slave addr: 0xe8
[0001.352] W> fetch_driver_phandle_from_dt: failed to get node with compatible ti,tca9539
[0001.362] W> fetch_driver_phandle_from_dt: failed to get node with compatible nxp,tca9539
[0001.370] W> tegrabl_tca9539_init: failed to fetch phandle from dt
[0001.378] I> fixed regulator driver initialized
[0001.400] I> register 'maxim' power off handle
[0001.405] I> virtual i2c enabled
[0001.408] I> registered 'maxim,max77620' pmic
[0001.412] I> tegrabl_gpio_driver_register: register 'max77620-gpio' driver
[0001.422] E> failed to read label property for node 149556: 13
[0001.429] E> failed to read reg property for node 149652: 13
[0001.436] E> failed to read reg property for node 149704: 13
[0001.443] E> failed to read label property for node 149788: 13
[0001.449] E> failed to read reg property for node 149856: 13
[0001.456] E> failed to read reg property for node 149928: 13
[0001.464] I> Find /i2c@c250000's alias i2c7
[0001.468] I> Reading eeprom i2c=7 address=0x50
[0001.497] I> Device at /i2c@c250000:0x50
[0001.501] I> Reading eeprom i2c=7 address=0x57
[0001.505] E> I2C: slave not found in slaves.
[0001.509] E> I2C: Could not write 0 bytes to slave: 0x00ae with repeat start true.
[0001.517] E> I2C_DEV: Failed to send register address 0x00000000.
[0001.523] E> I2C_DEV: Could not read 256 registers of size 1 from slave 0xae at 0x00000000 via instance 7.
[0001.532] E> eeprom: Failed to read I2C slave device
[0001.537] I> Eeprom read failed 0x3526070d
[0001.541] I> Find /i2c@3160000's alias i2c0
[0001.545] I> Reading eeprom i2c=0 address=0x50
[0001.550] E> I2C: slave not found in slaves.
[0001.554] E> I2C: Could not write 0 bytes to slave: 0x00a0 with repeat start true.
[0001.562] E> I2C_DEV: Failed to send register address 0x00000000.
[0001.568] E> I2C_DEV: Could not read 256 registers of size 1 from slave 0xa0 at 0x00000000 via instance 0.
[0001.577] E> eeprom: Failed to read I2C slave device
[0001.582] I> Eeprom read failed 0x3526070d
[0001.587] I> Find /i2c@3180000's alias i2c2
[0001.591] I> Reading eeprom i2c=2 address=0x54
[0001.595] I> Enabling gpio chip_id = 2, gpio pin = 9
[0001.600] C> GPIO driver for chip_id 0x2 could not be found
[0001.605] E> cam_eeprom_read: Can't get gpio driver
[0001.610] I> Eeprom read failed 0x4d4d000d
[0001.614] I> Reading eeprom i2c=2 address=0x57
[0001.618] I> Enabling gpio chip_id = 2, gpio pin = 9
[0001.623] C> GPIO driver for chip_id 0x2 could not be found
[0001.628] E> cam_eeprom_read: Can't get gpio driver
[0001.633] I> Eeprom read failed 0x4d4d000d
[0001.637] I> create_pm_ids: id: 3636-0001-301-D, len: 15
[0001.642] I> config: mem-type:00,power-config:00,misc-config:00,modem-config:00,touch-config:00,display-config:00,, len: 93
[0001.667] I> regulator 'vdd-hdmi-5v0' already enabled
[0001.678] I> regulator 'vdd-hdmi-5v0' already enabled
[0001.683] I> hdmi cable connected
[0001.687] I> setting 'vdd-pex-1v00' regulator to 1000000 micro volts
[0001.695] I> setting 'vdd-1v8' regulator to 1800000 micro volts
[0001.701] I> retrieved tmds range from prod_list_hdmi_soc
[0001.707] E> cannot find any other nvdisp nodes
[0001.727] I> edid read success
[0001.742] I> edid read success
[0001.745] I> width = 640, height = 480, frequency = 25174825
[0001.750] I> width = 640, height = 480, frequency = 25174825
[0001.756] I> width = 640, height = 480, frequency = 25174825
[0001.761] I> width = 640, height = 480, frequency = 25174825
[0001.767] I> width = 1920, height = 1080, frequency = 148500000
[0001.773] I> width = 720, height = 480, frequency = 27000000
[0001.778] I> width = 1920, height = 1080, frequency = 148351648
[0001.784] I> width = 1920, height = 1080, frequency = 148351648
[0001.790] I> width = 1280, height = 720, frequency = 74175824
[0001.795] I> width = 1280, height = 720, frequency = 74175824
[0001.801] I> width = 720, height = 480, frequency = 26973026
[0001.806] I> width = 720, height = 576, frequency = 26973026
[0001.812] I> width = 720, height = 480, frequency = 26973026
[0001.817] I> width = 720, height = 576, frequency = 26973026
[0001.823] I> width = 640, height = 480, frequency = 25174825
[0001.828] I> Best mode Width = 1920, Height = 1080, freq = 148351648
[0001.837] I> hdmi_enable, starting HDMI initialisation
[0001.845] I> hdmi_enable, HDMI initialisation complete
[0001.858] initializing target
[0001.861] calling apps_init()
[0001.864] starting app kernel_boot_app
[0001.886] I> found decompressor handler: lz4-legacy
[0001.891] I> decompressing BMP blob ...
[0001.903] I> Kernel type = Normal
[0001.906] I> ########## Fixed storage boot ##########
[0001.911] I> Loading kernel-bootctrl from partition
[0001.915] I> Loading partition kernel-bootctrl at 0xa8000000 from device(0x1)
[0001.930] W> tegrabl_get_kernel_bootctrl: magic number(0x00000000) is invalid
[0001.937] W> tegrabl_get_kernel_bootctrl: use default dummy boot control data
[0001.944] W> No valid slot number is found in scratch register
[0001.949] W> Return default slot: _a
[0001.953] I> A/B: bin_type (24) slot 0
[0001.968] I> Boot image size read from image header: 99035
[0001.974] I> Boot image load address: 0x80400000
[0001.978] I> Loading kernel from partition
[0001.982] I> Loading partition kernel at 0x80400000 from device(0x1)
[0002.925] I> Validate kernel ...
[0002.928] I> T18x: Authenticate kernel (bin_type 24), max size 0x4000000
[0002.936] I> Decrypt the buffer ... [0002.939] W> tegrabl_decrypt_block: fuse (0x0) is not burnt to do encryption (0x4); skip decryption.
[0002.948] I> done
[0002.950] I> Checking boot.img header magic ... [0002.954] I> [OK]
[0002.956] I> kernel-dtb is already loaded
[0002.960] I> Validate kernel-dtb ...
[0002.963] I> T18x: Authenticate kernel-dtb (bin_type 21), max size 0x100000
[0002.970] I> Decrypt the buffer ... [0002.973] W> tegrabl_decrypt_block: fuse (0x0) is not burnt to do encryption (0x4); skip decryption.
[0002.982] I> done
[0002.984] I> Kernel hdr @0x80400000
[0002.987] I> Kernel dtb @0x80000000
[0002.991] I> decompressor handler not found
[0002.995] I> Copying kernel image (626741 bytes) from 0x80400800 to 0x80600000 ... [0003.002] I> Done
[0003.004] I> Move ramdisk (len: 0) from 0x8049a000 to 0x947d0000
[0003.011] I> Updated bpmp info to DTB
[0003.016] I> Ramdisk: Base: 0x947d0000; Size: 0x0
[0003.020] I> Updated initrd info to DTB
[0003.024] W> WARN: Fail to override "console=none" in commandline
[0003.030] I> Active rootfs suffix:
[0003.033] E> tegrabl_linuxboot_add_disp_param, du 1 failed to get display params
[0003.041] E> tegrabl_linuxboot_add_disp_param, du 1 failed to get display params
[0003.048] I> disabled_core_mask: 0xffffff0c
[0003.052] W> No valid slot number is found in scratch register
[0003.057] W> Return default slot: _a
[0003.061] I> Active slot suffix:
[0003.064] I> add_boot_slot_suffix: slot_suffix =
[0003.068] I> Linux Cmdline: console=ttyS0,115200 root=/dev/mmcblk0p1 rw rootwait rootfstype=ext4 console=ttyS0,115200n8 console=tty0 fbcon=map:0 net.ifnames=0 isolcpus=1-2  video=tegrafb earlycon=uart8250,mmio32,0x3100000 nvdumper_reserved=0x1772e0000 gpt rootfs.slot_suffix= tegra_fbmem=0x800000@0x96085000 lut_mem=0x2008@0x96081000 usbcore.old_scheme_first=1 tegraid=18.1.2.0.0 maxcpus=6 no_console_suspend boot.slot_suffix= boot.ratchetvalues=0.2031647.1 vpr_resize bl_prof_dataptr=0x10000@0x175840000 sdhci_tegra.en_boot_part_access=1
[0003.116] I> Updated bootarg info to DTB
[0003.119] W> MAC addr invalid!
[0003.122] E> Failed to get WIFI MAC address
[0003.126] W> MAC addr invalid!
[0003.129] E> Failed to get Bluetooth MAC address
[0003.133] I> eeprom_get_mac_addr: MAC (type: 2): 48:b0:2d:88:6d:06
[0003.140] E> Found no plugin manager ids in source DT
[0003.144] W> Add plugin manager ids from board info
[0003.149] W> "plugin-manager" doesn't exist, creating
[0003.154] W> "ids" doesn't exist, creating
[0003.158] W> "connection" doesn't exist, creating
[0003.163] W> "configs" doesn't exist, creating
[0003.167] I> create_pm_ids: id: 3636-0001-301-D, len: 15
[0003.172] I> config: mem-type:00,power-config:00,misc-config:00,modem-config:00,touch-config:00,display-config:00,, len: 93
[0003.183] I> Adding plugin-manager/ids/3636-0001-301=/i2c@c250000:module@0x50
[0003.190] W> "i2c@c250000" doesn't exist, creating
[0003.195] W> "module@0x50" doesn't exist, creating
[0003.201] I> Adding plugin-manager/ids/3636-0001-301-D
[0003.207] I> Adding plugin-manager/configs/3636-mem-type 00
[0003.212] I> Adding plugin-manager/configs/3636-power-config 00
[0003.218] I> Adding plugin-manager/configs/3636-misc-config 00
[0003.224] I> Adding plugin-manager/configs/3636-modem-config 00
[0003.230] I> Adding plugin-manager/configs/3636-touch-config 00
[0003.236] I> Adding plugin-manager/configs/3636-display-config 00
[0003.242] I> Adding plugin-manager/cvm
[0003.246] W> "chip-id" doesn't exist, creating
[0003.250] I> Adding plugin-manager/chip-id/A02P
[0003.254] W> "odm-data" doesn't exist, creating
[0003.259] I> Adding /chosen/plugin-manager/odm-data
[0003.267] I> added [base:0x80000000, size:0x70000000] to /memory
[0003.273] I> added [base:0xf0200000, size:0x85600000] to /memory
[0003.279] I> added [base:0x175e00000, size:0x200000] to /memory
[0003.284] I> added [base:0x176600000, size:0x200000] to /memory
[0003.290] I> added [base:0x177000000, size:0x200000] to /memory
[0003.296] I> Updated memory info to DTB
[0003.300] E> add_disp_param: failed to get display params for du=1
[0003.307] W> "reset" doesn't exist, creating
[0003.311] W> "pmc-reset-reason" doesn't exist, creating
[0003.317] W> "pmic-reset-reason" doesn't exist, creating
[0003.323] I> Adding ecid(0000000164461204140000000bfe8340) to DT
[0003.328] I> disabled_core_mask: 0xffffff0c
[0003.337] I> Add serial number:1422822077537 as DT property
[0003.344] I> Plugin-manager override starting
[0003.349] I> node /plugin-manager/fragement@0 matches
[0003.357] I> node /plugin-manager/fragement@3 matches
[0003.370] I> Disable plugin-manager status in FDT
[0003.375] I> Plugin-manager override finished successfully
[0003.380] I> tegrabl_load_kernel_and_dtb: Done
[0003.425] I> Kernel EP: 0x80600000, DTB: 0x80000000


U-Boot 2020.04-g4335beb692 (Feb 19 2022 - 08:55:34 -0800)

SoC: tegra186
Model: NVIDIA P3636-0001
Board: NVIDIA P3636-0001
DRAM:  3.8 GiB
MMC:   sdhci@3400000: 1, sdhci@3460000: 0
Loading Environment from MMC... *** Warning - bad CRC, using default environment

In:    serial
Out:   serial
Err:   serial
Net:
Warning: ethernet@2490000 using MAC address from ROM
eth0: ethernet@2490000
Hit any key to stop autoboot:  0
Card did not respond to voltage select!
switch to partitions #0, OK
mmc0(part 0) is current device
Scanning mmc 0:1...
Found /boot/extlinux/extlinux.conf
Retrieving file: /boot/extlinux/extlinux.conf
1402 bytes read in 30 ms (44.9 KiB/s)
L4T boot options
1:      Primary kernel
2:      SSD primary kernel
3:      SD primary kernel
Enter choice: 2:        SSD primary kernel
Retrieving file: /boot/initrd
7238358 bytes read in 201 ms (34.3 MiB/s)
Retrieving file: /boot/Image
34680840 bytes read in 841 ms (39.3 MiB/s)
append: console=ttyS0,115200 root=/dev/mmcblk0p1 rw rootwait rootfstype=ext4 console=ttyS0,115200n8 console=tty0 fbcon=map:0 net.ifnames=0 isolcpus=1-2  video=tegrafb earlycon=uart8250,mmio32,0x3100000 nvdumper_reserved=0x1772e0000 gpt rootfs.slot_suffix= tegra_fbmem=0x800000@0x96085000 lut_mem=0x2008@0x96081000 usbcore.old_scheme_first=1 tegraid=18.1.2.0.0 maxcpus=6 no_console_suspend boot.slot_suffix= boot.ratchetvalues=0.2031647.1 vpr_resize bl_prof_dataptr=0x10000@0x175840000 sdhci_tegra.en_boot_part_access=1 quiet root=/dev/nvme0n1 rw rootwait rootfstype=ext4 console=ttyS0,115200n8 console=tty0 fbcon=map:0 net.ifnames=0 isolcpus=11-12
## Flattened Device Tree blob at 80000000
   Booting using the fdt blob at 0x80000000
ERROR: reserving fdt memory region failed (addr=0 size=0)
ERROR: reserving fdt memory region failed (addr=0 size=0)
   Using Device Tree in place at 0000000080000000, end 00000000800340eb
copying carveout for /host1x@13e00000/display-hub@15200000/display@15200000...
copying carveout for /host1x@13e00000/display-hub@15200000/display@15210000...
copying carveout for /host1x@13e00000/display-hub@15200000/display@15220000...

Starting kernel ...

[    0.000000] Booting Linux on physical CPU 0x100
[    0.000000] Linux version 4.9.253-tegra (paul@paul-IdeaCentre-GeekPro-14AMR) (gcc version 7.3.1 20180425 [linaro-7.3-2018.05 revision d29120a424ecfbc167ef90065c0eeb7f91977701] (Linaro GCC 7.3-2018.05) ) #10 SMP PREEMPT Wed Nov 2 18:40:46 CST 2022
[    0.000000] Boot CPU: AArch64 Processor [411fd073]
[    0.000000] OF: fdt:memory scan node memory@80000000, reg size 80,
[    0.000000] OF: fdt: - 80000000 ,  70000000
[    0.000000] OF: fdt: - f0200000 ,  85600000
[    0.000000] OF: fdt: - 175e00000 ,  200000
[    0.000000] OF: fdt: - 176600000 ,  200000
[    0.000000] OF: fdt: - 177000000 ,  200000
[    0.000000] earlycon: uart8250 at MMIO32 0x0000000003100000 (options '')
[    0.000000] bootconsole [uart8250] enabled
[    0.000000] Found tegra_fbmem: 00800000@96085000
[    0.000000] Found lut_mem: 00002008@96081000
[    0.455819] arm__alloc_iova_at():136: iova alloc don't match, dh=0x0000000096085000, da=0x0000000095881000
[    0.455844] arm__alloc_iova_at():136: iova alloc don't match, dh=0x0000000096081000, da=0x000000009607e000
[    3.363982] ### TEST123 in imx219_probe.
[    3.397093] ### TEST123 in imx219_probe.
[    3.426950] imx219 10-0010: imx219_board_setup: error during i2c read probe (-121)
[    3.435558] imx219 10-0010: board setup failed
[    4.751321] cgroup: cgroup2: unknown option "nsdelegate"
[    5.495647] EXT4-fs error (device nvme0n1): ext4_lookup:1599: inode #7602177: comm systemd-tmpfile: iget: checksum invalid
[    5.508145] EXT4-fs error (device nvme0n1): ext4_lookup:1599: inode #7602177: comm systemd-tmpfile: iget: checksum invalid
[    6.236142] ------------[ cut here ]------------
[    6.242158] WARNING: CPU: 0 PID: 3345 at /dvs/git/dirty/git-master_linux/kernel/kernel-4.9/drivers/spi/spidev.c:767 0xffffff80011772a8
[    6.242224] ---[ end trace 87aa0a5f032d16de ]---
[    6.243162] ------------[ cut here ]------------
[    6.243169] WARNING: CPU: 0 PID: 3345 at /dvs/git/dirty/git-master_linux/kernel/kernel-4.9/drivers/spi/spidev.c:767 0xffffff80011772a8
[    6.243222] ---[ end trace 87aa0a5f032d16df ]---
[    6.261989] gevfilter: loading out-of-tree module taints kernel.
[    6.384033] using random self ethernet address
[    6.449809] using random host ethernet address
[    6.958364] using random self ethernet address
[    6.970216] using random host ethernet address
[   18.504516] Bridge firewalling registered

The above is the log of a successful boot. Comparing the two logs, I have the following questions:

  1. Why does the boot fail, and what are the possible causes?
  2. When the boot fails, the log shows pmic: reset reason (nverc): 0x0; when the boot succeeds, it shows pmic: reset reason (nverc): 0x50. What do the values 0x0 and 0x50 represent?

Looking forward to hearing back, thank you very much!

Hi shohokuooo,

Are you using the devkit or a custom board for TX2?
Which JetPack version is in use?

Could you share the output of the following commands on your board?

$ cat /proc/device-tree/chosen/reset/pmic-reset-reason/reason 
$ cat /proc/device-tree/chosen/reset/pmic-reset-reason/register-value 
$ cat /proc/device-tree/chosen/reset/pmc-reset-reason/reset-source 
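If it is easier to collect these values from a script (for example at every boot, so they are not lost when the issue recurs), the three reads can be sketched as below. This is only an illustration: the helper name read_reset_info is made up here, and it assumes the three properties are plain strings, as the cat commands above imply. Device-tree string properties end with a NUL byte, so the helper strips it.

```python
from pathlib import Path

# Directory and property files taken from the three cat commands above.
RESET_DIR = Path("/proc/device-tree/chosen/reset")
FILES = (
    "pmic-reset-reason/reason",
    "pmic-reset-reason/register-value",
    "pmc-reset-reason/reset-source",
)


def read_reset_info(base=RESET_DIR):
    """Return {relative_path: text or None} for the reset-reason properties."""
    info = {}
    for rel in FILES:
        try:
            raw = (Path(base) / rel).read_bytes()
        except OSError:
            # Property missing or unreadable on this system.
            info[rel] = None
            continue
        # Drop the trailing NUL terminator of the device-tree string.
        info[rel] = raw.rstrip(b"\x00").decode("ascii", errors="replace")
    return info


if __name__ == "__main__":
    for name, value in read_reset_info().items():
        print(f"{name}: {value}")
```

Appending the printed lines to a file on each boot would keep a history of reset reasons across the unexpected restarts.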

It is true that we are not using the devkit board, and the JetPack version is 4.6.1 rather than the latest. At present we are still trying to find the minimal setup that reproduces the problem. Once the problem recurs, we will reply with the contents of the three files above.
In addition, can you tell me what could cause the boot to stop after printing "MB2(TBoot-BPMP) done", without going on to print "NOTICE: BL31: v1.3(release):b5eeb33f7"? Thanks.

You could use the latest JP4.6.4 (R32.7.4) to verify.

What’s the fail rate with this unexpected reset issue?
Do you run any program on your board when you hit this issue?

After MB2 done, BL31/Cboot should be loaded and keep booting.

What’s the fail rate with this unexpected reset issue?
Do you run any program on your board when you hit this issue?

So far we have found that running specific computing tasks reproduces this problem, after about 2 hours of operation. The same tasks run fine on other boards.

After MB2 done, BL31/Cboot should be loaded and keep booting.

What could prevent BL31/Cboot from continuing to boot?

$ cat /proc/device-tree/chosen/reset/pmic-reset-reason/reason
$ cat /proc/device-tree/chosen/reset/pmic-reset-reason/register-value
$ cat /proc/device-tree/chosen/reset/pmc-reset-reason/reset-source

reason: NIL_OR_MORE_THAN_1_BIT
register-value: 0x50
reset-source: SYS_RESET_N
The contents of the above files were read after an unexpected restart that was followed by a boot failure of unknown cause; the board had to be powered off and on again to enter the system.
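As a small aid for reading the register value: 0x50 has two bits set, which would be consistent with the reported reason string NIL_OR_MORE_THAN_1_BIT, and 0x0 has none set. The sketch below only lists which bit positions are set; it does not claim what each bit means, since that mapping comes from the PMIC documentation and is not given in this thread. The function name set_bits is made up for illustration.

```python
def set_bits(value):
    """Return the positions (0 = least significant) of the bits set in value."""
    return [pos for pos in range(value.bit_length()) if (value >> pos) & 1]


if __name__ == "__main__":
    for reg in (0x0, 0x50):
        print(f"{reg:#04x} -> bits set: {set_bits(reg)}")
```

For 0x50 this gives bits 4 and 6; for 0x0 the list is empty.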

Do you run any program on your board when you hit this issue?

I would like to add that with only an iperf traffic-flooding test running, the problem was reproduced in less than an hour. Earlier we also tried running no programs at all, and the system still worked normally more than ten hours after startup.

Are there any serial console logs from when the issue occurs?
Is it a reboot triggered by software or a reset triggered by hardware?

Do you mean only one board would hit this issue?

I have no idea about this at the moment. It’s more like a hardware issue if you hang at the end of MB2.

Are there any serial console logs from when the issue occurs?

There are no particularly obvious errors in the console logs. After the system boots, I can see some ext4 filesystem error messages. Because we are using an SSD, these messages seem to be within the normal range. If we continue to run other programs afterwards, the error is not printed again.

Is it a reboot triggered by software or a reset triggered by hardware?

I don’t think it’s a software-induced reboot, since we haven’t written anything like that in our own code. As for whether it is caused by hardware, we are also doing cross-testing. Yesterday, when we used the stress tool to test, we found that the SSD entered a read-only mode, so now we suspect that it is caused by the SSD. We are currently testing with another SSD to see if the problem will recur.

Do you mean only one board would hit this issue?

Yes, only one board has this problem so far.

I have no idea about this at the moment. It’s more like a hardware issue if you hang at the end of MB2.

Thanks for your reply, it seems this problem is not common.

To add: another abnormal restart occurred just now. While monitoring with dmesg -wH, I saw the following logs before the restart:

[  316.490013] EXT4-fs (nvme0n1): error count since last fsck: 12
[  316.496132] EXT4-fs (nvme0n1): initial error at time 166631408
[  316.504978] EXT4-fs (nvme0n1): last error at time 1689816795: 
[Jul20 09:38] EXT4-fs (nvme0n1): error count since last fsck: 125
[  +0.006119] EXT4-fs (nvme0n1): initial error at time 1666314086
[  +0.008846] EXT4-fs (nvme0n1): last error at time 1689816795: e


[  954.989932] EXT4-fs error (device nvme0n1): ext4_lookup:1599: inode #7602177: comm systemd-tmpfile: iget: checksum invalid
[Jul20 09:49] EXT4-fs error (device nvme0n1): ext4_lookup:1599: inode #7602177: comm systemd-tmpfile: iget: checksum invalid

[0000.137] I> Welcome to MB2(TBoot-BPMP)(version: 01.00.160913-t186-M-00.00-mobile-e75fdd51)
...
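The "initial error at time" and "last error at time" values in the EXT4 lines above look like Unix epoch seconds. Assuming that is what they are, converting them shows when the filesystem first recorded an error and when it last did. The sketch below is an illustration only (the helper name epoch_to_utc is made up here); it uses the two complete values from the log, 1666314086 and 1689816795.

```python
from datetime import datetime, timezone


def epoch_to_utc(seconds):
    """Format Unix epoch seconds as a UTC date and time string."""
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


if __name__ == "__main__":
    print("initial error:", epoch_to_utc(1666314086), "UTC")
    print("last error:   ", epoch_to_utc(1689816795), "UTC")
```

Under that assumption the first recorded error dates from 2022-10-21 and the last from 2023-07-20, so the filesystem has been carrying errors since well before the current round of testing.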

Since the SSD has already been replaced once, it seems that some other hardware is causing the SSD to misbehave. It may be the core board or the carrier board. I will continue to verify this on my side.

Do you mean that you are hot-swapping the NVMe SSD and that it triggers a reset?

I don’t know the original cause, but at this point the filesystem is too corrupt to repair. If there was any error on the original system from which another drive was cloned, then it would be part of the clone.

Note that if you are using two different disks, and if there is some difference in boot content pointing into those disks, then this might cause issues. If using an initrd, then any changes you made for switching disks might need changes to the initrd as well. For example, if a UUID or PARTUUID is used for mounting, and if the two disks differ in any way, then that might be a problem. I don’t know what is in the initrd, but things are more complicated in that case. There is a very low probability that the hardware itself had anything to do with the ext4 corruption, but that much corruption is definitely not normal. The best you could do with that is to clone it to a host PC and investigate the “lost+found/” fragments.

I don’t think so. Judging from the operating conditions at the time, memory was sufficient and no large-scale swapping was needed.

Thank you for your reply. I don’t care about the files on the SSD, and I won’t clone this SSD to another one. The current situation is that the core board powers on normally, the system starts normally, and it can run tasks. However, after running for a while, an abnormal restart occurs. Before the restart, EXT4 filesystem errors appear in the UART log, and the subsequent reboot then fails after "MB2(TBoot-BPMP) done".
If I plug another core board into this carrier board and still use this SSD, the problem does not occur.

In that case you have to figure out why it is failing to properly close the filesystem before shutting down (e.g., power failure). I am reminded of the old joke: “The guy goes to the doctor, and says ‘it hurts when I do this’. The doctor replies, ‘then don’t do that’.”

In this case there is some truth in the joke. Having a detailed serial console log when the shutdown occurs might be a way to find out why the filesystem is not properly flushed, set to read-only, and unmounted (e.g., perhaps there is a hung process keeping the filesystem from unmounting, although this would not stop flush).

