Jetson Nano boot process freezes randomly during reboot stress test

JJ.C · March 4, 2022, 3:27am

Hi, we are developing our own carrier board, and during reboot stress testing we sometimes hit a boot hang. Our carrier board design is based on the Xavier devkit (P3509-0000). To rule out our own software changes as the cause, we built a test image from the sources downloaded with SDK Manager (JetPack 4.6 rev3) and ran the reboot stress test at the same time on three Jetson Nano eMMC modules (P3448-0002) installed on our carrier boards. The test image contains only the BSP, without any Jetson SDK components installed (e.g., CUDA, containers). We used the following commands to create and flash the eMMC image, completed the initial setup wizard after flashing, and then installed our tool to run the reboot stress test. The steps were as follows:

sudo ./apply_binaries.sh
sudo BOARDID=3448 BOARDSKU=0002 FAB=300 FUSELEVEL=fuselevel_production ./nvmassflashgen.sh jetson-nano-emmc mmcblk0p1
cd bootloader/mfi_jetson-nano-emmc/ && sudo ./nvmflash.sh --showlogs
Plug in an HDMI monitor, boot the device, and finish the setup wizard.
Remove "quiet" from /boot/extlinux/extlinux.conf to enable more console logging.
Set up our reboot stress tool:

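a. Create reboot_test.sh:

#!/bin/bash
# Reboot stress test: count each boot in /etc/reboot_times.txt, then reboot.
# bash (not sh) is required for the (( )) arithmetic below.

# Read the counter; the file holds a single line "reboot_times = N".
# On the first run the file does not exist yet, so default to 0.
times=$(awk '/reboot_times/ {print $3}' /etc/reboot_times.txt 2>/dev/null)
times=${times:-0}

case "$1" in
  start)
    ((times+=1))
    sleep 20    # wait 20 s after boot before triggering the next reboot
    echo "reboot_times = $times" | sudo tee /etc/reboot_times.txt
    systemctl reboot
    ;;
  stop)
    echo "Stopping reboot_test"
    ;;
  *)
    echo "Usage: /etc/init.d/reboot_test.sh {start|stop}"
    exit 1
    ;;
esac
exit 0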

b. Copy reboot_test.sh to /etc/init.d/.
c. Register and enable the script, then reboot:
sudo update-rc.d reboot_test.sh defaults
sudo update-rc.d reboot_test.sh enable
sync
sleep 5
sudo systemctl reboot

Start the reboot stress test.
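The manual preparation steps above (dropping "quiet" from extlinux.conf and installing the init script) could be scripted so every board is prepared identically. This is a minimal sketch, not the procedure used in this thread: the function names are hypothetical, it assumes the stock L4T extlinux.conf whose APPEND line contains the token "quiet", and it uses `install -m 0755` because an init script must be executable, which the steps above do not state explicitly.

```shell
#!/bin/bash
# Sketch only: prepare a freshly flashed board for the reboot stress test.
# Run as root on the target. Assumes the stock L4T extlinux.conf, whose
# APPEND line contains the token "quiet". Function names are hypothetical.

# Remove the "quiet" token from every APPEND line, keeping a .bak copy.
drop_quiet() {
    local conf="$1"
    cp "$conf" "$conf.bak"
    sed -i -E '/^[[:space:]]*APPEND/ s/ quiet( |$)/\1/' "$conf"
}

# Install reboot_test.sh from the current directory as an executable
# SysV init script and enable it (same update-rc.d calls as the steps above).
install_reboot_test() {
    install -m 0755 reboot_test.sh /etc/init.d/reboot_test.sh
    update-rc.d reboot_test.sh defaults
    update-rc.d reboot_test.sh enable
    sync
}
```

Usage (as root): `drop_quiet /boot/extlinux/extlinux.conf && install_reboot_test && systemctl reboot`. Keeping the .bak copy makes it easy to restore the original boot arguments after the test.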

We found that one device (SOM#21) froze during boot at the 127th reboot, while the other modules ran more than 1000 reboots. After we powered off that device (SOM#21) and ran the reboot test again, it completed more than 1000 reboots. I think this issue happens randomly. The relevant log is below, and the full console log is also attached:

[    3.599483] [drm] Initialized
[    3.626027] brd: module loaded
[    3.629840] tegradc tegradc.0: fb registered
[    3.637263] loop: module loaded
[    3.640560] tegra_profiler: version: 1.145, samples/io: 49/28
[    3.640964] tegradc tegradc.0: DC initialized, skipping tegra_dc_program_mode.
[    3.641010] tegradc tegradc.0: hdmi: tmds rate:148351K prod-setting:prod_c_hdmi_75m_150m
[    3.641511] tegradc tegradc.0: hdmi: get RGB quant from REG programmed by BL.
[    3.641520] tegradc tegradc.0: hdmi: get YCC quant from REG programmed by BL.
[    3.667646] extcon-disp-state extcon:disp-state: cable 47 state 1
[    3.667650] Extcon AUX1(HDMI) enable
[    3.668661] tegradc tegradc.1: disp1 connected to head1->/host1x/sor
[    3.668737] tegradc tegradc.1: No lt-data, using default setting
[    3.668763] tegradc tegradc.1: No hpd-gpio in DT
[    3.668795] tegradc tegradc.1: DT parsed successfully
[    3.668853] tegradc tegradc.1: Display dc.ffffff800d540000 registered with id=1
[    3.670293] tegradc tegradc.1: dpd enable lookup fail:-19
[    3.674675] tegradc tegradc.1: probed
[    3.696525] tegradc tegradc.0: nominal-pclk:148351648 parent:148350781 div:1.0 pclk:148350781 146868084~161703244
[    3.734482] tegradc tegradc.1: fb registered
[    3.734574] tegra_profiler: auth: init
[    3.735161] THERMAL EST: found 2 subdevs
[    3.735165] THERMAL EST num_resources: 0
[    3.735169] [THERMAL EST subdev 0]
[    3.735173] [THERMAL EST subdev 1]
[    3.735497] thermal thermal_zone5: Registering thermal zone thermal_zone5 for type thermal-fan-est
[    3.735499] THERMAL EST: thz register success.
[    3.735610] THERMAL EST: end of probe, return err: 0
[    3.736312] sd: No Scsi addr parsed to reserve index
[    3.736338] hisi_sas: driver version v1.6
[    3.745682] libphy: Fixed MDIO Bus: probed
[    3.746247] tun: Universal TUN/TAP

l4t32.6.1.log (10.2 MB)
Do you have any idea how to determine whether this issue depends on the SoM, the carrier board, the software, or the thermal policy? Or do you have any suggestions for debugging it?
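For the thermal part of the question, one cheap data point is to record every thermal zone's temperature on each boot, so a hang can later be compared with the temperatures of the boots that preceded it. A sketch using the generic Linux thermal sysfs interface (temperatures are reported in millidegrees Celsius); the function name and the log path in the usage line are hypothetical.

```shell
#!/bin/bash
# Sketch: print "zone type temperature" for every readable thermal zone.
# Uses the standard Linux thermal sysfs interface; temp is in millidegrees C.
# The base directory can be overridden (e.g. for testing).
log_thermal() {
    local base="${1:-/sys/class/thermal}"
    local zone temp
    for zone in "$base"/thermal_zone*; do
        # Some zones may not be readable; skip them instead of failing.
        temp=$(cat "$zone/temp" 2>/dev/null) || continue
        printf '%s %s %s\n' "$(basename "$zone")" \
            "$(cat "$zone/type" 2>/dev/null)" "$temp"
    done
}
```

If the function is defined inside reboot_test.sh, a line such as `log_thermal | sed "s/^/boot $times /" >> /var/log/reboot_thermal.log` in the start branch, before `systemctl reboot`, would tag each reading with the reboot counter.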

WayneWWW · March 7, 2022, 3:05am

1. Do an A/B test with the same module on the devkit. If the devkit does not show the issue, the problem may be on your carrier board.
2. Check whether the system hangs at the same lines each time.
3. Remove all peripherals from the board and see whether the issue goes away. If it does, add the peripherals back one by one to see which one causes the problem.
4. I am not sure why you want to use massflash directly. Maybe just use flash.sh for the test first. When debugging a case like this, the simpler option is always the better one.
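For point 2, if the serial console is captured continuously over the whole stress test, the last kernel line of each hung boot can be extracted automatically and compared across hangs. A sketch under these assumptions: each boot starts with the arm64 kernel banner "Booting Linux on physical CPU", a clean reboot prints "reboot: Restarting system", and kernel lines begin with the usual "[ seconds ]" timestamp. The function name is hypothetical; adjust the patterns if your capture tool adds its own prefixes.

```shell
#!/bin/bash
# Sketch: report the last kernel-timestamped line of every boot in a
# captured console log that never reached "reboot: Restarting system".
find_hangs() {
    awk '
        { sub(/\r$/, "") }                      # tolerate CRLF line endings
        /Booting Linux on physical CPU/ {
            # A new boot begins: report the previous boot if it never restarted.
            if (boot > 0 && !restarted)
                printf "boot %d hung after: %s\n", boot, last
            boot++
            restarted = 0
        }
        /reboot: Restarting system/ { restarted = 1 }
        /^\[ *[0-9]+\.[0-9]+]/ { last = $0 }    # last kernel-timestamped line
        END {
            if (boot > 0 && !restarted)
                printf "boot %d (end of log) last line: %s\n", boot, last
        }
    ' "$1"
}
```

Only timestamped kernel lines are remembered, so bootloader output printed after a power cycle does not mask the line where the previous boot stopped. If every reported hang ends at the same driver message, that driver is a good first suspect.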

JJ.C · March 7, 2022, 3:52am

Hi Wayne,

Regarding your points:
1. We will try it.
3. We have already removed all peripherals from the board.
4. The only way I know is to use nvmassflashgen.sh to generate a compressed eMMC image file, which we can easily release to the factory for flashing devices. May I ask why using flash.sh would help debug this case?

WayneWWW · March 7, 2022, 4:13am

I am not saying that massflash definitely has a bug.
This is just to rule out any possible bug in nvmassflash. flash.sh is more commonly used, and debugging a case like this does not require massflash.
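For reference, flashing a single module with flash.sh is one command run from the Linux_for_Tegra directory, using the same board config and root device as the massflash command above. A minimal sketch, assuming apply_binaries.sh has already been run and exactly one module is connected over USB in forced-recovery mode; the wrapper function name is hypothetical. flash.sh normally reads the board information from the connected module, so the BOARDID/BOARDSKU/FAB variables used for offline mass-flash generation are not passed here.

```shell
#!/bin/bash
# Sketch: flash one Jetson Nano eMMC module directly with flash.sh.
# Argument: path to Linux_for_Tegra (defaults to the current directory).
# Runs in a subshell so the caller's working directory is unchanged.
flash_nano_emmc() {
    local l4t_dir="${1:-.}"
    ( cd "$l4t_dir" && sudo ./flash.sh jetson-nano-emmc mmcblk0p1 )
}
```

Usage: `flash_nano_emmc /path/to/Linux_for_Tegra`. Once the hang has been reproduced (or not) with images flashed this way, the massflash package can be reintroduced for factory use.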

JJ.C · March 7, 2022, 6:40am

Hi Wayne,

Is there any way to disable DVFS at boot time by modifying the DTS files? Thanks.

WayneWWW · March 7, 2022, 6:58am

Please refer to

hardware/nvidia/soc/t210/kernel-dts/tegra210-soc/tegra210-power-dvfs.dtsi

and

hardware/nvidia/platform/t210/porg/kernel-dts/porg-platforms/tegra210-porg-power-tree-p3448-0000-a00.dtsi
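After changing those DTS files and reflashing the DTB, one way to check whether CPU frequency scaling is still active is to read the generic Linux cpufreq sysfs attributes. This sketch covers only the CPUs exposed through cpufreq (not EMC or GPU DVFS), and the function name is hypothetical. If min and max report the same value, the CPU frequency is effectively pinned.

```shell
#!/bin/bash
# Sketch: print governor and current/min/max frequency (kHz) for each CPU
# that exposes the generic Linux cpufreq sysfs interface.
# The base directory can be overridden (e.g. for testing).
show_cpufreq() {
    local base="${1:-/sys/devices/system/cpu}"
    local dir cpu
    for dir in "$base"/cpu[0-9]*/cpufreq; do
        [ -d "$dir" ] || continue          # offline CPUs have no cpufreq dir
        cpu=$(basename "$(dirname "$dir")")
        printf '%s governor=%s cur=%s min=%s max=%s\n' "$cpu" \
            "$(cat "$dir/scaling_governor")" \
            "$(cat "$dir/scaling_cur_freq")" \
            "$(cat "$dir/scaling_min_freq")" \
            "$(cat "$dir/scaling_max_freq")"
    done
}
```

Running `show_cpufreq` a few times under different loads shows whether `cur` still moves between `min` and `max`.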

system · March 30, 2022, 3:39am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.