Reboot failure.

ENewnham · March 13, 2015, 6:27pm

I have a Tegra K1 chip that seems to bug out on reboot after the chip has been playing video.

I use this command to play a video:

$ gst-launch-0.10 filesrc location=<filename.mp4> ! qtdemux name=demux \
    demux.video_00 ! queue ! nv_omx_h264dec ! nv_omx_hdmi_videosink -e

Then I let the chip warm up and simply reboot:

$ reboot
[ 67.911308] Restarting system.
[ 67.914399] Restarting Linux version 3.10.40-ged4f697 (enewnham@arch) (gcc ve
rsion 4.8.3 20140401 (prerelease) (crosstool-NG linaro-1.13.1-4.8-2014.04 - Lina
ro GCC 4.8-2014.04) ) #1 SMP PREEMPT Wed Mar 11 14:25:40 EDT 2015
[ 67.914399]

Then this error crops up:

No valid FDT found - please append one to U-Boot binary, use u-boot-dtb.bin or define CONFIG_OF_EMBED. For sandbox, use -d <file.dtb>
initcall sequence 83de1db0 failed at call 83dc388c
### ERROR ### Please RESET the board ###

We saw this issue before on R21.2, but there U-Boot hung without printing any error message; on R21.3 this error appears instead. To make the failure more frequent, you can unplug the fan so the chip runs hotter.

Thanks!

linuxdev · March 13, 2015, 8:50pm

Just as a test, do you have the same result with “shutdown -r now” as with “reboot”? I doubt it differs but it helps to have baseline data.

In your /boot/extlinux/extlinux.conf file, what is the FDT line? An example would be:

FDT /boot/tegra124-jetson_tk1-pm375-000-c00-00.dtb

Also, does this file actually exist at the location named by the FDT entry?

ENewnham · March 16, 2015, 12:21pm

Hey linuxdev,

yup, same result with shutdown -r now. Tried sync; reboot -f as well.

Yes, my FDT line looks like that, and the file does exist at that location. However, I believe U-Boot is failing before that point: this error is the very first line printed after U-Boot prints its version information.

EDIT: I just tried changing my FDT file name, then rebooted: same problem. I do believe this error occurs before the ext4 file system is accessed, so it may be the FDT that is compiled into U-Boot?

But that isn’t ‘corruptible’ because it’s baked into the U-Boot binary, so it must be a CPU runtime issue that causes memory to become corrupted. Possibly heat related?

The best way to recreate this issue is to unplug the fan, play a video, and reboot, repeating until the reboot fails. It can also happen with the fan on; it just happens more often with it off.
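
For anyone repeating this, it may help to note the temperature right before each reboot attempt. Below is a minimal Python sketch. It assumes the kernel exposes the standard Linux thermal sysfs tree under /sys/class/thermal with `temp` reported in millidegrees Celsius, which is not confirmed in this thread for the L4T kernel. The function name `read_zone_temps` is made up for illustration.

```python
from pathlib import Path


def read_zone_temps(base="/sys/class/thermal"):
    """Return {"thermal_zoneN:type": degrees_C} for every readable zone.

    Assumption: each zone directory holds a `type` file (sensor name) and a
    `temp` file in millidegrees Celsius, as in the generic Linux thermal sysfs.
    """
    temps = {}
    for zone in sorted(Path(base).glob("thermal_zone*")):
        try:
            kind = (zone / "type").read_text().strip()
            milli = int((zone / "temp").read_text().strip())
        except (OSError, ValueError):
            continue  # zone missing, unreadable, or not a number: skip it
        temps[f"{zone.name}:{kind}"] = milli / 1000.0
    return temps
```

Printing the result of `read_zone_temps()` just before issuing `reboot`, and writing it next to the serial log, would let failed and successful attempts be compared by temperature.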

linuxdev · March 16, 2015, 8:22pm

With a serial console attached you should be able to log the entire shutdown, up to the point where the reboot fails. Would it be possible to see those logs?
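
One possible way to capture such a log with timing information, sketched in Python. It assumes the serial adapter shows up on the host as a device file (for example /dev/ttyUSB0, a hypothetical name) that has already been configured for the board's console settings. `timestamp_lines` is an illustrative helper, not part of any tool mentioned here.

```python
import sys
import time


def timestamp_lines(src, dst, clock=time.monotonic):
    """Copy lines from a binary stream to a text stream.

    Each output line is prefixed with the seconds elapsed since the copy
    started, so the moment the console goes quiet is easy to spot.
    """
    start = clock()
    for raw in src:
        # Console garbage is expected here, so never fail on odd bytes.
        line = raw.decode("ascii", errors="replace").rstrip("\r\n")
        dst.write(f"{clock() - start:9.3f} {line}\n")
        dst.flush()  # keep the log current even if the board hangs


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (device path is an assumption): python3 caplog.py /dev/ttyUSB0
    with open(sys.argv[1], "rb") as port:
        timestamp_lines(port, sys.stdout)
```

Redirecting the output to a file gives one log per reboot attempt that can be attached to the thread.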

FYI, memory sits next to the tegra124 chip…the memory I’m looking at is probably the same on all Jetsons, but may not be…the ones I see are Samsung. As a test I wonder if there would be a way to disconnect the fan but carefully add some sort of cooling capacity to the memory itself next to the tegra124. The goal being to test tegra124 heating separately from memory heating.

I believe it is more likely the memory cooling would be at issue than the tegra124 heat itself. If you don’t have a way to cool the memory chips with some sort of improvised heat sink, there are spray bottles for fast cooling which could “very carefully” be used on the memory chips. For stress testing you would completely and quickly freeze those chips while monitoring operation…but this is NOT stress testing, all you would want to do is keep memory cool but not frozen while allowing the tegra124 to heat. Or the reverse. Results with memory cooled but not tegra124, or vice versa, would help for knowing if there is a marginal component.

ENewnham · March 17, 2015, 8:30pm

My board uses the HYNIX H5TC4G63AFR-RDA.

The reason I bring this up is that the change log for R21.3 mentions a fix related to reboot stress testing:

[200072946] Improved system stability during extended reboot stress testing

Do you know what is included in the fix? Was this a kernel fix or a U-Boot fix?

linuxdev · March 17, 2015, 9:02pm

Looking more closely at two of my boards, I see it does actually say “hynix”…it’s just hard to read.

I don’t know what the actual fix was, but it’s almost always some sort of timing adjustment and/or voltage adjustment. Initial values are always set up in the boot loader, and a reboot that fails this early probably never gets far enough for the kernel to take control of them. nVidia will probably be interested in this, but they are still likely to want to see the logs from serial console.

ENewnham · March 18, 2015, 3:12pm

Reboot failure seems sporadic, but it is mostly related to temperature: if the CPU temperature is above 70C at reboot, the reboot is more likely to fail, though not guaranteed to. Some more insight:

I then turned on U-Boot debug and arrived at this log.

Broadcast message from root@localhost.localdomain
        (unknown) at 1:04 ...

The system is going down for reboot NOW!
[   67.067735] Restarting system.
[   67.070802] Restarting Linux version 3.10.40-ged4f697 (enewnham@arch) (gcc ve
rsion 4.8.3 20140401 (prerelease) (crosstool-NG linaro-1.13.1-4.8-2014.04 - Lina
ro GCC 4.8-2014.04) ) #2 SMP PREEMPT Mon Mar 16 10:35:39 EDT 2015
[   67.070802] 

U-Boot - Fri Mar 13 11:13:00 EDT 2015boot device - 0
mkimage signature not found - ih_magic = ffffffff
Jumping to U-Boot
image entry point: 0x83D8E000
start_cpu entry, reset_vector = 83d8e000
tegra124_init_clocks entry
Setting up PLLX
init_pllx entry
tegra_get_chip: CHIPID is 0x40
 init_pllx: SoC = 0x40
tegra_get_sku_info: SKU info byte is 0x87
 init_pllx: SKU info byte = 0x87
tegra_get_chip: CHIPID is 0x40
tegra_get_sku_info: SKU info byte is 0x87
 init_pllx: Chip SKU = 4
 init_pllx: osc = 2
tegra_get_chip: CHIPID is 0x40
 pllx_set_rate entry
pllx_set_iddq: IDDQ: PLLX IDDQ = 0x00000000
pllx_set_rate: base = 0x00107401
pllx_set_rate: misc = 0x00040000
pllx_set_rate: base final = 0x40107401
Enabling clocks
Taking periphs out of reset
tegra124_init_clocks exit
enable_cpu_power_rail entry
pmic_enable_cpu_vdd entry
pmic_enable_cpu_vdd: Setting VDD_CORE to 1.0V via AS3722 reg 1/4D, 0x2801
pmic_enable_cpu_vdd: Setting VDD_CPU to 1.0V via AS3722 reg 0/4D, 0x3c00
pmic_enable_cpu_vdd: Setting VDD_GPU to 1.0V via AS3722 reg 6/4D, 0x2806
pmic_enable_cpu_vdd: Set VPP_FUSE to 1.2V via AS3722 reg 0x12/4E
pmic_enable_cpu_vdd: Set VDD_SDMMC to 3.3V via AS3722 reg 0x16/4E
enable_cpu_clocks entry
enable_cpu_clocks: PLLX base = 0x48107401
enable_cpu_clocks: PLLX locked, delay for stable clocks
enable_cpu_clocks: Setting CCLK_BURST and DIVIDER
enable_cpu_clocks: Enabling clock to all CPUs
enable_cpu_clocks: Enabling main CPU complex clocks
enable_cpu_clocks: Done
clock_enable_coresight entry
remove_cpu_resets entry
powerup_cpus entry
powerup_cpus entry: G cluster
powerup_cpus: CRAIL
power_partition: part ID = 00000000
power_partition, toggling state
powerup_cpus: C0NC
power_partition: part ID = 0000000F
powerup_cpus: CE0
power_partition: part ID = 0000000E
power_partition, toggling state
tegra_get_chipopw:e CruHpIP_IcpDu iss:  d0xon4e0
start_cpu exit, should continue @ reset_vector

.... this is where the CPU hangs. It does not continue to the reset_vector.

A successful reboot looks like this:

Broadcast message from root@localhost.localdomain
        (unknown) at 1:03 ...

The system is going down for reboot NOW!
[   80.677904] Restarting system.
[   80.681061] Restarting Linux version 3.10.40-ged4f697 (enewnham@arch) (gcc ve
rsion 4.8.3 20140401 (prerelease) (crosstool-NG linaro-1.13.1-4.8-2014.04 - Lina
ro GCC 4.8-2014.04) ) #2 SMP PREEMPT Mon Mar 16 10:35:39 EDT 2015
[   80.681061] 

U-Boot - Fri Mar 13 11:13:00 EDT 2015boot device - 0
mkimage signature not found - ih_magic = ffffffff
Jumping to U-Boot
image entry point: 0x83D8E000
start_cpu entry, reset_vector = 83d8e000
tegra124_init_clocks entry
Setting up PLLX
init_pllx entry
tegra_get_chip: CHIPID is 0x40
 init_pllx: SoC = 0x40
tegra_get_sku_info: SKU info byte is 0x87
 init_pllx: SKU info byte = 0x87
tegra_get_chip: CHIPID is 0x40
tegra_get_sku_info: SKU info byte is 0x87
 init_pllx: Chip SKU = 4
 init_pllx: osc = 2
tegra_get_chip: CHIPID is 0x40
 pllx_set_rate entry
pllx_set_iddq: IDDQ: PLLX IDDQ = 0x00000000
pllx_set_rate: base = 0x00107401
pllx_set_rate: misc = 0x00040000
pllx_set_rate: base final = 0x40107401
Enabling clocks
Taking periphs out of reset
tegra124_init_clocks exit
enable_cpu_power_rail entry
pmic_enable_cpu_vdd entry
pmic_enable_cpu_vdd: Setting VDD_CORE to 1.0V via AS3722 reg 1/4D, 0x2801
pmic_enable_cpu_vdd: Setting VDD_CPU to 1.0V via AS3722 reg 0/4D, 0x3c00
pmic_enable_cpu_vdd: Setting VDD_GPU to 1.0V via AS3722 reg 6/4D, 0x2806
pmic_enable_cpu_vdd: Set VPP_FUSE to 1.2V via AS3722 reg 0x12/4E
pmic_enable_cpu_vdd: Set VDD_SDMMC to 3.3V via AS3722 reg 0x16/4E
enable_cpu_clocks entry
enable_cpu_clocks: PLLX base = 0x48107401
enable_cpu_clocks: PLLX locked, delay for stable clocks
enable_cpu_clocks: Setting CCLK_BURST and DIVIDER
enable_cpu_clocks: Enabling clock to all CPUs
enable_cpu_clocks: Enabling main CPU complex clocks
enable_cpu_clocks: Done
clock_enable_coresight entry
remove_cpu_resets entry
powerup_cpus entry
powerup_cpus entry: G cluster
powerup_cpus: CRAIL
power_partition: part ID = 00000000
power_partition, toggling state
powerup_cpus: C0NC
power_partition: part ID = 0000000F
powerup_cpus: CE0
power_partition: part ID = 0000000E
power_partition, toggling state
tegra_get_chipiopw: erCuHpI_PcIpDu si:s  0doxn40e
opw: erCuHpI_PcIpDu si:s  done

e tegarrat__cgeptu_ echxiipt,:  CshHIoPulIdD  icson 0tixn4u0
  @ res�initcall: 83dc71c0
initcall: 83dc9008


U-Boot 2014.10-rc2-svn7540 (Mar 18 2015 - 10:15:23)

initcall: 83d96b10
U-Boot code: 83D8E000 -> 83DEDDA4  BSS: -> 83F43FBC
initcall: 83d904c0
TEGRA124
initcall: 83d8f70c
Board: NVIDIA Jetson TK1
initcall: 83d96b58

............ then it continues on into linux.

What is strange is that the last legible debug message is “power_partition, toggling state”; after that the output becomes garbled. I believe this garbling is a result of the CPU beginning a reset cycle.

All this fun stuff is occurring in:

src/arch/arm/cpu/arm720t/tegra124/cpu.c
277:void start_cpu(u32 reset_vector)

linuxdev · March 18, 2015, 4:42pm

If the u-boot code logic was the cause of failure, then the odds are high that the failure would be consistent and not change just with increased heat. A marginal voltage or clock setting could do this, but then I would expect some instability under a wider set of circumstances outside of boot loader execution. The log is kind of a “smoking gun” that the failure begins during the boot loader and never reaches the kernel.

What I find interesting is that the first success/fail difference seems to occur at line 62 “tegra_get_chipopw”. I see several “tegra_get_chip…” calls in R21.3 u-boot source, but I do not see “iopw” anywhere in any of the source files. I’m not sure what to make of that.
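
The first differing line between a good and a bad capture can also be located mechanically. Here is a rough Python sketch. It assumes each log has been saved as a list of lines and that comparison should begin at the first line starting with “U-Boot”, so that the kernel timestamps before it (which always differ) are skipped. `first_divergence` is a hypothetical helper name.

```python
def first_divergence(good, bad, start_marker="U-Boot"):
    """Return 1-based (good_line_no, bad_line_no) of the first mismatch.

    Comparison starts at the first line in each log that begins with
    start_marker, or at the top when the marker is absent. Returns None
    when the compared parts are identical.
    """
    def start(lines):
        for i, line in enumerate(lines):
            if line.startswith(start_marker):
                return i
        return 0  # marker absent: compare from the top

    gi, bi = start(good), start(bad)
    while gi < len(good) and bi < len(bad):
        # Trailing whitespace differs between captures, so ignore it.
        if good[gi].rstrip() != bad[bi].rstrip():
            return gi + 1, bi + 1
        gi += 1
        bi += 1
    if gi < len(good) or bi < len(bad):
        return gi + 1, bi + 1  # one log simply ends earlier
    return None
```

Note that a mismatch found this way only says where the two captures stop being byte-identical; it does not by itself say that different code ran.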

The “power_partition, toggling state” is identical between success and failure. I’m not all that familiar with this code, but it seems that this function is activating eMMC in some way. The code which fails once past this function tends to make me believe the issue is not the cpu reset cycle, but instead a memory access failure (reset would be a side-effect of the failure, but not the cause of the failure). The last part of this function is:

/* Give I/O signals time to stabilize */
udelay(IO_STABILIZATION_DELAY);

I have not examined earlier versions of u-boot, so I don’t know if anything here has changed recently, but this final setting makes me very very suspicious that this “stabilization” delay is for the very purpose of preventing the issue you are running into.

In “arch/arm/cpu/arm720t/tegra-common/cpu.h” “IO_STABILIZATION_DELAY” is defined as:

/* Stabilization delays, in usec */
...
#define IO_STABILIZATION_DELAY  (1000)

I don’t know if you are feeling adventurous, but I wonder if arbitrarily increasing IO_STABILIZATION_DELAY to something like 1250 or 1500 would increase reboot success under stress based on heat.

ENewnham · March 19, 2015, 12:36pm

Line 62 is actually garbled, and is a combination of a few print statements.

tegra_get_chipopw:e CruHpIP_IcpDu iss:  d0xon4e0
tegra_get_chip: CHIPID is 0x40
powerup_cpus: done

which is funky to say the least. I will try increasing the delays.
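
That reading can be checked mechanically: the garbled line should be a character-by-character merge of the two messages, with each message keeping its own character order. A small Python sketch using the usual dynamic-programming test for interleaved strings follows; `is_interleaving` is just an illustrative name.

```python
def is_interleaving(mixed, a, b):
    """True if mixed is a merge of a and b that keeps each string's order."""
    if len(mixed) != len(a) + len(b):
        return False
    # reachable[j]: mixed[:i + j] can be built from a[:i] and b[:j]
    reachable = [False] * (len(b) + 1)
    reachable[0] = True
    for j in range(1, len(b) + 1):
        reachable[j] = reachable[j - 1] and b[j - 1] == mixed[j - 1]
    for i in range(1, len(a) + 1):
        reachable[0] = reachable[0] and a[i - 1] == mixed[i - 1]
        for j in range(1, len(b) + 1):
            k = i + j - 1  # index in mixed of the character being placed
            from_a = reachable[j] and a[i - 1] == mixed[k]
            from_b = reachable[j - 1] and b[j - 1] == mixed[k]
            reachable[j] = from_a or from_b
    return reachable[len(b)]


garbled = "tegra_get_chipopw:e CruHpIP_IcpDu iss:  d0xon4e0"
print(is_interleaving(garbled,
                      "tegra_get_chip: CHIPID is 0x40",
                      "powerup_cpus: done"))  # prints True
```

For the line quoted above the check succeeds, which is consistent with two print statements being emitted at the same time rather than with an unknown “iopw” message.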

chuang · April 2, 2015, 10:19pm

The reboot issue addressed by R21.3 can be found in the release notes:
http://developer.download.nvidia.com/embedded/L4T/r21_Release_v3.0/Tegra_Linux_Driver_Package_Release_Notes_R21.3.pdf

There is one entry about reboot stress testing.

youngk · April 3, 2015, 4:52am

Let me ask, from a H/W perspective, whether this is a memory corruption issue caused by a memory layout difference.

1. Are you using PM375_Hynix_2GB_H5TC4G63AFR_RDA_924MHz.cfg in \bootloader\ardbeg\BCT?
2. Is this your own board design rather than a Jetson?
3. If yes to 2: are the memory-related layout, PCB stackup and PCB material all exactly the same as on the Jetson?
4. If no to 3: have you followed the memory layout requirements in the Tegra K1 Embedded Platform Design Guide on the Jetson portal?
5. If yes to 2: have you gone through the memory characterization process to generate an optimal .cfg (memory controller settings) file for your memory layout?
   Please refer to https://developer.nvidia.com/rdp/assets/tegra-k1-memory-characterization

You may also want to try nvflash with the lower-clocked PM375_Hynix_2GB_H5TC4G63AFR_RDA_792MHz.cfg in \bootloader\ardbeg\BCT to see whether the issue becomes less reproducible.
