Jetson TK1 sometimes fails to boot with u-boot

Hey,

After using u-boot to test my custom kernel, I have run into a weird problem.
The board sometimes (roughly 1 out of 10-15 boots) does not boot correctly.

It gets stuck here:

MSELECT error detected! status=0x100
[    5.786767] tegra-hier-ictlr tegra-hier-ictlr: probed
[    5.807609] ------------[ cut here ]------------
[    5.817829] kernel BUG at /dvs/git/dirty/git-master_linux/kernel/drivers/platform/tegra/hier_ictlr/hier_ictlr.c:54!
[    5.834037] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM
[    5.845660] Modules linked in:
[    5.854539] CPU: 0 PID: 6 Comm: kworker/u8:0 Not tainted 3.10.40-gdacac96 #1
[    5.867444] Workqueue: kmmcd mmc_rescan
[    5.877163] task: ee0a7080 ti: ee106000 task.ti: ee106000
[    5.888485] PC is at tegra_hier_ictlr_irq_handler+0x38/0x40
[    5.899928] LR is at tegra_hier_ictlr_irq_handler+0x38/0x40
[    5.911263] pc : [<c0622e8c>]    lr : [<c0622e8c>]    psr: 20000193
[    5.911263] sp : ee107ba8  ip : 00000000  fp : c0d13d98
[    5.934318] r10: 00000000  r9 : 00000000  r8 : ee106000
[    5.945336] r7 : ee107c48  r6 : 0000009f  r5 : c0bd5910  r4 : ed0c06c0
[    5.957663] r3 : 00000002  r2 : c0c99f70  r1 : 20000193  r0 : 00000024
[    5.969979] Flags: nzCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment kernel
[    5.983208] Control: 10c5387d  Table: 8000406a  DAC: 00000015
[    5.994810] 
[    5.994810] PC: 0xc0622e0c:
[    6.010660] 2e0c  ebf80a42 e1a01004 e3a02000 e1a00005 ebf819ac e1a00004 e3a01000 ebf81a34
[    6.024928] 2e2c  e1a00004 e59f1008 e3a02000 e8bd4070 eaf81960 c0cf3098 e92d4000 e8bd4000
[    6.039247] 2e4c  e3a00004 e12fff1e e92d4008 e92d4000 e8bd4000 e1a00001 ebf7e1c5 e590300c
[    6.053564] 2e6c  e5931060 f57ff04f e3510000 1a000001 e3a00001 e8bd8008 e59f0004 eb07bd64
[    6.067954] 2e8c  e7f001f2 c0a97fec e92d40f8 e92d4000 e8bd4000 e1a06001 e1a07002 e3a01c02
[    6.082417] 2eac  e3a02000 e1a04000 ebf7e9b0 e2501000 0a000012 e2845010 e1a00005 ebf28b09
[    6.096948] 2ecc  e3500000 e5860000 0a000016 e3a01c02 e1a00004 e3a02001 ebf7e9a4 e2501000
[    6.111540] 2eec  0a00000b e1a00005 ebf28afe e3500000 e5870000 0a000010 e3a00000 e8bd80f8
[    6.126197] 
...

I went back to using my original image and still had the problem with the original kernel from L4T 21.4.

When booting the same kernel with fastboot, the issue did not occur. At least I was not able to reproduce it.

My u-boot kernel command line is unmodified and looks like this:

console=ttyS0,115200n8 console=tty1 no_console_suspend=1 lp0_vec=2064@0xf46ff000 mem=2015M@2048M memtype=255 ddr_die=2048M@2048M section=256M pmuboard=0x0177:0x0000:0x02:0x43:0x00 tsec=32M@3913M otf_key=c75e5bb91eb3bd947560357b64422f85 usbcore.old_scheme_first=1 core_edp_mv=1150 core_edp_ma=4000 tegraid=40.1.1.0.0 debug_uartport=lsport,3 power_supply=Adapter audio_codec=rt5640 modem_id=0 android.kerneltype=normal fbcon=map:1 commchip_id=0 usb_port_owner_info=0 lane_owner_info=6 emc_max_dvfs=0 touch_id=0@0 board_info=0x0177:0x0000:0x02:0x43:0x00 root=/dev/mmcblk0p1 rw rootwait tegraboot=sdmmc gpt

The fastboot command line looks like this:

fbcon=map:1 tegraid=40.1.1.0.0 mem=1862M@2048M memtype=255 vpr=151M@3945M tsec=32M@3913M otf_key=20a408444a4cd8f593bb2b3bdd6073d6 ddr_die=2048M@2048M section=256M commchip_id=0 usb_port_owner_info=0 lane_owner_info=6 emc_max_dvfs=0 touch_id=0@10 video=tegrafb no_console_suspend=1 console=ttyS0,115200n8 debug_uartport=lsport,3 console=tty1 sku_override=0 usbcore.old_scheme_first=1 lp0_vec=2080@0xf46ff000 tegra_fbmem=32899072@0xad012000 core_edp_mv=1150 core_edp_ma=4000 pmuboard=0x0177:0x0000:0x03:0x45:0x00 power_supply=Adapter board_info=0x0177:0x0000:0x03:0x45:0x00 root=/dev/mmcblk0p1 rw rootwait tegraboot=sdmmc gpt gpt_sector=94207 modem_id=0 watchdog=disable android.kerneltype=normal

I have noticed that the 2015M of RAM given in the u-boot options is split into mem=1862M and vpr=151M for fastboot. Could this be an issue? I have seen other u-boot command lines quoted that use just 1862M of RAM.
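As a quick sanity check on those numbers, the regions can be laid out by hand. The sketch below assumes that every size@start pair (mem, tsec, vpr) is in MiB and describes a physical address range. That is the usual meaning for mem=, but for tsec= and vpr= it is only an assumption here. The names struct region, region_end and regions_overlap are illustrative and not part of L4T.

```c
#include <stdbool.h>

/* One "size@start" entry from the kernel command line, values in MiB. */
struct region {
	const char *name;
	unsigned int size_mib;
	unsigned int start_mib;
};

/* First MiB past the end of the region (exclusive end). */
static unsigned int region_end(const struct region *r)
{
	return r->start_mib + r->size_mib;
}

/* Two half-open ranges overlap if each one starts before the other ends. */
static bool regions_overlap(const struct region *a, const struct region *b)
{
	return a->start_mib < region_end(b) && b->start_mib < region_end(a);
}
```

With the values from the two command lines, mem=1862M@2048M ends at 3910M, below tsec (3913M to 3945M) and vpr (3945M to 4096M), while mem=2015M@2048M ends at 4063M and overlaps both. Also, 1862M + 151M is 2013M rather than 2015M. Whether the kernel tolerates such an overlap is not something this calculation can answer.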

Side info: I changed my partition table to use one system partition of 11580MiB and one additional 3000MiB partition to store my application data.

My question is: what could be causing the random boot failures with the u-boot bootloader?

Additionally, I wonder if anyone can tell me what the "vpr=xxMB" option does?
Using IGNOREFASTBOOTCMDLINE="vpr=151M@3945M" and
CMDLINE_ADD="fbcon=map:1 usbcore.usbfs_memory_mb=1000 mem=2015M@2048M";
resulted in a system that does not boot with fastboot.
I'd like to have at least the 1960MB of RAM in fastboot if I have to use it to make sure that the system always boots.

Quick Update:

The problem seems to be that the system generates an interrupt directly after the interrupt handler is registered. The value read from mselect_base + MSELECT_ERROR_STATUS_0 is then non-zero, which produces the "MSELECT error detected! status=0x%x\n" message.

Since this error is shown before the module probe has finished, I'd like to know whether the problem could be caused by "tegra_hier_ictlr_create_sysfs(pdev);" being run AFTER registering the interrupt handler and not before?

Code of the function, from drivers/platform/tegra/hier_ictlr/hier_ictlr.c:

static int tegra_hier_ictlr_probe(struct platform_device *pdev)
{
	struct tegra_hier_ictlr *ictlr;
	int ret;

	ictlr = devm_kzalloc(&pdev->dev, sizeof(struct tegra_hier_ictlr),
		GFP_KERNEL);
	if (!ictlr)
		return -ENOMEM;

	dev_set_drvdata(&pdev->dev, ictlr);

	ret = tegra_hier_ictlr_map_memory(pdev, ictlr);
	if (ret)
		return ret;

	ret = tegra_hier_ictlr_mselect_init(pdev, ictlr);
	if (ret)
		return ret;

	ret = tegra_hier_ictlr_irq_init(pdev, ictlr);
	/* Interrupt happened after this */
	if (ret)
		return ret;

	tegra_hier_ictlr_create_sysfs(pdev);

	/* Interrupt happened before this */
	dev_notice(&pdev->dev, "probed\n");

	return 0;
}

Interrupt handler, from drivers/platform/tegra/hier_ictlr/hier_ictlr.c:

static irqreturn_t tegra_hier_ictlr_irq_handler(int irq, void *data)
{
	struct device *dev = data;
	struct tegra_hier_ictlr *ictlr = dev_get_drvdata(dev);
	unsigned long status;

	status = readl(ictlr->mselect_base + MSELECT_ERROR_STATUS_0);
	if (status != 0) {
		pr_err("MSELECT error detected! status=0x%x\n",
			(unsigned int)status);
		BUG();
	}

	return IRQ_HANDLED;
}
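To make the suspected sequence concrete, here is a small userspace model of it. It is only a sketch: struct fake_mselect, handler_would_bug and register_irq are hypothetical stand-ins, and the model assumes that an error bit latched before the driver probes makes the interrupt fire as soon as the handler is registered. Nothing here has been confirmed on real hardware.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the memory-mapped MSELECT block. */
struct fake_mselect {
	uint32_t error_status;	/* models MSELECT_ERROR_STATUS_0 */
	bool irq_registered;
};

/* Mirrors the check in the handler: any non-zero status ends in BUG(). */
static bool handler_would_bug(const struct fake_mselect *hw)
{
	return hw->error_status != 0;
}

/*
 * Models registering the handler. Assumption: if an error is already
 * latched, the interrupt is delivered immediately, so the handler runs
 * before the rest of probe. Returns true if the model "crashed".
 */
static bool register_irq(struct fake_mselect *hw)
{
	hw->irq_registered = true;
	return handler_would_bug(hw);
}
```

In this model a status of 0x100 left over from an earlier boot stage crashes on registration, while a status of 0 does not. That would fit a failure that depends on what ran before the kernel rather than on the kernel image itself.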

Maybe now someone is able to shed some light on this issue.

"I went back to using my original image and still had the problem with the original kernel from L4T 21.4."
=> Is the failure frequency the same, about 1 out of 10-15 boots?

VPR is the video protection region. You can find a brief description and the physical address map in the TRM.

I can't say for certain that I get one failure every 15 boots, but it feels like the same frequency in both cases, yes.

Thanks for the info about vpr!

In the meantime I have modified hier_ictlr.c and moved "tegra_hier_ictlr_create_sysfs(pdev);"
above the registration of the interrupt handler.

Since then I have successfully rebooted my board 160 times with u-boot without a single kernel panic.
It might be coincidence, but the change seems to have an effect, either because it is really necessary or just because it gives the system more time to initialize…

Thanks for your posts. This looks like a related issue:

[PATCH] tegra: ictlr: fix crash when an IRQ fire during the probe

The IRQ handler use drvdata, however drvdata was set *after*
registering the IRQ handler. If an IRQ fired before drvdata was set it
would crash the kernel. Fix this by setting drvdata before registering
the IRQ handler.

---
 drivers/platform/tegra/hier_ictlr/hier_ictlr.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/platform/tegra/hier_ictlr/hier_ictlr.c b/drivers/platform/tegra/hier_ictlr/hier_ictlr.c
index 1859d97..dbc65d6 100644
--- a/drivers/platform/tegra/hier_ictlr/hier_ictlr.c
+++ b/drivers/platform/tegra/hier_ictlr/hier_ictlr.c
@@ -160,6 +160,8 @@ static int tegra_hier_ictlr_probe(struct platform_device *pdev)
 	if (!ictlr)
 		return -ENOMEM;
 
+	dev_set_drvdata(&pdev->dev, ictlr);
+
 	ret = tegra_hier_ictlr_map_memory(pdev, ictlr);
 	if (ret)
 		return ret;
@@ -176,7 +178,6 @@ static int tegra_hier_ictlr_probe(struct platform_device *pdev)
 
 	dev_notice(&pdev->dev, "probed\n");
 
-	dev_set_drvdata(&pdev->dev, ictlr);
 	return 0;
 }
 
-- 
2.1.4

Thanks for your answer, dusty. I have seen that the bug from 21.3 was fixed in 21.4, which is the version I am using at the moment. I checked the file and the fix is already applied.

The only other thing I noticed that could be related is that the sysfs file is not created until after the interrupt registration here:

ret = tegra_hier_ictlr_irq_init(pdev, ictlr);
if (ret)
	return ret;

tegra_hier_ictlr_create_sysfs(pdev);

After moving it to an earlier place, the panic no longer happens, at least for now. Could you check with your team whether the two are related, or whether it only works now by coincidence due to timing changes? I am not an expert in driver programming, so that would be really useful.

Hey,

any news from your side, dusty?

Please apply this patch if you are seeing this error. It will be part of the next release.

diff --git a/drivers/platform/tegra/hier_ictlr/hier_ictlr.c b/drivers/platform/tegra/hier_ictlr/hier_ictlr.c
index 65999ee..837e546 100644
--- a/drivers/platform/tegra/hier_ictlr/hier_ictlr.c
+++ b/drivers/platform/tegra/hier_ictlr/hier_ictlr.c
@@ -40,6 +40,7 @@
 #define MSELECT_TIMEOUT_TIMER_0                                     0x5c
 #define MSELECT_ERROR_STATUS_0                                      0x60
 #define MSELECT_DEFAULT_TIMEOUT                                 0xFFFFFF
+#define MSELECT_ERROR_STATUS_CLEAR				0x3FF
 
 static irqreturn_t tegra_hier_ictlr_irq_handler(int irq, void *data)
 {
@@ -126,6 +127,9 @@ static int tegra_hier_ictlr_mselect_init(struct platform_device *pdev,
 
 	tegra_hier_ictlr_set_mselect_timeout(ictlr, MSELECT_DEFAULT_TIMEOUT);
 
+	/*clear error status register */
+	writel(MSELECT_ERROR_STATUS_CLEAR,
+			ictlr->mselect_base + MSELECT_ERROR_STATUS_0);
 	reg = readl(ictlr->mselect_base + MSELECT_CONFIG_0);
 	writel(reg |
 		((1 << MSELECT_CONFIG_0_READ_TIMEOUT_EN_SLAVE0_SHIFT)  |
-- 
2.1.4
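
For readers wondering why 0x3FF is enough to clear the 0x100 status from the log, here is a small bit-level sketch. It assumes the status register is write-one-to-clear, which the patch suggests but the thread does not state, so check the TRM. w1c_write is a hypothetical helper that models such a write in plain C.

```c
#include <stdint.h>

#define MSELECT_ERROR_STATUS_CLEAR 0x3FFu	/* value taken from the patch */

/*
 * Assumption: write-one-to-clear semantics, i.e. every bit set in the
 * written mask is cleared in the register and all other bits are kept.
 */
static uint32_t w1c_write(uint32_t current_status, uint32_t written_mask)
{
	return current_status & ~written_mask;
}
```

Under that assumption, the mask 0x3FF covers bits 0 to 9. The reported status 0x100 is bit 8, so the write clears it before the handler ever reads the register, whereas a bit outside the mask such as 0x400 would be left set.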