Jetson TK1: What does the "MSELECT error" mean? How to debug it? TK1 crashed when working w

Hi,

Our Jetson TK1 project hits an MSELECT error and crashes while we are debugging PCIe. The project uses both LVDS and PCIe, and the Jetson TK1 board communicates with an FPGA board through the mini-PCIe interface. PCIe works properly at first, but the system crashes after running for a while (sometimes after 5 minutes, sometimes after 2 hours).
PCIe appears to time out before the crash (the PCIe data log stops for a while before the crash output appears). The debug port prints the following:

[10:12:47]MSELECT error detected! status=0x4
[10:12:47]------------[ cut here ]------------
[10:12:47][ 106.381433] ------------[ cut here ]------------
[10:12:47]kernel BUG at /home/evan/Jetson_TK1/JetPackTK1-1.2/Linux_for_Tegra/sources/kernel_source/drivers/platform/tegra/hier_ictlr/hier_ictlr.c:59!
[10:12:47][ 106.399356] kernel BUG at /home/evan/Jetson_TK1/JetPackTK1-1.2/Linux_for_Tegra/sources/kernel_source/drivers/platform/tegra/hier_ictlr/hier_ictlr.c:59!
[10:12:47]Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM
[10:12:47][ 106.418226] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM
[10:12:47]Modules linked in:[ 106.425665] Modules linked in:
[10:12:47]
[10:12:47]CPU: 0 PID: 2158 Comm: pcie_preview_02 Tainted: G W 3.10.40-gdacac96-dirty #86

I debugged this and found that the trigger is located in:
drivers/platform/tegra/hier_ictlr/hier_ictlr.c
In the hier_ictlr ISR, the kernel hits BUG() as soon as an MSELECT error is detected:

static irqreturn_t tegra_hier_ictlr_irq_handler(int irq, void *data)
{
	status = readl(ictlr->mselect_base + MSELECT_ERROR_STATUS_0);
	if (status != 0) {
		printk(KERN_ERR "MSELECT error detected! status=0x%x\n",
			(unsigned int)status);
		BUG();    // <--- trigger here
	}
}

After searching TegraK1_TRM, I can't find detailed information about the MSELECT status, and I don't know what "status=0x4" above means. The manual has very little information on MSELECT, so it is difficult to understand the MSELECT mechanism from it.

So does anybody know what the "MSELECT error" above means, or what I should do to debug this problem?
Any help is much appreciated!

BTW, after debugging for days, I found that the probability of the MSELECT error drops sharply after turning off the system's suspend function (in the Ubuntu desktop, "All Settings" -> "Brightness & Lock" -> "Turn Screen off when inactive for xx minute", change it to "Never"). The error is still there, though, and occurs with a small probability.

Regards

Hello, evanxiao:
Bit 2 of the MSELECT error status means 'M1 read timeout error'; it is set by hardware when the error is encountered.
Given your environment, the most probable cause is that the PCIe device does not respond.
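
To connect this to the reported value: status=0x4 is binary 100, so bit 2 is the only bit set. A minimal user-space sketch of that check follows. The macro and function names are hypothetical and are not taken from the kernel source, and only the meaning of bit 2 comes from this thread; other bits are reported as unknown.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical name; only the meaning of bit 2 is stated in this thread. */
#define MSELECT_ERR_M1_READ_TIMEOUT_BIT 2u

/* Returns true when bit 2 (M1 read timeout error) is set in the status value. */
static bool mselect_is_m1_read_timeout(uint32_t status)
{
	return ((status >> MSELECT_ERR_M1_READ_TIMEOUT_BIT) & 1u) != 0;
}

/* Prints one line per set bit and returns how many bits were set. */
static int mselect_print_status(uint32_t status)
{
	int count = 0;

	for (unsigned int bit = 0; bit < 32; bit++) {
		if (!(status & (UINT32_C(1) << bit)))
			continue;
		count++;
		if (bit == MSELECT_ERR_M1_READ_TIMEOUT_BIT)
			printf("bit %u: M1 read timeout error\n", bit);
		else
			printf("bit %u: meaning not given in this thread\n", bit);
	}
	return count;
}
```

Calling mselect_print_status(0x4) reports a single set bit, bit 2, which matches the "status=0x4" line in the log.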

You can apply the following patch to avoid a system hang when the error happens, but it is still necessary to root-cause why the error happens.

diff --git a/drivers/platform/tegra/hier_ictlr/hier_ictlr.c b/drivers/platform/tegra/hier_ictlr/hier_ictlr.c
index dbc65d6..65999ee 100644
--- a/drivers/platform/tegra/hier_ictlr/hier_ictlr.c
+++ b/drivers/platform/tegra/hier_ictlr/hier_ictlr.c
@@ -51,7 +51,7 @@ static irqreturn_t tegra_hier_ictlr_irq_handler(int irq, void *data)
 	if (status != 0) {
 		pr_err("MSELECT error detected! status=0x%x\n",
 			(unsigned int)status);
-		BUG();
+		WARN_ON(1);
 	}
 
 	return IRQ_HANDLED;

diff --git a/drivers/platform/tegra/hier_ictlr/hier_ictlr.c b/drivers/platform/tegra/hier_ictlr/hier_ictlr.c
index 65999ee..0720b5d 100644
--- a/drivers/platform/tegra/hier_ictlr/hier_ictlr.c
+++ b/drivers/platform/tegra/hier_ictlr/hier_ictlr.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2013-2014, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2013-2015, NVIDIA CORPORATION. All rights reserved.
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
@@ -40,6 +40,7 @@
 #define MSELECT_TIMEOUT_TIMER_0 0x5c
 #define MSELECT_ERROR_STATUS_0 0x60
 #define MSELECT_DEFAULT_TIMEOUT 0xFFFFFF
+#define MSELECT_ERROR_STATUS_CLEAR 0x3FF
 
 static irqreturn_t tegra_hier_ictlr_irq_handler(int irq, void *data)
 {
@@ -126,6 +127,9 @@ static int tegra_hier_ictlr_mselect_init(struct platform_device *pdev,
 
 	tegra_hier_ictlr_set_mselect_timeout(ictlr, MSELECT_DEFAULT_TIMEOUT);
+	/* clear error status register */
+	writel(MSELECT_ERROR_STATUS_CLEAR,
+		ictlr->mselect_base + MSELECT_ERROR_STATUS_0);
 
 	reg = readl(ictlr->mselect_base + MSELECT_CONFIG_0);
 	writel(reg |
 		((1 << MSELECT_CONFIG_0_READ_TIMEOUT_EN_SLAVE0_SHIFT) |

br
ChenJian

Hello, ChenJian,

Thank you very much for your reply.
I will try your patch and check on the FPGA side to find out whether the problem is in PCIe.

Regards