Boot stuck

Hi guys, I ran into a problem while running a reboot test. The test checks that the AQC113 connected to the 14160000.pcie controller is enumerated successfully. My reboot script is as follows:

# ~/.profile
bash ./reboot_test.sh

# ~/reboot_test.sh
if lspci | grep Aquantia;then
        echo rebooting
        sleep 10
        echo password | sudo -S reboot
fi
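
A slightly more defensive variant of the script above could log the result of every boot and keep a boot counter, so that a hang can later be matched to a boot number. This is only a sketch: STATE_DIR, nic_present and record_boot are hypothetical names, and the reboot itself is left to the caller.

```shell
#!/usr/bin/env bash
# Sketch only: STATE_DIR, nic_present and record_boot are illustrative names,
# not part of the original script.
STATE_DIR="${STATE_DIR:-$HOME/reboot_test}"

# Succeeds if the given lspci output contains an Aquantia device.
nic_present() {
    printf '%s\n' "$1" | grep -q 'Aquantia'
}

# Increments the boot counter, appends one result line, prints the new count.
record_boot() {
    local result="$1" count=0
    mkdir -p "$STATE_DIR"
    if [ -f "$STATE_DIR/count" ]; then
        count="$(cat "$STATE_DIR/count")"
    fi
    count=$((count + 1))
    echo "$count" > "$STATE_DIR/count"
    echo "boot $count: $result" >> "$STATE_DIR/results.log"
    echo "$count"
}

# Example use from ~/.profile (not executed here):
#   if nic_present "$(lspci)"; then
#       record_boot enumerated; sleep 10; sudo reboot
#   else
#       record_boot missing      # stay in the shell for debugging
#   fi
```

Keeping the counter on disk means the last line of results.log shows how many boots succeeded before the one that hung.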

I’ve also modified the kernel code to print some debug info. Here is the git diff:

diff --git a/drivers/pci/controller/dwc/pcie-designware-host.c b/drivers/pci/controller/dwc/pcie-designware-host.c
index ae511b84b3d8..66ff99ba3ea6 100644
--- a/drivers/pci/controller/dwc/pcie-designware-host.c
+++ b/drivers/pci/controller/dwc/pcie-designware-host.c
@@ -417,6 +417,7 @@ int dw_pcie_host_init(struct pcie_port *pp)
 			goto err_free_msi;
 	}
 
+	printk("ben %s: %d %s, %s\n", __FILE__, __LINE__, __func__, "");
 	/* Ignore errors, the link may come up later */
 	dw_pcie_wait_for_link(pci);
 
diff --git a/drivers/pci/controller/dwc/pcie-designware.c b/drivers/pci/controller/dwc/pcie-designware.c
index 5563979310ba..6de020064313 100644
--- a/drivers/pci/controller/dwc/pcie-designware.c
+++ b/drivers/pci/controller/dwc/pcie-designware.c
@@ -550,6 +550,8 @@ int dw_pcie_wait_for_link(struct dw_pcie *pci)
 {
 	int retries;
 
+        dev_info(pci->dev, "ben ");
+	dump_stack();
 	/* Check if the link is up or not */
 	for (retries = 0; retries < LINK_WAIT_MAX_RETRIES; retries++) {
 		if (dw_pcie_link_up(pci)) {
@@ -557,6 +559,7 @@ int dw_pcie_wait_for_link(struct dw_pcie *pci)
 			return 0;
 		}
 		usleep_range(LINK_WAIT_USLEEP_MIN, LINK_WAIT_USLEEP_MAX);
+		dev_info(pci->dev, "ben: wait loop (%d/%d)\n", retries+1, LINK_WAIT_MAX_RETRIES);
 	}
 
 	dev_info(pci->dev, "Phy link never came up\n");
@@ -569,10 +572,12 @@ int dw_pcie_link_up(struct dw_pcie *pci)
 {
 	u32 val;
 
+	//dev_info(pci->dev, "ben %s: %d %s, before get link up\n", __FILE__, __LINE__, __func__);
 	if (pci->ops && pci->ops->link_up)
 		return pci->ops->link_up(pci);
 
 	val = readl(pci->dbi_base + PCIE_PORT_DEBUG1);
+	dev_info(pci->dev, "ben %s: %d %s, PCIE_PORT_DEBUG1=0x%08X\n", __FILE__, __LINE__, __func__, val);
 	return ((val & PCIE_PORT_DEBUG1_LINK_UP) &&
 		(!(val & PCIE_PORT_DEBUG1_LINK_IN_TRAINING)));
 }
diff --git a/drivers/pci/controller/dwc/pcie-designware.h b/drivers/pci/controller/dwc/pcie-designware.h
index 76c57d4fa714..fd4c78dab204 100644
--- a/drivers/pci/controller/dwc/pcie-designware.h
+++ b/drivers/pci/controller/dwc/pcie-designware.h
@@ -25,7 +25,7 @@
 #define DW_PCIE_VER_562A		0x3536322A
 
 /* Parameters for the waiting for link up routine */
-#define LINK_WAIT_MAX_RETRIES		10
+#define LINK_WAIT_MAX_RETRIES		100
 #define LINK_WAIT_USLEEP_MIN		90000
 #define LINK_WAIT_USLEEP_MAX		100000
 
diff --git a/drivers/pci/controller/dwc/pcie-tegra194.c b/drivers/pci/controller/dwc/pcie-tegra194.c
index a44b477cee27..77fd5cdaaa4b 100644
--- a/drivers/pci/controller/dwc/pcie-tegra194.c
+++ b/drivers/pci/controller/dwc/pcie-tegra194.c
@@ -1548,6 +1548,7 @@ static int tegra_pcie_dw_link_up(struct dw_pcie *pci)
 {
 	struct tegra_pcie_dw *pcie = to_tegra_pcie(pci);
 	u32 val = dw_pcie_readw_dbi(pci, pcie->pcie_cap_base + PCI_EXP_LNKSTA);
+        dev_info(pci->dev, "%s: val=%u", __func__, val);
 
 	return !!(val & PCI_EXP_LNKSTA_DLLLA);
 }
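
For scale, the LINK_WAIT_MAX_RETRIES change in the diff above stretches the worst-case time that dw_pcie_wait_for_link() spends polling. A rough calculation using only the constants visible in the diff (the helper name link_wait_ms is made up for illustration, and the time spent in the added dev_info() calls is ignored):

```shell
# usleep_range() bounds from pcie-designware.h, in microseconds
LINK_WAIT_USLEEP_MIN=90000
LINK_WAIT_USLEEP_MAX=100000

# Prints the min and max total wait in milliseconds for a given retry count.
# link_wait_ms is a hypothetical helper, not a kernel or NVIDIA tool.
link_wait_ms() {
    local retries="$1"
    echo "$((retries * LINK_WAIT_USLEEP_MIN / 1000)) $((retries * LINK_WAIT_USLEEP_MAX / 1000))"
}

link_wait_ms 10    # original value: prints "900 1000"
link_wait_ms 100   # patched value:  prints "9000 10000"
```

So the patched kernel can wait roughly 9 to 10 seconds for the link instead of about 1 second before printing "Phy link never came up".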

The problem is as follows: I expected to catch the case where the AQC113 10G network card is not enumerated by the kernel, after which I would end up in an interactive shell on the debug serial console. But I never hit that case; instead the boot got stuck. The last output from the kernel is:

[   29.207841] pcieport 0008:00:00.0: Adding to iommu group 8
[   29.207922] pcieport 0008:00:00.0: PME: Signaling with IRQ 186
[   29.208222] pcieport 0008:00:00.0: AER: enabled with IRQ 186

Compared to a normal boot, what should be printed next is:

[    7.922666] systemd[1]: systemd 249.11-0ubuntu3.11 running in system mode (+PAM +AUDIT +SELINUX +APPARMOR +IMA +SMAC)
[    7.922979] systemd[1]: Detected architecture arm64.
[    8.029953] systemd[1]: Hostname set to <V701>.
[    8.049820] systemd[1]: Using hardware watchdog 'NVIDIA Tegra186 WDT', version 0, device /dev/watchdog
[    8.049837] systemd[1]: Set hardware watchdog to 2min.
...

Here is the full log from the reboot:
boot_stuck.log (96.6 KB)

So if you disable 14160000.pcie then it won’t get stuck?

OK, I’ll try it and find out whether the two are related.

After testing for about 3 hours with the node disabled, the problem did not show up any more. Then I re-enabled it, and within half an hour the problem showed up again. The only things I did were:

  1. Modify the dts to disable the node.
  2. Rebuild the kernel and dtb.
  3. Push the dtb to /boot/ and /boot/dtb on the device.
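
To confirm after each boot that the intended dtb was really picked up, the node's status can be read back from the live device tree. The sketch below assumes the node is exposed at /proc/device-tree/bus@0/pcie@14160000 (the bus@0 prefix is an assumption and may differ between releases); dt_node_status is a hypothetical helper name.

```shell
# Prints the status of a device-tree node directory. A node with no
# "status" property is treated as enabled ("okay"), per the DT convention.
# dt_node_status is an illustrative helper, not an existing tool.
dt_node_status() {
    local node_dir="$1"
    if [ -f "$node_dir/status" ]; then
        # Property values are NUL-terminated strings; strip the NUL.
        tr -d '\0' < "$node_dir/status"
        echo
    else
        echo okay
    fi
}

# Assumed path, adjust to the actual tree:
#   dt_node_status /proc/device-tree/bus@0/pcie@14160000
```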

Can this device be tested on an Orin NX devkit?

Sorry, we do not have an Orin NX devkit.

I assume you are asking about the devkit in order to:

  1. confirm that the module is OK.
  2. check whether the problem comes from the design of the custom board.

Is that right?

But we do have another custom board which fails to enumerate the 10G network card less often. We also have another NX module. I’ll do cross-validation.

Besides, do you have any other suggestions?

The weird part in your case is that the hang happens right after:

[   29.207841] pcieport 0008:00:00.0: Adding to iommu group 8
[   29.207922] pcieport 0008:00:00.0: PME: Signaling with IRQ 186
[   29.208222] pcieport 0008:00:00.0: AER: enabled with IRQ 186

But this is PCIe C8, which has nothing to do with 14160000.pcie.

Hi Wayne, we are still testing on r36.2, but this time on another custom board, which fails to enumerate the 14160000 PCIe device less often. I also removed the kernel patch that retries 100 times. After about 120 reboots, “tegra194-pcie 14160000.pcie: Phy link never came up” showed up. Since the check in .profile failed, the board dropped into the shell instead of rebooting, and after some time (I did not print any log, so I don’t know how long it took) the system hung. Here are the logs.

last_boot_with_stuck_problem.log (79.0 KB)

full_reboot_test.log (9.4 MB) (including the above log)
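
With a multi-megabyte serial log covering many boots, it helps to find which boot contained the link failure. A minimal sketch, assuming the capture includes the kernel's first message "Booting Linux on physical CPU" for every boot (if the console misses early messages, the numbering will be off); failed_boots is a hypothetical helper name.

```shell
# Prints the 1-based boot number of every boot whose log contains the
# link-training failure. failed_boots is an illustrative name.
failed_boots() {
    awk 'BEGIN { boot = 0 }
         /Booting Linux on physical CPU/ { boot++ }
         /Phy link never came up/        { print boot }' "$1"
}

# Example (file name taken from the attachment above):
#   failed_boots full_reboot_test.log
```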

It looks like nothing gets printed there. Is the watchdog triggered when such a hang happens?

Do you mean the watchdog on the module? If so, no, the watchdog was not triggered.

Sorry for the late response.
Is this still an issue that needs support? Are there any results you can share?

Since the original problem is enumerating the PCIe device, we are trying to change the power-on timing. Could you give us more time on this topic? I will reply when there is any new progress. Or is there an “on hold” state that can be set for the topic?

Cannot reproduce after about 4000 reboot tests. It may be related to this problem:

https://forums.developer.nvidia.com/t/orin-nx-hangs-on-optee-r35-5-0/286262/9
