Boot stuck while enumerating NVMe via PFX switch, seems to be PCIe driver issue

Hi @kayccc , we are still investigating the issue. It hasn’t been resolved.

Please update your status periodically; otherwise we will close the topic.

What I have observed is that the reported memory region does not change when the device is behind a switch.

I ran the following tests:

  1. Apacer NVMe behind a switch and memory region is showing as
    pci 0005:03:00.0: reg 0x10: [mem 0x1f42c00000-0x1f42c03fff 64bit]

  2. Micron NVMe behind a switch and memory region is showing as
    pci 0005:03:00.0: reg 0x10: [mem 0x00000000-0x00003fff 64bit]

  3. Apacer NVMe on devkit and memory region is showing as
    pci 0005:01:00.0: reg 0x10: [mem 0x00000000-0x00003fff 64bit]

Can you please help me understand why the Micron NVMe memory range address is not updated when the drive is behind a switch? What could be the root cause?

Thanks

Hi,

Could you focus on replying to our previous question first?

We will check your question and reply later.

We also need dmesg/uart log for all these 3 cases.

  1. Apacer NVMe behind a switch and memory region is showing as
    pci 0005:03:00.0: reg 0x10: [mem 0x1f42c00000-0x1f42c03fff 64bit] -> Working case
  2. Micron NVMe behind a switch and memory region is showing as
    pci 0005:03:00.0: reg 0x10: [mem 0x00000000-0x00003fff 64bit] -> NG case
  3. Apacer NVMe on devkit and memory region is showing as
    pci 0005:01:00.0: reg 0x10: [mem 0x00000000-0x00003fff 64bit] -> Working case

boot-log-gen3-with-prints-6-Feb-2024-Apacer.txt (121.7 KB)
boot-log-gen3-with-prints-6-Feb-2024-Micron.txt (84.8 KB)
dmesg_devkit_apacer.txt (135.3 KB)
Here are the boot logs for all 3 cases.

Hi,

We just double-checked with our internal team. The Micron SSD is actually behaving correctly in these logs.

Please focus on switch and our previous question first.

The logs I have sent are after adding prints in pci_fixup_device and pci_do_fixup function.

Please share what code/print you added there so that we can do comparison.

Apologies, fixup related logs are not added. I did not find much information in that function.

Hi @WayneWWW, can you please help with what kind of quirks need to be added for either the switch or the NVMe?

No, we don’t know. We don’t have the kind of switch you are using, so we rely entirely on the logs you share to investigate this bug.

Maybe you could check with the switch vendor and tell them about this situation you saw on Kernel 5.10.

BTW, what information do you need for fixup function? I mean you added something but it didn’t show up?

What we need you to do is compare the results in quirks.c between rel-32 and rel-35.

[ 10.508594] pci 0004:01:00.0: Inside pci_fixup_device
[ 10.513623] pci 0004:01:00.0: Inside pci_do_fixups

kernel/kernel-5.10/drivers/pci/quirks.c

In case it’s unclear what we are asking: compare rel-32 and rel-35 for each fixup that runs in “pci_fixup_device”.

For those differences, check if any of them match the switch/nvme vendor & device ids. That is the key quirk.

Hi @WayneWWW, thanks for the clarification. I have looked over the differences, and the major difference related to the switch/NVMe vendor is below:

/*
 * The Samsung SM961/PM961 controller can sometimes enter a fatal state after
 * FLR where config space reads from the device return -1.  We seem to be
 * able to avoid this condition if we disable the NVMe controller prior to
 * FLR.  This quirk is generic for any NVMe class device requiring similar
 * assistance to quiesce the device prior to FLR.
 *
 * NVMe specification: https://nvmexpress.org/resources/specifications/
 * Revision 1.0e:
 *    Chapter 2: Required and optional PCI config registers
 *    Chapter 3: NVMe control registers
 *    Chapter 7.3: Reset behavior
 */
static int nvme_disable_and_flr(struct pci_dev *dev, int probe)
{
	void __iomem *bar;
	u16 cmd;
	u32 cfg;
#ifdef PCI_DRIVER_DEBUG
	pci_info(dev, "Inside nvme_disable_and_flr:: vendor=%x device=%x probe=%d\n", dev->vendor, dev->device, probe);
#endif
	if (dev->class != PCI_CLASS_STORAGE_EXPRESS ||
	    !pcie_has_flr(dev) || !pci_resource_start(dev, 0))
		return -ENOTTY;

	if (probe)
		return 0;

	bar = pci_iomap(dev, 0, NVME_REG_CC + sizeof(cfg));
	if (!bar)
		return -ENOTTY;

	pci_read_config_word(dev, PCI_COMMAND, &cmd);
	pci_write_config_word(dev, PCI_COMMAND, cmd | PCI_COMMAND_MEMORY);

	cfg = readl(bar + NVME_REG_CC);

	/* Disable controller if enabled */
	if (cfg & NVME_CC_ENABLE) {
		u32 cap = readl(bar + NVME_REG_CAP);
		unsigned long timeout;

		/*
		 * Per nvme_disable_ctrl() skip shutdown notification as it
		 * could complete commands to the admin queue.  We only intend
		 * to quiesce the device before reset.
		 */
		cfg &= ~(NVME_CC_SHN_MASK | NVME_CC_ENABLE);

		writel(cfg, bar + NVME_REG_CC);

		/*
		 * Some controllers require an additional delay here, see
		 * NVME_QUIRK_DELAY_BEFORE_CHK_RDY.  None of those are yet
		 * supported by this quirk.
		 */

		/* Cap register provides max timeout in 500ms increments */
		timeout = ((NVME_CAP_TIMEOUT(cap) + 1) * HZ / 2) + jiffies;

		for (;;) {
			u32 status = readl(bar + NVME_REG_CSTS);

			/* Ready status becomes zero on disable complete */
			if (!(status & NVME_CSTS_RDY))
				break;

			msleep(100);

			if (time_after(jiffies, timeout)) {
				pci_warn(dev, "Timeout waiting for NVMe ready status to clear after disable\n");
				break;
			}
		}
	}

	pci_iounmap(dev, bar);

	pcie_flr(dev);

	return 0;
}

And the switch-vendor-related quirk is here:

/*
 * Microsemi Switchtec NTB uses devfn proxy IDs to move TLPs between
 * NT endpoints via the internal switch fabric. These IDs replace the
 * originating requestor ID TLPs which access host memory on peer NTB
 * ports. Therefore, all proxy IDs must be aliased to the NTB device
 * to permit access when the IOMMU is turned on.
 */
static void quirk_switchtec_ntb_dma_alias(struct pci_dev *pdev)
{
	void __iomem *mmio;
	struct ntb_info_regs __iomem *mmio_ntb;
	struct ntb_ctrl_regs __iomem *mmio_ctrl;
	u64 partition_map;
	u8 partition;
	int pp;

	if (pci_enable_device(pdev)) {
		pci_err(pdev, "Cannot enable Switchtec device\n");
		return;
	}

	mmio = pci_iomap(pdev, 0, 0);
	if (mmio == NULL) {
		pci_disable_device(pdev);
		pci_err(pdev, "Cannot iomap Switchtec device\n");
		return;
	}

	pci_info(pdev, "Setting Switchtec proxy ID aliases\n");

	mmio_ntb = mmio + SWITCHTEC_GAS_NTB_OFFSET;
	mmio_ctrl = (void __iomem *) mmio_ntb + SWITCHTEC_NTB_REG_CTRL_OFFSET;

	partition = ioread8(&mmio_ntb->partition_id);

	partition_map = ioread32(&mmio_ntb->ep_map);
	partition_map |= ((u64) ioread32(&mmio_ntb->ep_map + 4)) << 32;
	partition_map &= ~(1ULL << partition);

	for (pp = 0; pp < (sizeof(partition_map) * 8); pp++) {
		struct ntb_ctrl_regs __iomem *mmio_peer_ctrl;
		u32 table_sz = 0;
		int te;

		if (!(partition_map & (1ULL << pp)))
			continue;

		pci_dbg(pdev, "Processing partition %d\n", pp);

		mmio_peer_ctrl = &mmio_ctrl[pp];

		table_sz = ioread16(&mmio_peer_ctrl->req_id_table_size);
		if (!table_sz) {
			pci_warn(pdev, "Partition %d table_sz 0\n", pp);
			continue;
		}

		if (table_sz > 512) {
			pci_warn(pdev,
				 "Invalid Switchtec partition %d table_sz %d\n",
				 pp, table_sz);
			continue;
		}

		for (te = 0; te < table_sz; te++) {
			u32 rid_entry;
			u8 devfn;

			rid_entry = ioread32(&mmio_peer_ctrl->req_id_table[te]);
			devfn = (rid_entry >> 1) & 0xFF;
			pci_dbg(pdev,
				"Aliasing Partition %d Proxy ID %02x.%d\n",
				pp, PCI_SLOT(devfn), PCI_FUNC(devfn));
			pci_add_dma_alias(pdev, devfn, 1);
		}
	}

	pci_iounmap(pdev, mmio);
	pci_disable_device(pdev);
}
#define SWITCHTEC_QUIRK(vid) \
	DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_MICROSEMI, vid, \
		PCI_CLASS_BRIDGE_OTHER, 8, quirk_switchtec_ntb_dma_alias)

SWITCHTEC_QUIRK(0x8531);  /* PFX 24xG3 */
SWITCHTEC_QUIRK(0x8532);  /* PFX 32xG3 */
SWITCHTEC_QUIRK(0x8533);  /* PFX 48xG3 */
SWITCHTEC_QUIRK(0x8534);  /* PFX 64xG3 */
SWITCHTEC_QUIRK(0x8535);  /* PFX 80xG3 */
SWITCHTEC_QUIRK(0x8536);  /* PFX 96xG3 */
SWITCHTEC_QUIRK(0x8541);  /* PSX 24xG3 */
SWITCHTEC_QUIRK(0x8542);  /* PSX 32xG3 */
SWITCHTEC_QUIRK(0x8543);  /* PSX 48xG3 */
SWITCHTEC_QUIRK(0x8544);  /* PSX 64xG3 */
SWITCHTEC_QUIRK(0x8545);  /* PSX 80xG3 */
SWITCHTEC_QUIRK(0x8546);  /* PSX 96xG3 */
SWITCHTEC_QUIRK(0x8551);  /* PAX 24XG3 */
SWITCHTEC_QUIRK(0x8552);  /* PAX 32XG3 */
SWITCHTEC_QUIRK(0x8553);  /* PAX 48XG3 */
SWITCHTEC_QUIRK(0x8554);  /* PAX 64XG3 */
SWITCHTEC_QUIRK(0x8555);  /* PAX 80XG3 */
SWITCHTEC_QUIRK(0x8556);  /* PAX 96XG3 */
SWITCHTEC_QUIRK(0x8561);  /* PFXL 24XG3 */
SWITCHTEC_QUIRK(0x8562);  /* PFXL 32XG3 */
SWITCHTEC_QUIRK(0x8563);  /* PFXL 48XG3 */
SWITCHTEC_QUIRK(0x8564);  /* PFXL 64XG3 */
SWITCHTEC_QUIRK(0x8565);  /* PFXL 80XG3 */
SWITCHTEC_QUIRK(0x8566);  /* PFXL 96XG3 */
SWITCHTEC_QUIRK(0x8571);  /* PFXI 24XG3 */
SWITCHTEC_QUIRK(0x8572);  /* PFXI 32XG3 */
SWITCHTEC_QUIRK(0x8573);  /* PFXI 48XG3 */
SWITCHTEC_QUIRK(0x8574);  /* PFXI 64XG3 */
SWITCHTEC_QUIRK(0x8575);  /* PFXI 80XG3 */
SWITCHTEC_QUIRK(0x8576);  /* PFXI 96XG3 */
SWITCHTEC_QUIRK(0x4000);  /* PFX 100XG4 */
SWITCHTEC_QUIRK(0x4084);  /* PFX 84XG4  */
SWITCHTEC_QUIRK(0x4068);  /* PFX 68XG4  */
SWITCHTEC_QUIRK(0x4052);  /* PFX 52XG4  */
SWITCHTEC_QUIRK(0x4036);  /* PFX 36XG4  */
SWITCHTEC_QUIRK(0x4028);  /* PFX 28XG4  */
SWITCHTEC_QUIRK(0x4100);  /* PSX 100XG4 */
SWITCHTEC_QUIRK(0x4184);  /* PSX 84XG4  */
SWITCHTEC_QUIRK(0x4168);  /* PSX 68XG4  */
SWITCHTEC_QUIRK(0x4152);  /* PSX 52XG4  */
SWITCHTEC_QUIRK(0x4136);  /* PSX 36XG4  */
SWITCHTEC_QUIRK(0x4128);  /* PSX 28XG4  */
SWITCHTEC_QUIRK(0x4200);  /* PAX 100XG4 */
SWITCHTEC_QUIRK(0x4284);  /* PAX 84XG4  */
SWITCHTEC_QUIRK(0x4268);  /* PAX 68XG4  */
SWITCHTEC_QUIRK(0x4252);  /* PAX 52XG4  */
SWITCHTEC_QUIRK(0x4236);  /* PAX 36XG4  */
SWITCHTEC_QUIRK(0x4228);  /* PAX 28XG4  */

Do you mean the above ran on JetPack 4 but not on JetPack 5?

Could you elaborate on how you found these?

These are present in JetPack 5 and not in JetPack 4.

Hi,

Just to clarify: I am asking you to check pci_fixup_device and pci_do_fixups directly. Compare the cases that run inside them and find which quirks differ between JetPack 4 and JetPack 5.

This needs to be checked at runtime, not just by reading and comparing code.
Are you saying you checked at runtime and these were executed on JetPack 5 but not on JetPack 4?

Is there anything that was executed on JetPack 4 but not on JetPack 5?

Hello,

Could you give a quick answer to the questions above? If you are still running tests, please tell us that as well.

We need to keep this topic active; otherwise the system will close it.

Hi @WayneWWW,

Sorry, we were off for President’s Day. I am still looking into it and will respond in a day or so, thank you for understanding.

Hi @WayneWWW,

After adding print code to the fixup functions, I checked the boot log for quirks used while enumerating the Micron NVMe, but found none. The only quirk that showed up was the one I mentioned in my previous comments (nvme_disable_and_flr).

quirk_switchtec_ntb_dma_alias is being called after enumeration of pcie devices at DECLARE_PCI_FIXUP_CLASS_FINAL.

What can we try in terms of adding quirks for the switch/NVMe? Would it be related to BAR addressing, a PCIe capabilities workaround, etc.?

Thanks