PCI MSI interrupt generation issue in Jetson

Hello experts,
We have an FPGA connected as an endpoint device to an NVIDIA Jetson Xavier NX, and we have developed a PCIe driver for it on the Jetson.

We are using a single MSI interrupt, which the FPGA sends to the Jetson when it has finished transferring data to the Jetson.

In our driver code, we register the interrupt vector in the probe function as shown below (the for loop is there to support multiple vectors, but we currently use only one):

static int fpga_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	...
	...
	/* PCI interrupt handling */
	ret = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_MSI);

	if (ret < 0)
	{
		dev_err(drvdata->dev, "pci_alloc_irq_vectors failed\n");
		goto pci_err_int;
	}
	else
	{
		dev_info(drvdata->dev, "Number of vectors assigned: %d\n", ret);
	}

	irq = pci_irq_vector(pdev, 0);

	for (i = 0; i < 1; i++)
	{
		ret = request_irq((irq + i), irq_handler, IRQF_SHARED, "fpga", drvdata);

		if (ret < 0)
		{
			dev_err(drvdata->dev, "request_irq failed\n");
			goto pci_err_int;
		}
		else
		{
			dev_info(drvdata->dev, "IRQ: %d\n", (irq + i));
		}
	}
	...
	...
}

One vector is assigned successfully.
The interrupt is also registered successfully, as we can see in /proc/interrupts:

811:       0          0          0          0   PCI-MSI    0 Edge      fpga

The registered IRQ handler, which should be called when the endpoint (FPGA) raises the MSI interrupt, is defined as shown below:

static irqreturn_t irq_handler(int irq, void *cookie)
{
	pr_info("====> IRQ Handled: IRQ number %d\n", irq);
	return IRQ_HANDLED;
}

lspci’s verbose output is shown below:

root@user:/home/ubuntu# lspci -vvv
0005:01:00.0 Unassigned class [ff00]: Altera Corporation Device e001
	Subsystem: Altera Corporation Device e001
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 811
	Region 0: Memory at 1c00000000 (64-bit, prefetchable) [size=32K]
	Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit+
		Address: 00000000fffff000  Data: 0000
	Capabilities: [78] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [80] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
		LnkCap:	Port #1, Speed 5GT/s, Width x2, ASPM not supported, Exit Latency L0s <4us, L1 <1us
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 5GT/s, Width x2, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Not Supported, TimeoutDis+, LTR-, OBFF Not Supported
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
		LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [100 v1] Virtual Channel
		Caps:	LPEVC=0 RefClk=100ns PATEntryBits=1
		Arb:	Fixed- WRR32- WRR64- WRR128-
		Ctrl:	ArbSelect=Fixed
		Status:	InProgress-
		VC0:	Caps:	PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
			Arb:	Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
			Ctrl:	Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
			Status:	NegoPending- InProgress-
	Capabilities: [200 v1] Vendor Specific Information: ID=1172 Rev=0 Len=044 <?>
	Kernel driver in use: fpga

The config space:

root@user:/home/ubuntu# lspci -xxx
0005:01:00.0 Unassigned class [ff00]: Altera Corporation Device e001
00: 72 11 01 e0 06 04 10 00 00 00 00 ff 00 00 00 00
10: 0c 00 00 00 1c 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 72 11 01 e0
30: 00 00 00 00 50 00 00 00 00 00 00 00 00 01 00 00
40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
50: 05 78 81 00 00 f0 ff ff 00 00 00 00 00 00 00 00
60: 00 00 00 00 00 00 00 00 11 78 00 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 01 80 03 00 00 00 00 00
80: 10 00 02 00 01 80 00 00 30 28 00 00 22 60 40 01
90: 00 00 22 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 10 00 10 00 00 00 00 00 06 00 00 00
b0: 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

When the FPGA is done transferring the data to the DMA addresses allocated by the Jetson, it writes the data (0x0000) to the MSI address (0x00000000fffff000), both of which are shown in lspci's output as well as in the config space.
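As a cross-check of those two values, the MSI capability at offset 0x50 of the config-space dump can be decoded by hand. The following is a minimal user-space sketch, not driver code: the names `msi_cap`, `decode_msi_cap` and `fpga_msi_cap_bytes` are hypothetical and exist only for this illustration, and the layout assumed is the standard PCI MSI capability (Message Control at +2, address at +4, data at +8 or +12 depending on the 64-bit flag).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MSI_CAP_ID       0x05    /* PCI capability ID for MSI */
#define MSI_CTRL_ENABLE  0x0001  /* Message Control bit 0: MSI enable */
#define MSI_CTRL_64BIT   0x0080  /* Message Control bit 7: 64-bit address capable */

/* Bytes 0x50..0x5f copied from the "lspci -xxx" dump (hypothetical array name). */
const uint8_t fpga_msi_cap_bytes[16] = {
    0x05, 0x78, 0x81, 0x00, 0x00, 0xf0, 0xff, 0xff,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
};

/* Decoded view of one MSI capability (illustrative struct, not a kernel type). */
struct msi_cap {
    uint8_t  next;      /* config-space offset of the next capability */
    uint16_t control;   /* raw Message Control register */
    int      enabled;   /* 1 if the MSI enable bit is set */
    int      is_64bit;  /* 1 if the 64-bit address layout is used */
    uint64_t address;   /* message address the endpoint must write to */
    uint16_t data;      /* message data the endpoint must write */
};

/* Config space is little-endian; read 16 and 32 bits byte by byte. */
static uint16_t rd16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Decode the capability starting at cfg[off]. Returns 0 on success,
 * -1 if the capability ID at that offset is not MSI. */
int decode_msi_cap(const uint8_t *cfg, size_t off, struct msi_cap *out)
{
    assert(cfg != NULL && out != NULL);

    if (cfg[off] != MSI_CAP_ID)
        return -1;

    out->next     = cfg[off + 1];
    out->control  = rd16(cfg + off + 2);
    out->enabled  = (out->control & MSI_CTRL_ENABLE) != 0;
    out->is_64bit = (out->control & MSI_CTRL_64BIT) != 0;
    out->address  = rd32(cfg + off + 4);

    if (out->is_64bit) {
        /* 64-bit layout: upper address at +8, data at +12 */
        out->address |= (uint64_t)rd32(cfg + off + 8) << 32;
        out->data = rd16(cfg + off + 12);
    } else {
        /* 32-bit layout: data directly follows the address */
        out->data = rd16(cfg + off + 8);
    }
    return 0;
}
```

Calling `decode_msi_cap(fpga_msi_cap_bytes, 0, &cap)` on these bytes gives Message Control 0x0081 (enabled, 64-bit layout), address 0x00000000fffff000 and data 0x0000, which agrees with the "MSI: Enable+ ... 64bit+" line in the lspci output.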

But the Jetson does not receive the interrupt. The counter values in /proc/interrupts stay at 0, and irq_handler is never executed.

We think the issue is probably on the Jetson side, since we can see in the FPGA's software that it is writing the data (0x0000) to the MSI address (0x00000000fffff000).

Is there anything missing in the driver code or in the kernel related to MSI interrupt handling?

Our team will do the investigation and provide suggestions soon. Thanks

Hi Kay,
Any update on this issue?

Hi,
I was tracing the interrupt registration logic in the kernel and found the following.
/proc/interrupts shows the entries below. IRQs 33 and 34 are for the PCIE-4 slot; 35 and 36 are for the PCIE-5 slot, where our FPGA is connected.

33:          1          0          0          0     GICv2   83 Level     tegra-pcie-intr, PCIe PME, aerdrv
34:      13553          0          0          0     GICv2   84 Level     tegra-pcie-msi
35:          1          0          0          0     GICv2   85 Level     tegra-pcie-intr, PCIe PME, aerdrv
36:          0          0          0          0     GICv2   86 Level     tegra-pcie-msi
555:     13553          0          0          0   PCI-MSI    0 Edge      rtl88x2ce
811:         0          0          0          0   PCI-MSI    0 Edge      fpga

IRQ 33 and IRQ 35 are triggered only once.
Question 1: Why are they triggered only once? The device tree says it's a controller interrupt, but what is it exactly?

IRQ 34 and IRQ 555 increment whenever the Wi-Fi chip connected to that slot sends an MSI interrupt, but IRQ 36 and IRQ 811 have not incremented even once.
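To watch these counters from a test program instead of reading the file by eye, the per-CPU columns of a /proc/interrupts line can be summed. This is only a sketch under assumptions: `irq_line_total` is a hypothetical helper name, and it assumes the line format shown above (an "<irq>:" prefix, then one count per CPU, then a chip name that does not start with a digit).

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Sum the per-CPU interrupt counts on one /proc/interrupts line.
 * Returns -1 if the line has no "<irq>:" prefix. Parsing stops at the
 * first token that is not a plain decimal number (e.g. "GICv2", "PCI-MSI"),
 * so the hardware IRQ number after the chip name is not counted. */
long long irq_line_total(const char *line)
{
    const char *p = strchr(line, ':');
    long long total = 0;

    if (p == NULL)
        return -1;
    p++;  /* skip the ':' */

    for (;;) {
        char *end;
        unsigned long long count;

        /* skip the padding between columns */
        while (*p == ' ' || *p == '\t')
            p++;
        if (!isdigit((unsigned char)*p))
            break;  /* reached the chip name or end of line */

        count = strtoull(p, &end, 10);
        /* a count column must be followed by whitespace or end of line */
        if (*end != ' ' && *end != '\t' && *end != '\n' && *end != '\0')
            break;

        total += (long long)count;
        p = end;
    }
    return total;
}
```

Applied to the two PCI-MSI lines above, the helper returns 13553 for the rtl88x2ce line and 0 for the fpga line. Reading the line again after the FPGA signals completion and comparing the totals shows whether any MSI was delivered.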

In kernel/nvidia/drivers/pci/dwc/pcie-tegra.c, I checked tegra_pcie_dw_probe() and found that it registers tegra-pcie-intr (IRQ 33 and IRQ 35) with tegra_pcie_irq_handler() as the handler, which, as I said, is called only once. That handler calls tegra_pcie_rp_irq_handler(), which performs some error checking.

tegra_pcie_dw_probe() then calls tegra_pcie_config_rp(), which registers tegra-pcie-msi (IRQ 34 and IRQ 36) with tegra_pcie_msi_irq_handler() as the handler. tegra_pcie_msi_irq_handler() in turn calls dw_handle_msi_irq() (in kernel/kernel-4.9/drivers/pci/dwc/pcie-designware-host.c), which executes the IRQ handler registered by the device driver (IRQ 555 and IRQ 811). This path appears to be followed normally for IRQ 34 and IRQ 555 (i.e., PCIE-4, the Wi-Fi slot), but in our case, which would be IRQ 36 and IRQ 811, tegra_pcie_msi_irq_handler() is not called even once.
When I print the value of pp->irq in dw_handle_msi_irq(), it prints 33 but never 35 (I guess because dw_handle_msi_irq() is never executed in our case).

Question 2: Does it mean the interrupt isn’t reaching the root port/PCI bridge itself? Any way to trace it even further than this?

In one experiment, instead of using the MSI address provided in the config space, we transmitted the MSI data directly to 0x00000000. Usually, whenever the FPGA tries to write to an arbitrary memory address that has not been allocated for DMA, a context fault is triggered, like this:

[  447.939864] t19x-arm-smmu 12000000.iommu: Unhandled context fault: smmu0, iova=0x00000000, fsynr=0x140011, cb=1, sid=91(0x5b - PCIE5), pgd=0, pud=0, pmd=0, pte=0

But using 0x00000000 as the MSI address to transmit the MSI data did not trigger this.

Question 3: Since an MSI is just a simple posted write, wouldn't it trigger a context fault if the FPGA had sent the MSI data to 0x00000000?

We even tried enabling a 64-bit MSI address in dw_pcie_msi_init() (in kernel/kernel-4.9/drivers/pci/dwc/pcie-designware-host.c) by replacing

err = dma_set_coherent_mask(dev, DMA_BIT_MASK(32));

with

err = dma_set_coherent_mask(dev, DMA_BIT_MASK(64));

Question 4: Is there anything missing in driver code for registering MSI interrupt as shown in the first post?

Thanks,
Meet

Hi meet.patel & NVIDIA

We are struggling with a similar issue; thanks for posting your work.
I was wondering, did you test the Wi-Fi module on the PCIE-5 bus?
If not, I can do this and see whether MSI interrupts work on the PCIE-5 bus with the Wi-Fi module.

Our FPGA module wasn't recognized on the PCIE-4 bus.
I spent a lot of time trying to figure out why, but didn't find any issue.
On the PCIE-5 bus, the module was recognized instantly.

Thanks,

Hans

Hi meet.patel,

I was able to fix the issue by calling pci_set_master() after the request_irq() call.

Regards,

Hans

Hi Hans,

We tested calling pci_set_master() after request_irq(), but the counter values in /proc/interrupts still stay at 0 when the FPGA writes the data (0x0000) to the MSI address (0x00000000fffff000).
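For context, pci_set_master() sets the Bus Master bit in the PCI Command register, and the lspci output in the first post already shows "BusMaster+" for this device. The sketch below decodes the Command register bytes from the first row of that "lspci -xxx" dump. It is a user-space illustration only: the names `fpga_cfg_header`, `read_command` and `bus_master_enabled` are hypothetical, and whether the dump was taken before or after the pci_set_master() change is not known.

```c
#include <stdint.h>

#define CMD_OFFSET        0x04    /* Command register offset in config space */
#define CMD_MEMORY_SPACE  0x0002  /* bit 1: memory space decoding enabled */
#define CMD_BUS_MASTER    0x0004  /* bit 2: device may initiate bus transactions */
#define CMD_INTX_DISABLE  0x0400  /* bit 10: legacy INTx assertion disabled */

/* Bytes 0x00..0x0f copied from the "lspci -xxx" dump in the first post
 * (hypothetical array name). */
const uint8_t fpga_cfg_header[16] = {
    0x72, 0x11, 0x01, 0xe0, 0x06, 0x04, 0x10, 0x00,
    0x00, 0x00, 0x00, 0xff, 0x00, 0x00, 0x00, 0x00
};

/* Read the little-endian 16-bit Command register. */
uint16_t read_command(const uint8_t *cfg)
{
    return (uint16_t)(cfg[CMD_OFFSET] | (cfg[CMD_OFFSET + 1] << 8));
}

/* Returns 1 if the Bus Master bit is set, 0 otherwise. */
int bus_master_enabled(const uint8_t *cfg)
{
    return (read_command(cfg) & CMD_BUS_MASTER) != 0;
}
```

For these bytes the Command register is 0x0406, so Memory Space, Bus Master and INTx Disable are all set, matching "Mem+ BusMaster+ ... DisINTx+" in the verbose lspci output. If bus mastering was already enabled when that dump was taken, an extra pci_set_master() call would not be expected to change the behaviour.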

Can you please suggest what else we can check, or what the issue could be?

Let us know if you need more information.

Thanks,
Meet

Hi,

I don’t see any reason for MSI not to work if the FPGA sends a memory write to Address: 00000000fffff000 with Data: 0000.
Can you double-check that the FPGA is indeed sending this memory write request? Can you capture LA traces and confirm?

Question 1: Why are they triggered only once? The device tree says it's a controller interrupt, but what is it exactly?
IRQ 33 and 35 are the PCIe system interrupt and legacy INTx; you can ignore them for MSI.
For MSI, you only have to consider IRQ 36.

Question 2: Does it mean the interrupt isn’t reaching the root port/PCI bridge itself? Any way to trace it even further than this?
We have to capture LA traces to check whether the memory write is on the bus or not.

Question 3: Since MSI is just a simple posted write, wouldn’t it trigger context fault if FPGA has sent MSI data to 0x00000000?
It will trigger an SMMU fault.

Question 4: Is there anything missing in driver code for registering MSI interrupt as shown in the first post?
Driver code looks fine. Since you are using only one MSI, try pci_enable_msi().

Thanks,
Manikanta

Hi Manikanta,
Thanks for the reply.
It turns out that some configuration changes were required in the FPGA for the MSI interrupt to work.
I also tried pci_enable_msi(), but since the documentation says that API is deprecated in newer kernel versions, I have implemented interrupt handling using pci_alloc_irq_vectors() and pci_irq_vector().
Thanks for clearing up my doubts regarding the rest of the queries.
Regards,
Meet
