Jetson Crashes after PCIe operation

I have a PCIe card with a Custom DMA. When I am running some test applications on Jetson, which are working in x86, I am getting the following error log and the system is getting crashed.
I have set nvpmodel mode to 0 and enabled jetson_clocks.

sudo nvpmodel -m 0
sudo ./jetson_clocks.sh

Error log

[  +0.000001] pcieport 0000:00:01.0: AER: Multiple Uncorrected (Fatal) error received: id=0020
[ +11.657250] Watchdog detected hard LOCKUP on cpu 0
[  +0.004644] ------------[ cut here ]------------
[  +0.004822] WARNING: at ffffffc000142ef0 [verbose debug info unavailable]
[  +0.006793] Modules linked in: mycard(O) fuse ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter ip_tables xt_conntrack nf_nat br_netfilter overlay bcmdhd pci_tegra bluedroid_pm

[  +0.020902] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G           O    4.4.38 #1
[  +0.007145] Hardware name: quill (DT)
[  +0.003676] task: ffffffc1ece83e80 ti: ffffffc1ecea0000 task.ti: ffffffc1ecea0000
[  +0.007498] PC is at watchdog_timer_fn+0x200/0x360
[  +0.004802] LR is at watchdog_timer_fn+0x200/0x360
[  +0.004800] pc : [<ffffffc000142ef0>] lr : [<ffffffc000142ef0>] pstate: 600001c5
[  +0.007397] sp : ffffffc1ecea3af0
[  +0.003323] x29: ffffffc1ecea3af0 x28: ffffffc001282130 
[  +0.005347] x27: ffffffc00124e250 x26: ffffffc001281b10 
[  +0.005346] x25: ffffffc001281000 x24: ffffffc1ecea3dc0 
[  +0.005340] x23: 0000000000000000 x22: 000000000000022b 
[  +0.005342] x21: ffffffc1f5ff1278 x20: ffffffc00124e2d8 
[  +0.005343] x19: ffffffc00124e000 x18: 0000000000000000 
[  +0.005342] x17: 000000000000a202 x16: 0000000000000001 
[  +0.005343] x15: 0000000000000010 x14: ffffffc08144038f 
[  +0.005341] x13: ffffffc00144039d x12: 0000000000000005 
[  +0.005341] x11: ffffffc1ecea3af0 x10: 0000000005f5e0ff 
[  +0.005344] x9 : 0000000000000454 x8 : ffffffc0002e1720 
[  +0.005342] x7 : ffffffc0012c3c88 x6 : 0000000000000035 
[  +0.005339] x5 : 0000000000000000 x4 : 0000000000000000 
[  +0.005340] x3 : 0000000000000000 x2 : ffffffc1ecea0000 
[  +0.005339] x1 : 0000000000010001 x0 : 0000000000000026 

[  +0.007431] ---[ end trace 4bb1ceebc1a9b188 ]---
[  +0.004623] Call trace:
[  +0.002466] [<ffffffc000142ef0>] watchdog_timer_fn+0x200/0x360
[  +0.005853] [<ffffffc000109f04>] __hrtimer_run_queues+0x13c/0x370
[  +0.006104] [<ffffffc00010a7f0>] hrtimer_interrupt+0xa0/0x1d8
[  +0.005762] [<ffffffc00094bb3c>] tegra186_timer_isr+0x24/0x30
[  +0.005756] [<ffffffc0000f6b1c>] handle_irq_event_percpu+0x6c/0x2a0
[  +0.006274] [<ffffffc0000f6d98>] handle_irq_event+0x48/0x78
[  +0.005583] [<ffffffc0000fa330>] handle_fasteoi_irq+0xb8/0x1b0
[  +0.005840] [<ffffffc0000f614c>] generic_handle_irq+0x24/0x38
[  +0.005751] [<ffffffc0000f644c>] __handle_domain_irq+0x5c/0xb8
[  +0.005843] [<ffffffc000081774>] gic_handle_irq+0x64/0xc0
[  +0.005407] [<ffffffc000084740>] el1_irq+0x80/0xf8
[  +0.004805] [<ffffffc00082f410>] cpuidle_enter+0x18/0x20
[  +0.005323] [<ffffffc0000e907c>] call_cpuidle+0x24/0x50
[  +0.005232] [<ffffffc0000e9318>] cpu_startup_entry+0x270/0x340
[  +0.005841] [<ffffffc00008e10c>] secondary_start_kernel+0x12c/0x168
[  +0.006272] [<0000000080081adc>] 0x80081adc
[Mar27 06:09] Watchdog detected hard LOCKUP on cpu 4
[  +0.004646] ------------[ cut here ]------------
[  +0.004825] WARNING: at ffffffc000142ef0 [verbose debug info unavailable]
[  +0.006794] Modules linked in: mycard(O) fuse ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter ip_tables xt_conntrack nf_nat br_netfilter overlay bcmdhd pci_tegra bluedroid_pm

[  +0.020902] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G        W  O    4.4.38 #1
[  +0.007144] Hardware name: quill (DT)
[  +0.003679] task: ffffffc1ece82580 ti: ffffffc1ece98000 task.ti: ffffffc1ece98000
[  +0.007501] PC is at watchdog_timer_fn+0x200/0x360
[  +0.004803] LR is at watchdog_timer_fn+0x200/0x360
[  +0.004802] pc : [<ffffffc000142ef0>] lr : [<ffffffc000142ef0>] pstate: 600001c5
[  +0.007398] sp : ffffffc1ece9baf0
[  +0.003325] x29: ffffffc1ece9baf0 x28: ffffffc001282130 
[  +0.005348] x27: ffffffc00124e250 x26: ffffffc001281b10 
[  +0.005346] x25: ffffffc001281000 x24: ffffffc1ece9bdc0 
[  +0.005346] x23: 0000000000000000 x22: 0000000000000241 
[  +0.005343] x21: ffffffc1f5fcf278 x20: ffffffc00124e2d8

What would be the problem? The same setup is working flawlessly in x86.

Can we have some more info like
-> Status of ASPM enablement (would be good to see ‘sudo lspci -vvv’ output)
-> How much time does it take for the system to hang? What is the operation being done by DMA? Is it write to system memory or read from system memory? Are there any other operations performed in system when this is happening?

Output of lspci -vvv

01:00.0 Processing accelerators: Device xxxx:xxxx (rev 01)
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin A routed to IRQ 452
	Region 0: Memory at 50100000 (64-bit, non-prefetchable) [disabled] 
	Region 2: Memory at 50200000 (64-bit, non-prefetchable) [disabled] 
	Region 4: [virtual] Memory at 50300000 (32-bit, non-prefetchable) 
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1+ D2- AuxCurrent=375mA PME(D0+,D1+,D2-,D3hot+,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [70] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop-
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
		LnkCap:	Port #0, Speed 5GT/s, Width x4, ASPM L0s L1, Exit Latency L0s unlimited, L1 unlimited
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Not Supported, TimeoutDis+, LTR-, OBFF Not Supported
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
		LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [100 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
  • It will hang after a few mSec (~50mS).
  • The descriptors are written to the Card, and the card should do a read from System Memory, but the system hangs.
  • No other operations are being done at the time.

OK.
It is still difficult to understand why the system is crashing. Is it possible to share the driver with us in private? so that we can take a look at it and see if there is something in that which doesn’t work for Tegra?

We solved the crash issue by setting pci=nomsi, disabling smmu and by removing all IRQ in the driver. But still we would like to know the root cause of this issue. Why would AER crash the system? Any thoughts?
We are pinning the memory, sending the descriptors to the FPGA, Read from the memory (MRead), write to the memory (MWrite) and then Unpin the memory.
And one another thing that we noticed is that, after 2-3 successful DMA transaction, when the FPGA requests a Memory Read, the Host is not responding. The TLP and the addresses seems correct. What could be the issue?
Is there a way to debug the Memory Read Request failure? Why would the Host not respond?

AER is just a mechanism to report errors. It won’t cause any crash as such.
You mentioned that you passed ‘pci=nomsi’, ‘disabled SMMU’. Can you please do only one thing at a time and tell us which one is really helping.
BTW, I didn’t understand this part ‘by removing all IRQ in the driver’. What exactly are you doing as part of this?
This seems very specific to your driver and device combination as all the upstreamed drivers work fine on Tegra.

Sorry Vidyas, for not being clear.

Crash Issue

When pci=nomsi was set, it was not able to allocate any IRQ. That’s why we removed the IRQ as we were not using the IRQ as of now. We were just allocating and enabling the IRQ.

The crash issue was solved when we disabled the IRQ section in the driver code. It was not solved when pci=nomsi. Really sorry for not finding that soon enough.

in probe function, we does the following.

err = pci_enable_msi_range(pdev, 1, 1);
if (err < 0) {
	dev_err(dev, "Unable to allocate irq");
	goto out_pci_disable;
}

irq_set_status_flags(pdev->irq, IRQ_NOAUTOEN);
err = request_irq(pdev->irq, mycard_interrupt_handler, 0, DRV_MODULE_NAME, mycard_dev);
if (err) {
	dev_err(dev, "Irq %u in use: %d\n", pdev->irq, err);
	goto out_alloc_irq_vector;
}

enable_irq(pdev->irq);

....

out_request_irq:
	free_irq(pdev->irq, mycard_dev);
out_alloc_irq_vector:
	pci_disable_msi(pdev);
out_pci_disable:
	pci_disable_device(pdev);

When we comment out the enable_irq(pdev->irq), the system is no longer getting crashed. We are not using the IRQ for anything else. It was allocated for future implementations.

DMA Issue

The DMA was working only when the SMMU is disabled.
If the SMMU is enabled, the Memory Read from System fails.
We are doing a simple loop back using the DMA.

  • That is we will write some descriptors to the card, using Memory Write to the Card from the Host
  • Then the Card will fetch data using the address specified in the descriptors using a Memory Read from the Host
  • Then card will use the Write Address from the descriptors and write back the data using memory Write to the Host.

After some successful (initial) DMA transaction, when the card is trying to fetch the data from the System Memory using Mem Read, the Host is not responding, but we are getting a Malformed TLP error (Please find the log below). What would cause such a behaviour? Is there any thing that we could do to debug this issue?

[  +0.007225] pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: id=0020
[  +0.007961] pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0008(Receiver ID)
[  +0.011457] pcieport 0000:00:01.0:   device [10de:10e5] error status/mask=00040000/00000000
[  +0.008577] pcieport 0000:00:01.0:    [18] Malformed TLP          (First)
[  +0.007055] pcieport 0000:00:01.0:   TLP Header: 20000020 010000ff 00000000 e5feb000
[  +0.008337] pcieport 0000:00:01.0: broadcast error_detected message
[  +0.006411] my_pci_card 0000:01:00.0: device has no AER-aware driver
[  +0.898176] pcieport 0000:00:01.0: Root Port link has been reset
[  +0.006112] pcieport 0000:00:01.0: AER: Device recovery failed

Apologies for the delay in reply.
It doesn’t make sense that calling enable_irq() would crash the system. Is it possible that endpoint device might be sending some interrupts periodically to host and those are not handled properly in the “mycard_interrupt_handler()” ISR and hence it is crashing?
Also, I don’t understand why IRQ was not getting when ‘pci=nomsi’ was passed. The system should be able to allocate legacy IRQ in that. This again seems to be fishy.

As far as the Malformed TLP is concerned, your device is sending a 64-bit format memory read TLP request with upper bits (i.e. [63:32]) as zero which it shouldn’t have sent as per the PCIe spec. Since the address is only 0xe5feb000, it should’ve used 32-bit format MRd TLP request.
Refer PCIe spec Rev 4.0 Ver 1.0 Section 2.2.4.2 Address-Based Routing Rules which says the following

  • Memory Read, Memory Write, and AtomicOp Requests can use either format.
    • For Addresses below 4 GB, Requesters must use the 32-bit format. The behavior of the Receiver is not specified if a 64-bit format request addressing below 4 GB (i.e., with the upper 32 bits of address all 0) is received

I think you may have to configure your PCIe card’s custom DMA engine to smartly generate 32-bit/64-bit format MRd TLP requests based on target address.