PCIe Bus Error

I am trying to get an m.2 video capture card to work on my Jetson NX.

I noticed I was getting some PCIe Bus Errors, so I set pcie_aspm=off and pci=nomsi on the append line of extlinux.conf.

This seemed to reduce the number of errors. However, as soon as the overcurrent warning kicks in because of the temperature, the errors come back. See image.

I doubt I can help, but one minor possibility comes to mind: Is this NX powered via USB or via barrel jack connector? If by USB, then try switching to a higher power barrel jack supply. My thought is that an overcurrent is probably going to trigger even with a beefier power supply, but it is worth checking if a lower power USB power supply is used. Also, if temperature is too high, you might see if putting this under a large fan delays the onset of the issue.

Unfortunately I don’t know of any m.2 adapter which would allow you to test separate power delivery to the m.2 card without using the Jetson’s power bus, but that would be a useful step if it were available.
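If it helps to see whether a fan really delays the onset, here is a minimal sketch for logging temperatures so they can be lined up against dmesg timestamps. It assumes the standard Linux sysfs thermal interface (`/sys/class/thermal/thermal_zone*/temp`, values in millidegrees Celsius); the function names are my own, and which zones exist on a given Jetson is not something I have checked.

```python
import glob


def millidegrees_to_celsius(raw):
    """Convert a sysfs temperature string such as '55500\\n' to degrees Celsius."""
    return int(raw.strip()) / 1000.0


def read_zones(pattern="/sys/class/thermal/thermal_zone*/temp"):
    """Return {path: temperature in C} for every thermal zone that can be read."""
    readings = {}
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as handle:
                readings[path] = millidegrees_to_celsius(handle.read())
        except (OSError, ValueError):
            # Some zones may refuse reads or return non-numeric data; skip them.
            continue
    return readings


# Example use (run in a loop with time.sleep() to build a log):
#   for path, temp in read_zones().items():
#       print(path, temp)
```

Printing these readings every few seconds from a serial console session gives a record that survives the shutdown.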

Hi @linuxdev,

I’m using a barrel connector. The PCI errors start at, I think, about 55 degrees, and the unit then shuts down without fail at 65 degrees with the m.2 card attached. Without the m.2 card it is fine. If I have the Jetson fan running above 60% it is also fine.

I have set these in extlinux.conf - pcie_aspm=off pci=nomsi pci=noaer iommu=1 amd_iommu=on ASPCM=off

But I still get errors. What can I do to disable the mechanisms that are causing the PCI error reports?

Ultimately, I’m just trying to get my NX to not shut down automatically, which seems to be linked somewhat to these errors (or the errors could be a clue to some other problem).
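One thing worth ruling out is whether the options from extlinux.conf actually reached the running kernel. A minimal sketch, assuming the kernel command line is read from `/proc/cmdline`; the helper names are hypothetical, and the simple whitespace split ignores quoted values, which is a simplification.

```python
def parse_cmdline(cmdline):
    """Map each kernel parameter name to the list of values given for it."""
    params = {}
    for token in cmdline.split():
        name, _, value = token.partition("=")
        params.setdefault(name, []).append(value)
    return params


def missing_options(wanted, cmdline):
    """Return the tokens from `wanted` that do not appear on `cmdline`."""
    active = set(cmdline.split())
    return [token for token in wanted.split() if token not in active]


# Example use on the target:
#   with open("/proc/cmdline") as handle:
#       running = handle.read()
#   print(missing_options("pcie_aspm=off pci=nomsi pci=noaer", running))
```

An empty list means every option was passed to the kernel; it does not tell you whether the kernel or driver acted on it.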

You would need more details on the errors, although there might not be anything available short of somehow using external power or better cooling. To see a verbose PCI error and details after the error has started, and to create a log of the error:
sudo lspci -vvv
(if you have more than one PCI device you can limit to just the one device with the “-s <slot>” option)

The trick is of course that you have to run that command after an error, but if the error is shutting things down, then this might not be possible. Definitely the best place to run this command is from a serial console since this allows logging of the session and you can also run the lspci command from a host which won’t go down. I tend to use gtkterm for serial console, and I like the logging feature.

Note that most of the time thermal issues won’t be something you can solve with simple software changes. Also, “pci=noaer” might have more effect on the error messages than on actually stopping the failure (not seeing an error message won’t necessarily change the fate of the shutdown). Possibly the verbose lspci will show more if “noaer” is not used.
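To make a saved verbose lspci log easier to scan, here is a rough sketch that pulls out only the status flags marked with "+" per device. It assumes the text layout that `lspci -vvv` prints (a slot address at the start of each device block, then `DevSta:`, `UESta:` and `CESta:` lines); the function name and the choice of fields are mine.

```python
import re

# Status lines whose '+' flags indicate a logged error or condition.
STATUS_FIELDS = ("DevSta", "UESta", "CESta")

# Device header, e.g. "0005:01:00.0 Multimedia video controller: ..."
SLOT_RE = re.compile(r"^((?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s")


def error_flags(lspci_text):
    """Return {slot: {field: [flags set to '+']}} from `lspci -vvv` output."""
    result = {}
    slot = None
    for raw in lspci_text.splitlines():
        header = SLOT_RE.match(raw)
        if header:
            slot = header.group(1)
            result[slot] = {}
            continue
        field, sep, rest = raw.strip().partition(":")
        if slot is not None and sep and field in STATUS_FIELDS:
            result[slot][field] = re.findall(r"(\w+)\+", rest)
    return result


# Example use: feed it the text captured from "sudo lspci -vvv" in the
# serial console log, then print the returned dictionary.
```

Comparing the result from a capture taken before the errors start with one taken after should show which flags changed.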

I have found that the last dmesg report before my system does an unwanted shutdown is “rcu_preempt self-detected stall on CPU”. Am I correct in saying this shutdown could be the result of a Linux kernel deadlock caused by so many error messages from the PCIe m.2 card?

I tested again with no RTSP stream output. This time the final error in dmesg was as follows:

In case this helps,

vizgardNX:~ ➜ sudo lspci -vvv
[sudo] password for vizgard:
0005:00:00.0 PCI bridge: NVIDIA Corporation Device 1ad0 (rev a1) (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 35
Bus: primary=00, secondary=01, subordinate=ff, sec-latency=0
I/O behind bridge: 0000f000-00000fff
Memory behind bridge: 40000000-400fffff
Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
Address: 0000000000000000 Data: 0000
Masking: 00000000 Pending: 00000000
Capabilities: [70] Express (v2) Root Port (Slot-), MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0
ExtTag- RBE+
DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM not supported, Exit Latency L0s <1us, L1 <64us
ClockPM- Surprise+ LLActRep+ BwNot+ ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt+ AutBWInt-
LnkSta: Speed 5GT/s, Width x2, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt+
RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible+
RootCap: CRSVisible+
RootSta: PME ReqID 0000, PMEStatus- PMEPending-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+, OBFF Not Supported ARIFwd-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled ARIFwd-
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [b0] MSI-X: Enable- Count=8 Masked-
Vector table: BAR=2 offset=00000000
PBA: BAR=2 offset=00010000
Capabilities: [100 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
Capabilities: [148 v1] #19
Capabilities: [168 v1] #26
Capabilities: [190 v1] #27
Capabilities: [1c0 v1] L1 PM Substates
L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2- ASPM_L1.1- L1_PM_Substates+
PortCommonModeRestoreTime=60us PortTPowerOnTime=40us
L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
T_CommonMode=60us
L1SubCtl2: T_PwrOn=40us
Capabilities: [1d0 v1] Vendor Specific Information: ID=0002 Rev=4 Len=100 <?>
Capabilities: [2d0 v1] Vendor Specific Information: ID=0001 Rev=1 Len=038 <?>
Capabilities: [308 v1] #25
Capabilities: [314 v1] Precision Time Measurement
PTMCap: Requester:+ Responder:+ Root:+
PTMClockGranularity: 16ns
PTMControl: Enabled:- RootSelected:-
PTMEffectiveGranularity: Unknown
Capabilities: [320 v1] Vendor Specific Information: ID=0004 Rev=1 Len=054 <?>
Kernel driver in use: pcieport

0005:01:00.0 Multimedia video controller: Nanjing Magewell Electronics Co., Ltd. Device 0053 (rev 01)
Subsystem: SafeNet (wrong ID) Device 0000
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 556
Region 0: Memory at 1f40000000 (32-bit, non-prefetchable) [size=256K]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [48] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000fffff000 Data: 0000
Capabilities: [60] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 5GT/s, Width x2, ASPM L0s, Exit Latency L0s unlimited, L1 unlimited
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s, Width x2, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range B, TimeoutDis-, LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [100 v1] Device Serial Number 00-00-00-00-00-00-00-00
Kernel driver in use: Eco Capture
Kernel modules: MWEcoCapture

I logged the number of system interrupts while the system was stressed (AI running + m.2 video capture card connected) and then captured another log 60 seconds later. I compared the results and listed the items that had the most interrupts (from the top) in the table below. I am sure the eco-capture, tegra-pcie-msi and tegra-oc-event interrupts are contributing to my Jetson shutting down. However, there are some items that have considerably more interrupts. Can someone help reduce any of these interrupts?
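For anyone wanting to repeat the comparison as text rather than screenshots, this is a minimal sketch of the method: take two copies of `/proc/interrupts` some seconds apart and rank the sources by how much their counts grew. The function names are hypothetical and the parsing assumes the usual layout (a `CPUn` header row, then one row per interrupt with a count per CPU).

```python
def parse_interrupts(text):
    """Return {(label, description): total count across CPUs} for one snapshot."""
    lines = text.strip().splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        for field in fields[:ncpu]:
            if not field.isdigit():
                break  # rows such as "Err:" carry fewer counters
            counts.append(int(field))
        description = " ".join(fields[len(counts):])
        totals[(label.strip(), description)] = sum(counts)
    return totals


def busiest(before, after, top=5):
    """Rank interrupt sources by the increase between two snapshots."""
    deltas = {key: after[key] - before.get(key, 0) for key in after}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)[:top]


# Example use:
#   first = parse_interrupts(open("/proc/interrupts").read())
#   ... wait 60 seconds ...
#   second = parse_interrupts(open("/proc/interrupts").read())
#   for (label, description), delta in busiest(first, second):
#       print(delta, label, description)
```

The output is plain text, so it can be pasted into a post and quoted line by line.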

FYI, it would be better to use a serial console log instead of screenshots. They’re much more complete, and searchable. Plus I could quote particular log lines, but it is hard to quote a particular line from a screenshot.

Note that when you paste long logs you can highlight them and use the “code” icon (looks like “</>”) to get scrollbars and preserve whitespace. You can also use the “pencil” icon to edit your existing post and add three backquote characters above and below the block of log, which does the same thing. The three backquotes look like this: ```

That seems likely. I don’t think the deadlock (if that is the specific issue) could be resolved in software.

Regarding the verbose lspci, the bridge is capable of PCIe v3 speeds (LnkCap 8GT/s), but is running at PCIe v2 speeds (LnkSta 5GT/s). This is not an error and should work. The “Nanjing Magewell Electronics Co., Ltd. Device” is only capable of PCIe v2 speeds, and this is what the bridge is running at. Since the end device is running at its maximum PCIe v2 speed, I suspect signal quality is good and the link is functioning without errors so far as signal integrity goes.
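The same LnkCap versus LnkSta comparison can be done mechanically on a saved log. A small sketch, assuming the `Speed <n>GT/s` and `Width x<n>` wording that lspci prints on those two lines; the function names are my own.

```python
import re

# Matches only the LnkCap and LnkSta lines (not LnkCap2, LnkSta2 or LnkCtl2).
LINK_RE = re.compile(r"(LnkCap|LnkSta):.*?Speed ([\d.]+)GT/s.*?Width x(\d+)")


def link_summary(device_text):
    """Return {'LnkCap': (speed, width), 'LnkSta': (speed, width)} for one device."""
    return {
        field: (float(speed), int(width))
        for field, speed, width in LINK_RE.findall(device_text)
    }


def below_capability(info):
    """True when the negotiated link is slower or narrower than the device allows."""
    cap, sta = info["LnkCap"], info["LnkSta"]
    return sta[0] < cap[0] or sta[1] < cap[1]
```

With the values posted above, the bridge reports below its own capability (8GT/s x8 capable, 5GT/s x2 negotiated) while the capture card does not (5GT/s x2 for both), which is what you would expect when the end device is the limiting side.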

Is it correct that the verbose lspci log was taken without AER being enabled? It would be nice to see the verbose lspci with AER so we can see what errors were showing up.

I do not know of a way to reduce interrupts. In the case of hardware IRQs this is probably due to hardware activity…the instruction to “remove the hardware” wouldn’t be useful. In terms of a software IRQ (ksoftirq) this would be something the scheduler runs on a “fairness” basis, but I don’t see that as having any effect on what you are working on.

You could possibly be seeing “IRQ starvation”, where the first CPU core is unable to properly service all of the hardware IRQs it receives, but it feels like something else (I doubt shutdown is from IRQs not being serviced fast enough). Do note though that almost all hardware IRQs can only be serviced by CPU0 since hardware IRQs mostly go only to that CPU, and so the Jetson is probably more susceptible to such a problem than other multicore systems which can distribute hardware IRQs to other cores. If you really wanted to deal with this, then what you’d want to do is reduce as much load as possible on CPU0 and move it to other cores, and that implies moving only software IRQs.
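As a rough illustration of taking load off CPU0, here is a sketch that restricts an ordinary user-space process to every core except CPU0, using Python's `os.sched_getaffinity` and `os.sched_setaffinity` (Linux only). The helper names are hypothetical. This only covers regular processes such as the AI workload; it does not move hardware IRQs and does not control where the kernel runs its own softirq work.

```python
import os


def cpus_without(cpu, available):
    """Return `available` minus `cpu`, or all of `available` if nothing would remain."""
    remaining = set(available) - {cpu}
    return remaining or set(available)


def move_off_cpu0(pid=0):
    """Restrict `pid` (0 means the calling process) to its allowed CPUs except CPU0."""
    target = cpus_without(0, os.sched_getaffinity(pid))
    os.sched_setaffinity(pid, target)
    return target


# Example use, called early in the process that should stay away from CPU0:
#   print("now allowed on CPUs:", sorted(move_off_cpu0()))
```

The `taskset` command can do the same thing from a shell for a process that is already running.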

My thought is that the driver has some issue specific to arm64/aarch64.
