hello The system version is R35.3.1. We communicate with PCIE and FPGA, but after running for an hour, the following error occurred. Have you encountered it?
8209.3560697xdma:xdma xfer submit: xdev0x00000000bfb228cd,offline
hello The system version is R35.3.1. We communicate with PCIE and FPGA, but after running for an hour, the following error occurred. Have you encountered it?
8209.3560697xdma:xdma xfer submit: xdev0x00000000bfb228cd,offline
Someone else will need to answer for this case, but I want to add this detail: Sometimes an FPGA will not be fully booted at the moment during Jetson boot in which PCIe detection occurs. Because Jetsons are designed by default to reduce power consumption, this would shut off PCIe, and a later booting FPGA wonāt be detected. It is possible to force PCIe to stay on.
I donāt know, and couldnāt tell you how to check (thereād be files in ā/sys
ā, I donāt know which) whereby you could find out if this is the result of a sleep mode or a low power state conservation. Donāt know for sure.
thanks for your replyļ¼Compared to TX2+FPGA, I used the same FPGA, the same pin X1 mode, and the same FPGA program. TX2 will not have PCIE offline in R28.1 system. However, the current AGX orin 35.3.1 will run for 1 hour before PCIE OFFLINE appears. By the way, we use streaming mode for FPGA.
When a PCIE OFFLINE error occurs, I reload the driver, but it cannot solve the problem. Is there any way to successfully restart?
Power outage ORIN can work normally
Even within the TX2, R28.1 is very very old. It was the first production 64-bit install (R24.x had the first 64-bit ARM CPUs, but was partially 32-bit compatibility mode in user space, and 64-bit in most of the kernel; this was improved over R24.x+; R28.1 was the very first āproductionā 64-bit in both user space and kernel space). In that early code, actual U-Boot was used. Later on, in the more modern R32.x, CBoot (which existed in the R28.x as well) had some U-Boot functionality ported to that, and actual U-Boot was removed. This makes for many changes during boot stages.
Moving on to the era of Orin, the R34.x+ is entirely UEFI, and most of the boot content has been completely reworked. Orin only works with R34.x (you have to get to R35.x for it to be considered a production release), and thus has almost nothing in common with any TX2 in terms of boot setup. Additionally, Those earlier releases used a 4.x series Linux kernel, while Orin (35.x) has a 5.x series Linux kernel. PCIe, USB, and many other parts (especially boot setup) are quite different. So comparing is a bit of an āapples and orangesā comparison.
Someone from NVIDIA would need to comment, but I think finding a way to stop PCIe autosuspend in R35.x of an AGX Orin would be of interest. Iām thinking this might be what the problem is. The log line you submitted is useful, but to be truly useful you would need the entire serial console boot log. It would be good to show any setup which occurred prior to that point, up to and including that time. At some point someone might suggest adding the debug version of the UEFI boot software for more serial console content, but for now, I suggest adding the full serial console log up to and including that point.
Can someone from NVIDIA suggest how to make sure PCIe never suspends on Orin AGX?
helloļ¼
`0001:00:00.0 PCI bridge: NVIDIA Corporation Device 229e (rev a1) (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR+ <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 52
Bus: primary=00, secondary=01, subordinate=ff, sec-latency=0
I/O behind bridge: 0000f000-00000fff [disabled]
Memory behind bridge: 40000000-405fffff [size=6M]
Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff [disabled]
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16- MAbort- >Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [70] Express (v2) Root Port (Slot-), MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0
ExtTag- RBE+
DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
LnkCap: Port #0, Speed 16GT/s, Width x1, ASPM not supported
ClockPM- Surprise+ LLActRep+ BwNot+ ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt+ AutBWInt-
LnkSta: Speed 5GT/s (downgraded), Width x1 (ok)
TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt+
RootCap: CRSVisible+
RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible+
RootSta: PME ReqID 0000, PMEStatus- PMEPending-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, NROPrPrP+, LTR+
10BitTagComp+, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS-, LN System CLS Not Supported, TPHComp-, ExtTPHComp-, ARIFwd+
AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 1ms to 10ms, TimeoutDis-, LTR+, OBFF Disabled ARIFwd-
AtomicOpsCtl: ReqEn- EgressBlck-
LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [b0] MSI-X: Enable- Count=1 Masked-
Vector table: BAR=0 offset=00000000
PBA: BAR=0 offset=00000000
Capabilities: [100 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
MultHdrRecCap+ MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
RootCmd: CERptEn+ NFERptEn+ FERptEn+
RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd-
FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0
ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0000
Capabilities: [148 v1] Secondary PCI Express
LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
LaneErrStat: 0
Capabilities: [158 v1] Physical Layer 16.0 GT/s <?>
Capabilities: [17c v1] Lane Margining at the Receiver <?>
Capabilities: [190 v1] L1 PM Substates
L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2- ASPM_L1.1- L1_PM_Substates+
PortCommonModeRestoreTime=60us PortTPowerOnTime=40us
L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
T_CommonMode=10us
L1SubCtl2: T_PwrOn=10us
Capabilities: [1a0 v1] Vendor Specific Information: ID=0002 Rev=4 Len=100 <?>
Capabilities: [2a0 v1] Vendor Specific Information: ID=0001 Rev=1 Len=038 <?>
Capabilities: [2d8 v1] Data Link Feature <?>
Capabilities: [2e4 v1] Precision Time Measurement
PTMCap: Requester:- Responder:+ Root:+
PTMClockGranularity: 16ns
PTMControl: Enabled:- RootSelected:-
PTMEffectiveGranularity: Unknown
Capabilities: [2f0 v1] Vendor Specific Information: ID=0004 Rev=1 Len=054 <?>
Capabilities: [358 v1] Vendor Specific Information: ID=0006 Rev=0 Len=018 <?>
Kernel driver in use: pcieport
0001:01:00.0 Serial controller: Xilinx Corporation Device 7021 (prog-if 01 [16450])
Subsystem: Xilinx Corporation Device 0007
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 312
Region 0: Memory at 20a8000000 (32-bit, non-prefetchable) [size=4M]
Region 1: Memory at 20a8400000 (32-bit, non-prefetchable) [size=64K]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [48] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 000000000f410040 Data: 0260
Capabilities: [60] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 5GT/s, Width x1, ASPM L0s, Exit Latency L0s unlimited
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s (ok), Width x1 (ok)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range B, TimeoutDis-, NROPrPrP-, LTR-
10BitTagComp-, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS-, TPHComp-, ExtTPHComp-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
AtomicOpsCtl: ReqEn-
LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [100 v1] Device Serial Number 00-00-00-00-00-00-00-00
Kernel driver in use: xdma`
ORIN is 1ms-10ms DevCtl2: Completion Timeout: 1ms to 10ms, TimeoutDis-, LTR+, OBFF Disabled ARIFwd-
FPGA XDMA is 50us-50ms
Is this the reason for the occurrence of OFFLINE?
I found that TX2 is also 50us-50ms
If thatās the reason, how should I modify it
I canāt fully answer this, NVIDIA will need to respond. I do notice though that there is no AER (Advanced Error Reporting) on the FPGA. This means that if there is an error, then it wonāt be corrected through that mechanism. It could be somewhat more āfragileā than a device with AER, although technically I donāt actually see an error on the FPGA. It is operating at PCIe v2 speeds, and if something were truly wrong, it would have reverted to PCIe v1.
Where is it that you see āOFFLINE
ā? I donāt see it in that verbose lspci
(which is useful). Is that lspci
from when it was working? Is it possible to see the same thing (if it isnāt already that) when the OFFLINE
shows up?
[ 6044.929433] pcieport 0001:00:00.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0001:00:00.0
[ 6044.929464] pcieport 0001:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 6044.940956] pcieport 0001:00:00.0: device [10de:229e] error status/mask=00004000/00400000
[ 6044.949556] pcieport 0001:00:00.0: [14] CmpltTO (First)
[ 6044.956593] xdma:xdma_error_resume: dev 0x000000006b25c02a,0x000000008a1e47f8.
[ 6044.956607] pcieport 0001:00:00.0: AER: device recovery successful
This is the printed information
Why didnāt the official personnel come?
I see AER for the PCI bridge. I donāt see it in sudo lspci -vvv
for the Xilinx device. Notice in the verbose lspci
that this is the bridge slot number:
0001:00:00.0 PCI bridge: NVIDIA Corporation Device 229e (rev a1) (prog-if 00 [Normal decode])
In those logs you quoted, this is the same slot, so the AER is part of the PCIe bus itself, and not part of the Xilinx.
Even so, I donāt know the consequences of this. The Xilinx does not have that recovery mechanism, but the bridge āfeedingā the Xilinx does, and so I think this correction is an error detected by the bridge as a result of the Xilinx. The Xilinx itself does not have AER, so it is not capable of reporting some errors. It would be nice if it had this since it might provide a hint at what the issue is. It might not even be an issue, it is hard to tell. I do see ASPM disabled, which is good for testing purposes, but I have no conclusion. It looks like it is working, and the driver is loading, but we wonāt be able to see the original error, and we wonāt be able to tell what data or condition is triggering that PCIe bridge issue (it says corrected, but correcting might throw away something important).
Someone else may need to answer this, but if you can find any information on the Xilinx device regarding Advanced Error Reporting (AER), maybe it could be enabled (or perhaps it isnāt possible). Someone from NVIDIA may be able to look up the initial error on the PCIe bridge and get a hint on the original cause.
sq@anonymous:~$ sudo lspci -vvv
0001:00:00.0 PCI bridge: NVIDIA Corporation Device 229e (rev a1) (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR+ <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 52
Bus: primary=00, secondary=01, subordinate=ff, sec-latency=0
I/O behind bridge: 0000f000-00000fff [disabled]
Memory behind bridge: 40000000-405fffff [size=6M]
Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff [disabled]
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16- MAbort- >Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [70] Express (v2) Root Port (Slot-), MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0
ExtTag- RBE+
DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
LnkCap: Port #0, Speed 16GT/s, Width x1, ASPM not supported
ClockPM- Surprise+ LLActRep+ BwNot+ ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt+ AutBWInt-
LnkSta: Speed 5GT/s (downgraded), Width x1 (ok)
TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt+
RootCap: CRSVisible+
RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible+
RootSta: PME ReqID 0000, PMEStatus- PMEPending-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, NROPrPrP+, LTR+
10BitTagComp+, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS-, LN System CLS Not Supported, TPHComp-, ExtTPHComp-, ARIFwd+
AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 1ms to 10ms, TimeoutDis-, LTR+, OBFF Disabled ARIFwd-
AtomicOpsCtl: ReqEn- EgressBlck-
LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [b0] MSI-X: Enable- Count=1 Masked-
Vector table: BAR=0 offset=00000000
PBA: BAR=0 offset=00000000
Capabilities: [100 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
MultHdrRecCap+ MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
RootCmd: CERptEn+ NFERptEn+ FERptEn+
RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd-
FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0
ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0000
Capabilities: [148 v1] Secondary PCI Express
LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
LaneErrStat: 0
Capabilities: [158 v1] Physical Layer 16.0 GT/s <?>
Capabilities: [17c v1] Lane Margining at the Receiver <?>
Capabilities: [190 v1] L1 PM Substates
L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2- ASPM_L1.1- L1_PM_Substates+
PortCommonModeRestoreTime=60us PortTPowerOnTime=40us
L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
T_CommonMode=10us
L1SubCtl2: T_PwrOn=10us
Capabilities: [1a0 v1] Vendor Specific Information: ID=0002 Rev=4 Len=100 <?>
Capabilities: [2a0 v1] Vendor Specific Information: ID=0001 Rev=1 Len=038 <?>
Capabilities: [2d8 v1] Data Link Feature <?>
Capabilities: [2e4 v1] Precision Time Measurement
PTMCap: Requester:- Responder:+ Root:+
PTMClockGranularity: 16ns
PTMControl: Enabled:- RootSelected:-
PTMEffectiveGranularity: Unknown
Capabilities: [2f0 v1] Vendor Specific Information: ID=0004 Rev=1 Len=054 <?>
Capabilities: [358 v1] Vendor Specific Information: ID=0006 Rev=0 Len=018 <?>
Kernel driver in use: pcieport
0001:01:00.0 Serial controller: Xilinx Corporation Device 7021 (prog-if 01 [16450])
Subsystem: Xilinx Corporation Device 0007
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 312
Region 0: Memory at 20a8000000 (32-bit, non-prefetchable) [size=4M]
Region 1: Memory at 20a8400000 (32-bit, non-prefetchable) [size=64K]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [48] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 000000000f410040 Data: 0260
Capabilities: [60] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 5GT/s, Width x1, ASPM L0s, Exit Latency L0s unlimited
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s (ok), Width x1 (ok)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range B, TimeoutDis-, NROPrPrP-, LTR-
10BitTagComp-, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS-, TPHComp-, ExtTPHComp-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
AtomicOpsCtl: ReqEn-
LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [100 v1] Device Serial Number 00-00-00-00-00-00-00-00
Kernel driver in use: xdma
ls pci ,This is the printed information
thanks
So I will suggest that maybe someone from NVIDIA can look up the meaning of this error (corrected) on the PCIe bridge:
https://forums.developer.nvidia.com/t/pcie-offline/259760/8
(this is the best indicator so far of what the Xilinx might be suffering from)
0001:00:00.0 PCI bridge: NVIDIA Corporation Device 229e (rev a1) (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR+ <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 52
Bus: primary=00, secondary=01, subordinate=ff, sec-latency=0
I/O behind bridge: 0000f000-00000fff [disabled]
Memory behind bridge: 40000000-405fffff [size=6M]
Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff [disabled]
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16- MAbort- >Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [70] Express (v2) Root Port (Slot-), MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0
ExtTag- RBE+
DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
LnkCap: Port #0, Speed 16GT/s, Width x1, ASPM not supported
ClockPM- Surprise+ LLActRep+ BwNot+ ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt+ AutBWInt-
LnkSta: Speed 5GT/s (downgraded), Width x1 (ok)
TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt+
RootCap: CRSVisible+
RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible+
RootSta: PME ReqID 0000, PMEStatus- PMEPending-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, NROPrPrP+, LTR+
10BitTagComp+, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS-, LN System CLS Not Supported, TPHComp-, ExtTPHComp-, ARIFwd+
AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 1ms to 10ms, TimeoutDis-, LTR+, OBFF Disabled ARIFwd-
AtomicOpsCtl: ReqEn- EgressBlck-
LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [b0] MSI-X: Enable- Count=1 Masked-
Vector table: BAR=0 offset=00000000
PBA: BAR=0 offset=00000000
Capabilities: [100 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
MultHdrRecCap+ MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
RootCmd: CERptEn+ NFERptEn+ FERptEn+
RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd-
FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0
ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0000
Capabilities: [148 v1] Secondary PCI Express
LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
LaneErrStat: 0
Capabilities: [158 v1] Physical Layer 16.0 GT/s <?>
Capabilities: [17c v1] Lane Margining at the Receiver <?>
Capabilities: [190 v1] L1 PM Substates
L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2- ASPM_L1.1- L1_PM_Substates+
PortCommonModeRestoreTime=60us PortTPowerOnTime=40us
L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
T_CommonMode=10us
L1SubCtl2: T_PwrOn=10us
Capabilities: [1a0 v1] Vendor Specific Information: ID=0002 Rev=4 Len=100 <?>
Capabilities: [2a0 v1] Vendor Specific Information: ID=0001 Rev=1 Len=038 <?>
Capabilities: [2d8 v1] Data Link Feature <?>
Capabilities: [2e4 v1] Precision Time Measurement
PTMCap: Requester:- Responder:+ Root:+
PTMClockGranularity: 16ns
PTMControl: Enabled:- RootSelected:-
PTMEffectiveGranularity: Unknown
Capabilities: [2f0 v1] Vendor Specific Information: ID=0004 Rev=1 Len=054 <?>
Capabilities: [358 v1] Vendor Specific Information: ID=0006 Rev=0 Len=018 <?>
Kernel driver in use: pcieport
0001:01:00.0 Serial controller: Xilinx Corporation Device 7021 (prog-if 01 [16450])
Subsystem: Xilinx Corporation Device 0007
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 312
Region 0: Memory at 20a8000000 (32-bit, non-prefetchable) [size=4M]
Region 1: Memory at 20a8400000 (32-bit, non-prefetchable) [size=64K]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [48] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 000000000f410040 Data: 0260
Capabilities: [60] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 5GT/s, Width x1, ASPM L0s, Exit Latency L0s unlimited
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s (ok), Width x1 (ok)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range B, TimeoutDis-, NROPrPrP-, LTR-
10BitTagComp-, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS-, TPHComp-, ExtTPHComp-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
AtomicOpsCtl: ReqEn-
LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [100 v1] Device Serial Number 00-00-00-00-00-00-00-00
Kernel driver in use: xdma`
DevCtl2: Completion Timeout: 1ms to 10ms, TimeoutDis-, LTR+, OBFF Disabled ARIFwd-
How to modify this 1-10ms to 50us-50ms
I have a similar error where I am communicating with an FPGA over the PCIE x16 connection on the development board and after a few minutes the board is powered off and the connection is lost. Hereās the dmesg output:
[ 168.027795] pcieport 0005:00:00.0: AER: Corrected error received: 0005:00:00.0
[ 168.027812] pcieport 0005:00:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[ 168.027814] pcieport 0005:00:00.0: device [10de:229a] error status/mask=00000001/0000e000
[ 168.027816] pcieport 0005:00:00.0: [ 0] RxErr (First)
[ 168.078262] pcieport 0005:00:00.0: AER: Uncorrected (Fatal) error received: 0005:00:00.0
[ 168.078279] pcieport 0005:00:00.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
[ 168.078281] pcieport 0005:00:00.0: device [10de:229a] error status/mask=00000020/00400000
[ 168.078283] pcieport 0005:00:00.0: [ 5] SDES (First)
[ 168.078294] sdma 0005:81:00.0: AER: canāt recover (no error_detected callback)
[ 168.078296] pci 0005:01:00.1: AER: canāt recover (no error_detected callback)
[ 170.151810] pcieport 0005:00:00.0: Data Link Layer Link Active not set in 1000 msec
[ 170.151823] pcieport 0005:00:00.0: AER: Root Port link has been reset (-25)
[ 170.151830] pcieport 0005:00:00.0: AER: subordinate device reset failed
[ 170.151880] pcieport 0005:00:00.0: AER: device recovery failed
I donāt have sufficient knowledge of the PCIe internals to provide any specific advice. NVIDIA will have to reply. A couple of notes though:
quiet
ā occurs in ā/boot/extlinux/extlinux.conf
ā).Any test where you make sure the FPGA is fully up and running before power ever starts to the Jetson could be a useful test to see if something changes.
Turns out the issue was with my FPGA board itself, not the Orin. It was overheating and needed cooling, which is why is was turning off after being on for just a short while