K1 connected to i210: Strange Errors

Hi everybody,

We are currently facing a problem with a custom design. We developed a board using the K1 with 2x i210 controllers with attached NVM (firmware: 3.25, 0x800005cf, flashed manually). iperf works perfectly, but we have problems transferring specific frame sizes (on both interfaces). We have narrowed it down to the following observations:

  • iperf shows data rates ~ 940 MBit/s, no errors in ifconfig
  • specific frame lengths received by the i210 show errors (67 error, 68 ok, 69 error, 70 ok, … 94 ok, 95 error, 96 ok, 97 ok, … 160 ok, 161 error, 162 ok, 163 error, …). This was tested sending 1, 2, 3, … bytes using nc via TCP. These errors are always reproducible!
  • The failing lengths follow a block-like pattern (a sort of “block error”)
  • The error seen via tcpdump/wireshark is always the same: regardless of the payload size sent, when an error occurs only the last byte of the TCP payload differs from the one sent, and its value is random!
  • With RX offloading enabled, the TCP checksum is marked as correct, yet tcpdump shows the wrong data
  • With RX offloading disabled, tcpdump shows the error and Linux drops the packet
  • Debugging the DMA frame in the igb driver shows that the error is already present there (so no Linux driver/stack issue)
  • We also tried igb driver versions 5.0.3 and 5.3.4.4
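
For anyone who wants to repeat the length sweep without driving nc by hand, here is a minimal Python sketch. It is not the script used above: the function names are made up for illustration, and it assumes an echo service is running on the board under test so that corrupted bytes come back to the sender. It also counts TCP payload sizes, not frame lengths, and a timeout is treated as a failure (with RX offloading disabled, the kernel drops the bad segment instead of delivering it).

```python
import socket
import socketserver
import threading


def recv_exact(sock, n):
    """Read exactly n bytes, or fewer if the peer closes the connection early."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def sweep_payload_sizes(host, port, sizes, timeout=2.0):
    """Send one payload per size to an echo service; return sizes that came back altered."""
    bad = []
    for size in sizes:
        # Deterministic, size-dependent pattern so a changed byte is easy to spot.
        payload = bytes((i * 31 + size) % 256 for i in range(size))
        try:
            with socket.create_connection((host, port), timeout=timeout) as sock:
                sock.sendall(payload)
                sock.shutdown(socket.SHUT_WR)
                echoed = recv_exact(sock, size)
        except OSError:
            echoed = None  # timeout or reset: count it as a failure
        if echoed != payload:
            bad.append(size)
    return bad


class EchoHandler(socketserver.BaseRequestHandler):
    """Echo every received byte back to the sender until it closes its side."""

    def handle(self):
        while True:
            data = self.request.recv(4096)
            if not data:
                break
            self.request.sendall(data)


def start_echo_server(host="127.0.0.1"):
    """Start a background echo server on a free port (hypothetical helper for the board side)."""
    server = socketserver.ThreadingTCPServer((host, 0), EchoHandler)
    server.daemon_threads = True
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

On the board, something like `start_echo_server("0.0.0.0")` would run the echo side; from the other machine, `sweep_payload_sizes(board_ip, port, range(1, 200))` returns the list of payload sizes that did not survive the round trip.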

There is another user in the Intel Support Forum facing exactly the same problem: https://embedded.communities.intel.com/thread/12355

Are there any known issues in the K1 PCIe core, DMA engine or system buses that would explain this strange error?

Best regards,

Olaf

I do not know about the issue, but if you run “lspci” you should see your PCIe-based NIC. Take the identifier (format nn:nn.n, on the left) and run lspci verbosely for that device. Do this after your netcat experiments have produced errors, and look for any mention of errors. If the error is in PCIe, you will probably get stats on it. If nothing is reported as a PCIe error, then the issue is elsewhere.

lspci -s 'nn:nn.n' -vvv
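
If reading the verbose output by eye gets tedious, the check can be scripted. The following is a rough Python sketch, not an official tool: it assumes the usual `lspci -vvv` text layout (device lines unindented, register lines indented, flags ending in `+` or `-`) and only looks at the DevSta, UESta and CESta status lines. The function name and the choice of registers are my own.

```python
import re

# Status registers whose '+' flags mean an error has been logged.
# Control, mask and severity registers (DevCtl, UEMsk, UESvrt, CEMsk) are skipped on purpose.
STATUS_REGISTERS = ("DevSta", "UESta", "CESta")
# DevSta flags that describe state rather than errors.
NON_ERROR_FLAGS = {"AuxPwr", "TransPend"}

FLAG_RE = re.compile(r"([A-Za-z0-9]+)([+-])")


def find_pcie_errors(lspci_text):
    """Return {(device, register): [flags that are set]} for logged PCIe errors."""
    errors = {}
    device = None
    for line in lspci_text.splitlines():
        if line and not line[0].isspace():
            device = line.split()[0]  # unindented line starts a device, e.g. "01:00.0"
            continue
        stripped = line.strip()
        for reg in STATUS_REGISTERS:
            if stripped.startswith(reg + ":"):
                flags = FLAG_RE.findall(stripped[len(reg) + 1:])
                set_flags = [
                    name for name, sign in flags
                    if sign == "+" and name not in NON_ERROR_FLAGS
                ]
                if set_flags:
                    errors[(device, reg)] = set_flags
    return errors
```

Feed it the captured text of `lspci -vvv` (run as root, otherwise the capability registers are not shown). An empty dictionary means none of these status registers has an error flag set.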

Not sure if it’s of any use but check this out: https://devtalk.nvidia.com/default/topic/903957/jetson-tk1/tk1-intel-ethernet-controller-i210-it/post/4760503/#4760503

From my point of view there are no errors:

00:00.0 PCI bridge: NVIDIA Corporation TegraK1 PCIe x4 Bridge (rev a1) (prog-if 00 [Normal decode])
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 641
        Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
        I/O behind bridge: 00001000-00001fff
        Memory behind bridge: 32200000-322fffff
        Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff
        Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
        BridgeCtl: Parity+ SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
                PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
        Capabilities: [40] Subsystem: NVIDIA Corporation Device 0000
        Capabilities: [48] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [50] MSI: Enable+ Count=1/2 Maskable- 64bit+
                Address: 00000000ad0a0000  Data: 0001
        Capabilities: [60] HyperTransport: MSI Mapping Enable- Fixed-
                Mapping Address Base: 00000000fee00000
        Capabilities: [80] Express (v2) Root Port (Slot+), MSI 00
                DevCap: MaxPayload 128 bytes, PhantFunc 0
                        ExtTag+ RBE+
                DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 5GT/s, Width x2, ASPM L0s L1, Exit Latency L0s <512ns, L1 <4us
                        ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp-
                LnkCtl: ASPM L1 Enabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt-
                SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
                        Slot #0, PowerLimit 0.000W; Interlock- NoCompl-
                SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
                        Control: AttnInd Off, PwrInd On, Power- Interlock-
                SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
                        Changed: MRL- PresDet+ LinkState+
                RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible-
                RootCap: CRSVisible-
                RootSta: PME ReqID 0000, PMEStatus- PMEPending-
                DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR-, OBFF Not Supported ARIFwd-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled ARIFwd-
                LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
        Kernel driver in use: pcieport

01:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 130
        Region 0: Memory at 32200000 (32-bit, non-prefetchable) 
        Region 2: I/O ports at 1000 
        Region 3: Memory at 32280000 (32-bit, non-prefetchable) 
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
                Address: 0000000000000000  Data: 0000
                Masking: 00000000  Pending: 00000000
        Capabilities: [70] MSI-X: Enable+ Count=5 Masked-
                Vector table: BAR=3 offset=00000000
                PBA: BAR=3 offset=00002000
        Capabilities: [a0] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
                DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
                LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <2us, L1 <16us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM L1 Enabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [100 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
        Capabilities: [140 v1] Device Serial Number 00-a0-c9-ff-ff-00-00-02
        Capabilities: [1a0 v1] Transaction Processing Hints
                Device specific mode supported
                Steering table in TPH capability structure
        Kernel driver in use: igb
        Kernel modules: igb

If possible, can you show the output of “lspci -t -vv”? I see a bridge and a single endpoint, but you mentioned two i210 controllers. I’m curious if the second i210 controller shows up; information below on the i210 is for just a single controller.

The bridge does not report any errors (nothing correctable, uncorrectable, or fatal).

The lspci output shows an i210 controller (“01:00.0”) capable of PCIe revision 1 speed (2.5GT/s) on a single lane, which matches actual operation, so PCIe is functioning as designed. As with the bridge, the link has not logged any correctable, uncorrectable, or fatal errors. For this device, PCIe is not the cause of the error. Keep in mind that this is a single i210 controller at “01:00.0”, so if there is a second controller, the same check would be needed for it.

What is the output of “ifconfig” after an error has occurred? If it shows no errors, the issue is likely in user space, e.g., in the program doing the sending or receiving. If ifconfig does show errors, there is probably an issue in the drivers or network configuration (e.g., collisions between two properly working machines misconfigured to use the same address, or an inability to handle overwhelming traffic).

Hi everybody,

We finally found out what the problem was. We made a mistake in the schematic: some (not all) DM/DQS lines were connected to the wrong block of data lines at the DDR3 chip. The data swizzle of the K1 obviously did not work on those lines (it only works on a per-block basis). This resulted in interchanged bytes within a 64-bit DDR3 access. As long as we accessed those bytes cached (i.e. in multiples of at least 32 bits), the alignment of bytes did not matter. However, DMA transactions (which PCIe and our NIC obviously perform) sometimes write single bytes, and that is where the problems described above showed up.
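
To illustrate why full-width writes can hide this kind of wiring error while partial writes expose it, here is a deliberately simplified Python model. It is a sketch under assumptions, not a description of the actual board: it treats one 64-bit access as eight byte lanes, assumes the data-mask signals of lanes 2 and 3 are swapped (a hypothetical choice), and ignores DQS timing and how the memory controller really splits DMA bursts.

```python
LANES = 8                                 # one 64-bit DDR3 access = 8 byte lanes
GOOD_WIRING = list(range(LANES))          # mask signal i reaches lane i
SWAPPED_WIRING = [0, 1, 3, 2, 4, 5, 6, 7]  # assumption: masks of lanes 2 and 3 swapped
STALE = 0xAA                              # whatever was in memory before the write


def masked_write(memory, data, enable, dm_wiring):
    """Write one 8-byte word; lane i is written only if the mask it *sees* enables it.

    dm_wiring[i] is the controller lane whose mask signal is wired to DRAM lane i.
    """
    for lane in range(LANES):
        if enable[dm_wiring[lane]]:
            memory[lane] = data[lane]


def write_payload(memory, payload, dm_wiring):
    """Write len(payload) <= 8 bytes starting at lane 0, masking the unused lanes."""
    data = bytes(payload) + bytes(LANES - len(payload))
    enable = [lane < len(payload) for lane in range(LANES)]
    masked_write(memory, data, enable, dm_wiring)


def failing_lengths(dm_wiring):
    """Return the payload lengths (1..8) whose bytes do not read back as written."""
    bad = []
    for n in range(1, LANES + 1):
        memory = bytearray([STALE] * LANES)
        payload = bytes(range(1, n + 1))
        write_payload(memory, payload, dm_wiring)
        if bytes(memory[:n]) != payload:
            bad.append(n)
    return bad
```

In this toy model only the write that ends between the two swapped lanes fails, and it is the last byte of the payload that keeps its stale value, while writes covering both lanes (or neither) read back correctly. That is the same flavour of symptom as above, although the real pattern of failing lengths depends on details this model leaves out.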

Thanks for helping out!

Olaf