We are currently facing a problem with a custom design. We developed a board using the K1 with two i210 controllers with attached NVM (firmware 3.25, 0x800005cf, flashed manually). iperf works perfectly, but we have problems transferring specific frame sizes (on both interfaces). We narrowed it down to the following observations:
iperf shows data rates of ~940 Mbit/s, with no errors reported by ifconfig
Specific frame lengths received by the i210 show errors (67 error, 68 ok, 69 error, 70 ok, … 94 ok, 95 error, 96 ok, 97 ok, … 160 ok, 161 error, 162 ok, 163 error, …). This was tested by sending 1, 2, 3, … bytes over TCP using nc. These errors are always reproducible!
This looks like a kind of “block error” pattern.
The error seen via tcpdump/Wireshark is always the same: independently of the payload size sent, when an error occurs only the last byte of the TCP payload is changed randomly (compared to the byte that was sent)!
With RX checksum offloading enabled, the TCP checksum is marked as correct, so tcpdump shows the corrupted data.
With RX checksum offloading disabled, tcpdump reports the checksum error and Linux drops the packet.
Debugging the DMA frame in the igb driver shows that the error is already present there (so it is not a Linux driver/stack issue).
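To rerun the payload-size sweep and sort each result into “ok”, “only the last byte differs”, or “something else”, the receiver needs to know exactly what was sent. Below is a minimal C sketch, assuming the sender fills every payload with a deterministic, position-dependent pattern and the receiver compares the captured TCP payload against a regenerated copy. The names `fill_pattern` and `classify_payload` are hypothetical helpers for illustration, not part of nc, tcpdump, or the igb driver.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fill buf with a repeating 'A'..'Z' pattern so that
 * every byte position has a known expected value on the receiving side. */
void fill_pattern(uint8_t *buf, size_t len) {
    for (size_t i = 0; i < len; i++)
        buf[i] = (uint8_t)('A' + (i % 26));
}

enum mismatch_kind {
    MISMATCH_NONE,           /* payload arrived intact */
    MISMATCH_LAST_BYTE_ONLY, /* exactly one differing byte, and it is the last one */
    MISMATCH_OTHER           /* any other corruption pattern */
};

/* Hypothetical helper: compare the sent payload with the captured one. */
enum mismatch_kind classify_payload(const uint8_t *sent, const uint8_t *recv, size_t len) {
    size_t diffs = 0, last_diff = 0;
    for (size_t i = 0; i < len; i++) {
        if (sent[i] != recv[i]) {
            diffs++;
            last_diff = i;
        }
    }
    if (diffs == 0)
        return MISMATCH_NONE;
    if (diffs == 1 && last_diff == len - 1)
        return MISMATCH_LAST_BYTE_ONLY;
    return MISMATCH_OTHER;
}
```

Running this for each payload size in the sweep turns the ok/error list above into a table that can be checked against the “only the last byte changes” observation.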
I do not know about this particular issue, but if you run “lspci” you should see your PCIe-based NIC. Take its identifier from the left-hand column (bus:device.function, e.g. 01:00.0) and run lspci verbosely on just that device (e.g. “lspci -s 01:00.0 -vv”). Check for reported errors after reproducing the failure with your netcat experiments. If the error is PCIe-related, you will probably see it reported there; if nothing shows up as a PCIe error, the issue is elsewhere.
If possible, can you show the output of “lspci -t -vv”? I see a bridge and a single endpoint, but you mentioned two i210 controllers. I’m curious whether the second i210 controller shows up; the information below on the i210 covers just a single controller.
The bridge does not report any detected errors (nothing correctable, uncorrectable, or fatal).
The PCIe listing shows an i210 controller (“01:00.0”) capable of PCIe revision 1 speed on a single lane, which matches the actual link state (PCIe functioning as designed). As with the bridge, no correctable, uncorrectable, or fatal errors were detected on the link. For this device, PCIe is not the cause of the error. Keep in mind that this is a single i210 controller at “01:00.0”, so if there is a second controller, the same comparison would be required for it.
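As a rough sanity check that a revision 1, single-lane link is enough for a gigabit NIC: PCIe 1.x signals at 2.5 GT/s per lane with 8b/10b encoding, which gives about 2 Gbit/s per direction before packet (TLP/DLLP) and flow-control overhead, roughly twice the Ethernet line rate. This back-of-the-envelope figure ignores those protocol overheads, which reduce the usable throughput further.

```latex
B_{\text{Gen1, x1}} = 2.5\,\text{GT/s} \times \frac{8}{10} = 2.0\,\text{Gbit/s per direction}
\qquad\text{vs.}\qquad
B_{\text{GbE}} = 1.0\,\text{Gbit/s}
```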
What is the output of “ifconfig” after an error has occurred? If it shows no errors, the issue is likely in user space, e.g., in the program actually sending or receiving; if ifconfig does show errors, there is probably some issue in the drivers or the network configuration (e.g., collisions between two properly working machines that were misconfigured to use the same address, or an inability to handle an overwhelming traffic flow).
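The same counters ifconfig prints can be read directly from sysfs, which makes it easy to snapshot them before and after a failing netcat transfer. A minimal C sketch, assuming a Linux system that exposes per-interface statistics under `/sys/class/net/<iface>/statistics/` (files such as `rx_errors`, `rx_crc_errors`, `rx_dropped`); the helper names `read_counter` and `read_iface_counter` are hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper: read one unsigned counter from a sysfs-style file.
 * Returns 0 on success, -1 if the file is missing or does not hold a number. */
int read_counter(const char *path, unsigned long long *out) {
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    int ok = (fscanf(f, "%llu", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Hypothetical helper: read a named statistic (e.g. "rx_errors") for an
 * interface (e.g. "eth0") from /sys/class/net/<iface>/statistics/<name>. */
int read_iface_counter(const char *iface, const char *name, unsigned long long *out) {
    char path[256];
    int n = snprintf(path, sizeof path, "/sys/class/net/%s/statistics/%s", iface, name);
    if (n < 0 || (size_t)n >= sizeof path)
        return -1; /* path would have been truncated */
    return read_counter(path, out);
}
```

Comparing `rx_errors` and `rx_crc_errors` before and after a failing transfer shows whether the NIC saw anything wrong on the wire; in this thread ifconfig reported no errors, which fits corruption happening after the frame left the NIC.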
We finally found out what the problem was. We made a mistake in the schematic, connecting some (not all) DM/DQS lines to the wrong block of data lines at the DDR3 chip. The data swizzle of the K1 obviously could not compensate for this on those lines, since it only works on a per-block basis. This resulted in interchanged bytes within a 64-bit DDR3 access. As long as we accessed those bytes cached (i.e., in multiples of at least 32 bits), the byte alignment did not matter. However, we saw the problems described above with DMA transactions (which PCIe/our NIC obviously performs), since these sometimes write single bytes.
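Why full-width and 32-bit accesses survive while single-byte writes do not can be illustrated with a toy model of one 64-bit DRAM word. This is a sketch under stated assumptions, not a model of the actual K1 memory controller: `dq_map` is a hypothetical logical-to-physical byte-lane swizzle applied consistently on reads and writes, `dm_map` says which physical lane each lane's DM (data mask) pin really gates, and the DQS timing side of the miswiring is ignored entirely. When the two maps agree, every write lands correctly; when two lanes' DM assignments are swapped within the same 32-bit half, writes that enable both lanes still work, but a single-byte write enables the wrong physical lane.

```c
#include <stdint.h>

#define LANES 8
#define FILLER 0xEE /* stands in for undefined data driven on unused byte lanes */

/* Toy model of one 64-bit DRAM word behind a byte-lane swizzle (hypothetical). */
typedef struct {
    uint8_t dq_map[LANES]; /* logical byte lane -> physical DQ byte lane */
    uint8_t dm_map[LANES]; /* logical byte lane -> physical lane its DM pin actually gates */
    uint8_t cells[LANES];  /* stored bytes, indexed by physical lane */
} dram_word;

/* Masked write: bit i of byte_enable requests a write of logical byte i. */
void dram_write(dram_word *w, const uint8_t data[LANES], uint8_t byte_enable) {
    uint8_t bus[LANES];
    for (int p = 0; p < LANES; p++)
        bus[p] = FILLER;                    /* lanes not being written carry junk */
    for (int i = 0; i < LANES; i++)
        if (byte_enable & (1u << i))
            bus[w->dq_map[i]] = data[i];    /* data goes out on the swizzled lane */
    for (int i = 0; i < LANES; i++)
        if (byte_enable & (1u << i)) {
            uint8_t p = w->dm_map[i];       /* lane the DM pin really write-enables */
            w->cells[p] = bus[p];
        }
}

/* Full-width read, undoing the DQ swizzle. */
void dram_read(const dram_word *w, uint8_t out[LANES]) {
    for (int i = 0; i < LANES; i++)
        out[i] = w->cells[w->dq_map[i]];
}
```

For example, with `dq_map = {1,0,3,2,5,4,7,6}` and a miswired `dm_map = {1,3,0,2,5,4,7,6}` (DM of logical lanes 1 and 2 swapped), full 64-bit and aligned 32-bit writes read back correctly, but writing only byte 1 leaves byte 1 stale and overwrites byte 2 with junk.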