NIC link is Down when using Dual Port (PCIe x4) Gigabit Ethernet Server Adapter Network Card - Intel...

Hello,

I have a network problem when using the Dual Port PCI Express (PCIe x4) Gigabit Ethernet Server Adapter Network Card - Intel i350 NIC (ST2000SPEXI) with an NVIDIA TX2.

The setup:

I have connected two devices (other dev boards) directly to the two ports of the PCIe card (ETH1 and ETH2) on the NVIDIA TX2. The TX2's onboard ETH0 is connected to another PC, which I use for SSH.

eth0 Link encap:Ethernet HWaddr 00:04:4b:c5:01:0d
inet addr:192.168.157.42 Bcast:192.168.157.255 Mask:255.255.255.0
inet6 addr: fe80::204:4bff:fec5:10d/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:1378709 errors:0 dropped:0 overruns:0 frame:0
TX packets:4139259 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:85100195 (85.1 MB) TX bytes:27988495868 (27.9 GB)
Interrupt:42

eth1 Link encap:Ethernet HWaddr 00:0a:cd:33:54:7b
inet addr:169.254.8.202 Bcast:169.254.8.255 Mask:255.255.255.0
inet6 addr: fe80::20a:cdff:fe33:547b/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:7463869 errors:0 dropped:0 overruns:0 frame:0
TX packets:112 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:516219413 (516.2 MB) TX bytes:11010 (11.0 KB)
Memory:50100000-5011ffff

eth2 Link encap:Ethernet HWaddr 00:0a:cd:33:54:7c
inet addr:169.254.9.202 Bcast:169.254.9.255 Mask:255.255.255.0
inet6 addr: fe80::20a:cdff:fe33:547c/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:7680832 errors:0 dropped:0 overruns:0 frame:0
TX packets:112 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:531224791 (531.2 MB) TX bytes:11009 (11.0 KB)
Memory:50120000-5013ffff

The problem:

At random, only ETH1 (on the PCIe card) loses its connection for about 1-2 seconds:

nvidia@tegra-ubuntu:~$ dmesg | grep igb
[ 7484.066821] igb 0000:01:00.0 eth1: igb: eth1 NIC Link is Down
[ 7485.771149] igb 0000:01:00.0 eth1: igb: eth1 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX
[ 7501.430821] igb 0000:01:00.0 eth1: igb: eth1 NIC Link is Down
[ 7503.063148] igb 0000:01:00.0 eth1: igb: eth1 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX

Basically, it behaves as if somebody were unplugging the cable from port ETH1 of the PCIe card.
The boards connected to ETH1 and ETH2 are identical, with identical configuration and code on them.
I never lose the connection when I touch the cable or simulate an imperfect contact.
The Ethernet driver is a built-in one (igb, I believe), and the version under /lib/modules is 4.4.38.

- I have looked at other topics about this problem, but nothing suggested there works.
- I have changed the cables, and the problem remains.

Please help me with some hints!
Thanks

Is the “ifconfig” output from before or after a failure? If from before, what do you see after a failure? Also, what is the output of the “route” command?

What is the output from “uname -r”? Is this also 4.4.38? Or something like 4.4.38-tegra? Are 100% of the required kernel modules at:

/lib/modules/$(uname -r)/

Are you running with a performance mode, or just a default mode? To set max performance the command differs slightly depending on release, but if this is a current release (you might want to mention the L4T release, e.g., from “head -n 1 /etc/nv_tegra_release”):

sudo nvpmodel -m 0
sudo jetson_clocks

…if previously you were not using performance mode, see if the lost ethernet is fixed by running performance mode.

Hello,

Thanks for answering me!

The “ifconfig” output is from after ETH1 had lost its link many times. I am using the same data source and sending the same packets to ETH1 and ETH2, so running ifconfig shows the loss:

Receiving side (ETH1 and ETH2):

ETH1:

RX bytes:516219413 (516.2 MB) TX bytes:11010 (11.0 KB)

ETH2:

RX bytes:531224791 (531.2 MB) TX bytes:11009 (11.0 KB)

Right after boot, as long as ETH1 has not yet gone down, the two RX byte counts match.
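The comparison above can also be read straight from sysfs instead of from ifconfig. This is only a rough sketch: `rx_gap` and the `SYSFS_NET` override are names made up for illustration (the override exists purely so the function can be tried against a fake directory), and it assumes the standard `/sys/class/net/<iface>/statistics/rx_bytes` counters.

```shell
#!/bin/sh
# Print the RX byte counters of two interfaces and the gap between them.
# SYSFS_NET is a hypothetical override for testing; it defaults to /sys/class/net.
rx_gap() {
    a="$1"
    b="$2"
    root="${SYSFS_NET:-/sys/class/net}"
    # Each counter is a plain decimal number in its own file.
    rx_a=$(cat "$root/$a/statistics/rx_bytes") || return 1
    rx_b=$(cat "$root/$b/statistics/rx_bytes") || return 1
    echo "$a=$rx_a $b=$rx_b gap=$(( rx_b - rx_a ))"
}

# Example (on the TX2): rx_gap eth1 eth2
```

Running it every few seconds while both boards send the same traffic should show the gap growing only when ETH1 drops.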

I watched the LEDs (green and yellow) on ETH1: when the link goes down, both LEDs turn off for 1-2 seconds, matching the timestamps in the “dmesg” log:

[ 7484.066821] igb 0000:01:00.0 eth1: igb: eth1 NIC Link is Down
[ 7485.771149] igb 0000:01:00.0 eth1: igb: eth1 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX
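To put numbers on the "1-2 seconds", the Down/Up pairs in dmesg can be paired up automatically. The sketch below assumes the igb message format shown in this thread; `link_down_durations` is a made-up helper name, not part of any tool.

```shell
#!/bin/sh
# Read dmesg-style lines on stdin and print how long each link-down episode lasted.
# Assumes lines such as:
#   [ 7484.066821] igb 0000:01:00.0 eth1: igb: eth1 NIC Link is Down
link_down_durations() {
    iface="$1"
    awk -v ifc="$iface" '
        # Skip anything that is not an igb link message for this interface.
        index($0, "igb: " ifc " NIC Link is ") == 0 { next }
        {
            # Extract the kernel timestamp from the leading "[ 1234.567890]".
            ts = $0
            sub(/^\[ */, "", ts)
            sub(/\].*/, "", ts)
        }
        / NIC Link is Down/ { down = ts; have_down = 1; next }
        / NIC Link is Up/ && have_down {
            printf "%s down at %.6f for %.3f s\n", ifc, down, ts - down
            have_down = 0
        }
    '
}

# Example (on the TX2): dmesg | link_down_durations eth1
```

A list of the drop times makes it easier to see whether the drops follow any pattern or line up with other events.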

Running route:

nvidia@tegra-ubuntu:~$ route
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
default router.asus.com 0.0.0.0 UG 600 0 0 wlan0
link-local * 255.255.0.0 U 1000 0 0 eth0
169.254.8.0 * 255.255.255.0 U 0 0 0 eth1
169.254.9.0 * 255.255.255.0 U 0 0 0 eth2
172.17.0.0 * 255.255.0.0 U 0 0 0 docker0
173.16.250.0 * 255.255.255.0 U 600 0 0 wlan0
192.168.55.0 * 255.255.255.0 U 0 0 0 l4tbr0
192.168.157.0 * 255.255.255.0 U 0 0 0 eth0

Running uname:

nvidia@tegra-ubuntu:~$ uname -r
4.4.38

On the root filesystem, next to the “4.4.38” folder there is another one, “4.4.38-tegra”.

nvidia@tegra-ubuntu:~$ /lib/modules/$(uname -r)/
-bash: /lib/modules/4.4.38/: Is a directory

I am running in performance mode with all 6 processors active.

Running release version:

nvidia@tegra-ubuntu:~$ head -n 1 /etc/nv_tegra_release

R28 (release), REVISION: 2.1, GCID: 11272647, BOARD: t186ref, EABI: aarch64, DATE: Thu May 17 07:29:06 UTC 2018

The ifconfig output itself shows no errors.

Do you have actual modules in the “/lib/modules/4.4.38/” directory? If not, then modules will be unreachable. I don’t know if your custom kernel config is part of the issue or not, but if there is a need for some unreachable module, then this will be a problem.

R28.2.1 is a bit old; there is an R28.3.1 which should be “mostly” compatible:
https://developer.nvidia.com/embedded/linux-tegra-archive

The NIC which is starting/stopping is at 100Mb/s. The adapter is gigabit. What is the adapter talking to? A gigabit switch? Directly to a camera or other appliance? Whatever the failing NIC is connected to, should it be gigabit capable?
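One quick way to see what each port negotiated is the `speed` file in sysfs, which reports the link speed in Mb/s. The following is a sketch only: `link_speed_report` and the `SYSFS_NET` override are hypothetical names for illustration, and the 1000 Mb/s default simply reflects that the adapter is gigabit.

```shell
#!/bin/sh
# Report the negotiated speed of an interface and flag it if below what is expected.
# SYSFS_NET is a hypothetical override for testing; it defaults to /sys/class/net.
link_speed_report() {
    iface="$1"
    expected="${2:-1000}"
    root="${SYSFS_NET:-/sys/class/net}"
    # Reading the file fails when the interface is down or does not exist.
    speed=$(cat "$root/$iface/speed" 2>/dev/null) || {
        echo "$iface: speed unavailable"
        return 2
    }
    case "$speed" in
        ''|-1) echo "$iface: no link"; return 2 ;;
    esac
    if [ "$speed" -lt "$expected" ]; then
        echo "$iface: negotiated $speed Mb/s, below expected $expected Mb/s"
        return 1
    fi
    echo "$iface: negotiated $speed Mb/s"
}

# Example (on the TX2): link_speed_report eth1; link_speed_report eth2
```

If the device on the other end is only 100 Mb/s capable, a 100 Mb/s result is expected; `ethtool eth1` should give more detail on what each side advertises.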

We cannot say much here since we do not have this device on our side.

  1. Does this happen intermittently? It sounds like you are not able to reproduce it at will, right?

  2. Does dmesg print any PCIe messages when the issue happens? If not, I think it is specific to the igb driver or a connection problem.
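To check for PCIe messages near a drop, it may help to cut dmesg down to a few seconds around the "Link is Down" timestamp. This is just a sketch; `dmesg_window` is a made-up helper and the 5-second default window is an arbitrary assumption.

```shell
#!/bin/sh
# Print only the dmesg lines whose timestamp lies within +/- span seconds of center.
dmesg_window() {
    center="$1"
    span="${2:-5}"
    awk -v c="$center" -v s="$span" '
        {
            # Extract the kernel timestamp from the leading "[ 1234.567890]".
            ts = $0
            sub(/^\[ */, "", ts)
            sub(/\].*/, "", ts)
        }
        # Ignore lines without a numeric timestamp; "+ 0" forces a numeric compare.
        ts ~ /^[0-9.]+$/ && ts + 0 >= c - s && ts + 0 <= c + s { print }
    '
}

# Example (on the TX2):
#   dmesg | dmesg_window 7484.066821 5 | grep -i -E 'pcie|aer'
```

An empty result from the example would suggest that nothing PCIe-related was logged around that drop.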

Hello Linuxdev,

In “/lib/modules/4.4.38/” directory:

nvidia@tegra-ubuntu:/lib/modules/4.4.38$ ls
build modules.builtin modules.devname modules.symbols.bin
kernel modules.builtin.bin modules.order source
modules.alias modules.dep modules.softdep t18x
modules.alias.bin modules.dep.bin modules.symbols

In “/lib/modules/4.4.38-tegra/” directory:

nvidia@tegra-ubuntu:/lib/modules/4.4.38-tegra$ ls
build modules.dep.bin modules.seriomap
kernel modules.devname modules.softdep
modules.alias modules.ieee1394map modules.symbols
modules.alias.bin modules.inputmap modules.symbols.bin
modules.builtin modules.isapnpmap modules.usbmap
modules.builtin.bin modules.ofmap t18x
modules.ccwmap modules.order
modules.dep modules.pcimap

The adapter (dual-port PCIe LAN card) is talking to two identical “Texas Instruments” development boards, one connected to ETH1 and the other to ETH2.

Following your suggestion, I tested the port with a PC and noticed that ETH1 does not drop.

But that does not make sense, since both “Texas Instruments” boards have the same configuration and the one connected to ETH2 does not lose its link.

I will run more tests to see what is happening, but it seems the problem is somewhere here.

I will reply when I have more details.

Hello WayneWWW,

  1. I cannot reproduce it on demand. It happens suddenly, sometimes minutes apart and sometimes seconds apart, but the connection is never lost for more than 1-2 seconds.

  2. There is no PCIe message at the time of the drop, only this “igb” message.

    I have checked on another NVIDIA board with the same adapter, and it happens in the same way.

Please see the comment posted for “linuxdev”.
I will investigate the two “Texas Instruments” boards further, even though their configuration is the same and only one loses its connection.

With a regular PC and one “Texas Instruments” board, both ETH1 and ETH2 work fine, with no lost connection.

It does not matter which one I connect.

It seems to have something to do with connecting both boards at once.

I will let you know more details.

Unless there are further subdirectories in “/lib/modules/4.4.38/”, then all of your modules are missing (I may have not made it clear, but modules are in a tree within that root location…I’m assuming from here there are no modules, but maybe there are…if you have the modules, then ignore this). The files listed in your “/lib/modules/4.4.38/” parent directory are just metadata. Try instead:

find /lib/modules/$(uname -r)/ -name '*.ko*'

…compare the list to the files in “/lib/modules/4.4.38-tegra/”, and you’ll see a lot of files not listed in the other directory. These are all missing drivers and features. You need to either rebuild the kernel with the “-tegra” CONFIG_LOCALVERSION (resulting in “uname -r” of “4.4.38-tegra”), or else build all modules and install them to the “/lib/modules/4.4.38/” location.

If the base Image has changed enough, then you should rebuild all modules and install them. If you just added some feature and did not radically alter the kernel Image, then you should rebuild the Image with CONFIG_LOCALVERSION set.
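The comparison between the two module trees can be scripted roughly as follows. This is a sketch rather than a definitive method: `missing_modules` is a hypothetical helper name, and it compares module paths relative to each tree, so a module that only moved to a different subdirectory would also be reported.

```shell
#!/bin/sh
# List kernel modules that exist under the "want" tree but not under the "have" tree.
missing_modules() {
    have="$1"
    want="$2"
    tmp_have=$(mktemp) || return 1
    tmp_want=$(mktemp) || return 1
    # Collect relative module paths from each tree, sorted for comm.
    (cd "$have" && find . -name '*.ko*' | sort) > "$tmp_have"
    (cd "$want" && find . -name '*.ko*' | sort) > "$tmp_want"
    # comm -13 prints the lines that appear only in the second file.
    comm -13 "$tmp_have" "$tmp_want"
    rm -f "$tmp_have" "$tmp_want"
}

# Example (on the TX2):
#   missing_modules /lib/modules/4.4.38 /lib/modules/4.4.38-tegra
```

Anything printed is a module that the running “4.4.38” kernel cannot find in its own tree.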

FYI, the file “/proc/config.gz” is a reflection in RAM of the current kernel build settings. You can list all module configurations (and thus everything that is missing) via:

gunzip < /proc/config.gz | grep '=m'
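To check a single driver rather than the whole list, the same config stream can be filtered for one symbol. The sketch below assumes `CONFIG_IGB` is the symbol that controls the igb driver; `config_state` is a made-up helper name.

```shell
#!/bin/sh
# Read a kernel config on stdin and report whether a symbol is built-in, a module, or unset.
config_state() {
    sym="$1"
    awk -F= -v sym="$sym" '
        $1 == sym {
            found = 1
            if ($2 == "y")      print sym ": built-in"
            else if ($2 == "m") print sym ": module"
            else                print sym ": " $2
        }
        # Lines like "# CONFIG_FOO is not set" never match, so fall through to here.
        END { if (!found) print sym ": not set" }
    '
}

# Example (on the TX2): gunzip < /proc/config.gz | config_state CONFIG_IGB
```

A "built-in" result would mean the igb driver itself does not depend on the module directory, even if other drivers do.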