I have three ConnectX-5 Ex cards, one in each of three servers, connected in a daisy chain (server1 → server2 → server3). After rebooting the servers, everything seems fine and I can ping from server1 to server2. After some time, the card in the middle server stops working and all communication stops. When this happens, the middle card shows “rev ff” in lspci and also in mst status.
Here is an example from the middle server. The first lspci shows the card in its normal state. I then ping the neighboring server, and eventually the pings stop getting replies. After that, lspci reports the card as rev ff. A reboot gets the card working again (for a while).
[rcastro@simfarm14 ~]$ lspci | grep Ethernet
00:19.0 Ethernet controller: Intel Corporation 82579LM Gigabit Network Connection (Lewisville) (rev 06)
03:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
03:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
[rcastro@simfarm14 ~]$ ping 192.168.1.7
PING 192.168.1.7 (192.168.1.7) 56(84) bytes of data.
64 bytes from 192.168.1.7: icmp_seq=1 ttl=64 time=0.188 ms
…
64 bytes from 192.168.1.7: icmp_seq=124 ttl=64 time=0.146 ms
64 bytes from 192.168.1.7: icmp_seq=125 ttl=64 time=0.148 ms
64 bytes from 192.168.1.7: icmp_seq=126 ttl=64 time=0.168 ms
64 bytes from 192.168.1.7: icmp_seq=127 ttl=64 time=0.148 ms
64 bytes from 192.168.1.7: icmp_seq=128 ttl=64 time=0.146 ms
64 bytes from 192.168.1.7: icmp_seq=129 ttl=64 time=0.169 ms
64 bytes from 192.168.1.7: icmp_seq=130 ttl=64 time=0.152 ms
64 bytes from 192.168.1.7: icmp_seq=131 ttl=64 time=0.153 ms
From 192.168.1.4 icmp_seq=143 Destination Host Unreachable
From 192.168.1.4 icmp_seq=144 Destination Host Unreachable
From 192.168.1.4 icmp_seq=145 Destination Host Unreachable
From 192.168.1.4 icmp_seq=146 Destination Host Unreachable
From 192.168.1.4 icmp_seq=147 Destination Host Unreachable
From 192.168.1.4 icmp_seq=148 Destination Host Unreachable
From 192.168.1.4 icmp_seq=149 Destination Host Unreachable
From 192.168.1.4 icmp_seq=150 Destination Host Unreachable
From 192.168.1.4 icmp_seq=151 Destination Host Unreachable
From 192.168.1.4 icmp_seq=152 Destination Host Unreachable
^C
--- 192.168.1.7 ping statistics ---
154 packets transmitted, 131 received, +10 errors, 14.9351% packet loss, time 739ms
rtt min/avg/max/mdev = 0.072/0.147/0.188/0.021 ms, pipe 2
[rcastro@simfarm14 ~]$ lspci | grep Ethernet
00:19.0 Ethernet controller: Intel Corporation 82579LM Gigabit Network Connection (Lewisville) (rev 06)
03:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex] (rev ff)
03:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex] (rev ff)
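As far as I understand, "rev ff" usually means that reads of the device's PCI configuration space are returning all ones, i.e. the card is no longer responding on the bus. To pin down when that happens, the lspci check above could be automated. The following is only a minimal sketch: the function names `find_rev_ff`, `read_lspci` and `watch` are hypothetical, and it assumes `lspci` is on the PATH and prints one device per line as in the output above.

```python
import subprocess
import time
from typing import List


def find_rev_ff(lspci_output: str) -> List[str]:
    """Return the PCI addresses of devices that lspci lists as (rev ff)."""
    addresses = []
    for line in lspci_output.splitlines():
        line = line.strip()
        # A device that has stopped responding is listed with "(rev ff)"
        # at the end of its line, as in the second lspci output above.
        if line.lower().endswith("(rev ff)"):
            addresses.append(line.split()[0])
    return addresses


def read_lspci() -> str:
    """Run plain `lspci` and return its text output."""
    result = subprocess.run(
        ["lspci"], capture_output=True, text=True, check=True
    )
    return result.stdout


def watch(interval_seconds: float = 5.0) -> List[str]:
    """Poll lspci until a device shows rev ff, then print a timestamp."""
    while True:
        dead = find_rev_ff(read_lspci())
        if dead:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            print(stamp, "rev ff on:", ", ".join(dead))
            return dead
        time.sleep(interval_seconds)
```

Running `watch()` on the middle server while the ping is going would give a timestamp that can be compared against dmesg or the system journal for messages logged around the same time.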
Is there a software or configuration issue I may be missing that could be causing this problem?
Thanks