ConnectX-5 Ex card stops working after some time. lspci shows "rev FF" and the card does not respond to pings or any other Ethernet communication

I have 3 ConnectX-5 Ex cards, each inside a server and they are connected in a daisy chain (server1 → server2 → server3). After rebooting the servers all seems to be fine and I can ping from server1 to server2. After some time, the card in the middle stops working and all the communication stops. When this happens, the card in the middle shows “rev FF” in lspci and also in mst status.

Here is an example. The first lspci shows the status in normal mode. Then I use ping to talk to the server next to this machine. Eventually ping stops working, and the card then shows the rev FF message. A reboot gets the card working again (for a while).

[rcastro@simfarm14 ~]$ lspci | grep Ethernet

00:19.0 Ethernet controller: Intel Corporation 82579LM Gigabit Network Connection (Lewisville) (rev 06)

03:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]

03:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]

[rcastro@simfarm14 ~]$ ping 192.168.1.7

PING 192.168.1.7 (192.168.1.7) 56(84) bytes of data.

64 bytes from 192.168.1.7: icmp_seq=1 ttl=64 time=0.188 ms

64 bytes from 192.168.1.7: icmp_seq=124 ttl=64 time=0.146 ms

64 bytes from 192.168.1.7: icmp_seq=125 ttl=64 time=0.148 ms

64 bytes from 192.168.1.7: icmp_seq=126 ttl=64 time=0.168 ms

64 bytes from 192.168.1.7: icmp_seq=127 ttl=64 time=0.148 ms

64 bytes from 192.168.1.7: icmp_seq=128 ttl=64 time=0.146 ms

64 bytes from 192.168.1.7: icmp_seq=129 ttl=64 time=0.169 ms

64 bytes from 192.168.1.7: icmp_seq=130 ttl=64 time=0.152 ms

64 bytes from 192.168.1.7: icmp_seq=131 ttl=64 time=0.153 ms

From 192.168.1.4 icmp_seq=143 Destination Host Unreachable

From 192.168.1.4 icmp_seq=144 Destination Host Unreachable

From 192.168.1.4 icmp_seq=145 Destination Host Unreachable

From 192.168.1.4 icmp_seq=146 Destination Host Unreachable

From 192.168.1.4 icmp_seq=147 Destination Host Unreachable

From 192.168.1.4 icmp_seq=148 Destination Host Unreachable

From 192.168.1.4 icmp_seq=149 Destination Host Unreachable

From 192.168.1.4 icmp_seq=150 Destination Host Unreachable

From 192.168.1.4 icmp_seq=151 Destination Host Unreachable

From 192.168.1.4 icmp_seq=152 Destination Host Unreachable

^C

--- 192.168.1.7 ping statistics ---

154 packets transmitted, 131 received, +10 errors, 14.9351% packet loss, time 739ms

rtt min/avg/max/mdev = 0.072/0.147/0.188/0.021 ms, pipe 2

[rcastro@simfarm14 ~]$ lspci | grep Ethernet

00:19.0 Ethernet controller: Intel Corporation 82579LM Gigabit Network Connection (Lewisville) (rev 06)

03:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex] (rev ff)

03:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex] (rev ff)
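The before/after comparison above can be automated so that the failure is noticed as soon as it happens. As far as I understand, a revision of ff usually means that reads from the device's PCI configuration space return all ones, i.e. the device has stopped responding on the bus, but treat that as an assumption rather than a confirmed diagnosis. The following is a minimal sketch in Python that scans lspci text output for devices reported as "rev ff". The function name find_rev_ff and the SAMPLE constant are hypothetical names chosen for illustration; the sample text is copied from the second lspci listing above.

```python
import re

# Matches lines such as:
#   "03:00.0 Ethernet controller: Mellanox ... [ConnectX-5 Ex] (rev ff)"
# The PCI address may optionally carry a domain prefix ("0000:03:00.0").
_REV_FF = re.compile(
    r"^(?P<addr>(?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s.*\(rev ff\)\s*$",
    re.IGNORECASE,
)


def find_rev_ff(lspci_output):
    """Return the PCI addresses of all devices that lspci reports as 'rev ff'."""
    addresses = []
    for line in lspci_output.splitlines():
        match = _REV_FF.match(line.strip())
        if match:
            addresses.append(match.group("addr"))
    return addresses


# Sample input copied from the second lspci listing in this post.
SAMPLE = """\
00:19.0 Ethernet controller: Intel Corporation 82579LM Gigabit Network Connection (Lewisville) (rev 06)
03:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex] (rev ff)
03:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex] (rev ff)
"""

print(find_rev_ff(SAMPLE))  # ['03:00.0', '03:00.1']
```

On a live system the same function could be fed the real output, for example subprocess.run(["lspci"], capture_output=True, text=True).stdout, and called periodically so that the time of the failure can be logged and matched against dmesg.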

Is there a software or configuration issue I may be missing that could be causing this problem?

Thanks

Hi Rafael,

To debug this issue and find the root cause, I suggest opening a support case by sending an email to support@mellanox.com

When opening the support case, please provide the S/N and P/N of the faulty adapter, along with the following information:

  1. Server type

  2. Topology diagram (what is connected to what: back-to-back servers, or with a switch in between)

  3. Is it happening in one server only? With a specific card?

  4. Does replacing the card solve the issue?

  5. Does the issue persist when this card is inserted in another known-good server?

Thanks,

Samer