ConnectX-5 tx_pci_signal_integrity

I am working with a Weka.IO cluster built on Supermicro SYS-2029BT-HNR systems using ConnectX-5 adapters. I’m seeing this:

[root@wnode1 ~]# ethtool -S enp59s0f0 | grep pci
rx_pci_signal_integrity: 0
tx_pci_signal_integrity: 3
outbound_pci_stalled_rd: 0
outbound_pci_stalled_wr: 0
outbound_pci_stalled_rd_events: 0
outbound_pci_stalled_wr_events: 0
[root@wnode1 ~]#

Engineering at Weka.IO is concerned about this. I’ve updated the BIOS on one of the servers to the latest version, but the counters still show errors as above. Does anyone have any ideas? This is happening on all 20 nodes in the cluster, so I have a hard time believing it’s bad hardware (in bulk).

Thanks!

Ken

Hi Ken,

tx_pci_signal_integrity counts physical-layer PCIe signal integrity errors, that is, the number of transitions to recovery initiated by the other side of the link.
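
Since the counter is cumulative, it can help to know whether it is still incrementing or only recorded a few events earlier (for example at link bring-up). The following is a minimal sketch, not an official tool: it parses `ethtool -S` text into a dictionary and reports which PCI-related counters changed between two snapshots. The function names (`parse_ethtool_stats`, `pci_counter_deltas`, `read_stats`) are hypothetical, and matching on the substring "pci" is an assumption based on the counter names shown above.

```python
import subprocess


def parse_ethtool_stats(text):
    """Parse `ethtool -S` output into a {counter_name: int} dictionary.

    Lines that are not of the form "name: integer" (such as the
    "NIC statistics:" header or a shell prompt) are skipped.
    """
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(":")
        value = value.strip()
        if sep and value.lstrip("-").isdigit():
            stats[name.strip()] = int(value)
    return stats


def pci_counter_deltas(before, after):
    """Return the PCI-related counters whose value changed between snapshots."""
    return {
        name: after[name] - before.get(name, 0)
        for name in after
        if "pci" in name and after[name] != before.get(name, 0)
    }


def read_stats(iface):
    """Run `ethtool -S <iface>` and parse it (needs ethtool, usually root)."""
    result = subprocess.run(
        ["ethtool", "-S", iface],
        capture_output=True, text=True, check=True,
    )
    return parse_ethtool_stats(result.stdout)
```

To use it, call `read_stats("enp59s0f0")` once, wait a while under normal traffic, call it again, and pass both results to `pci_counter_deltas`. An empty result means none of the PCI counters moved during that interval.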

In addition to the BIOS version, can you please confirm that you are running the latest available firmware (16.24.1000)?
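
The running firmware version is reported on the `firmware-version` line of `ethtool -i <interface>`. As a rough sketch for checking many nodes, the helper below extracts that field from the command's text output and compares it with 16.24.1000. The function names are hypothetical, and the comparison assumes a purely numeric dotted version string.

```python
def firmware_version(ethtool_i_output):
    """Extract the version from the `firmware-version:` line of `ethtool -i`.

    Returns the first whitespace-separated token after the colon, or None
    if the line is missing or empty.
    """
    for line in ethtool_i_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "firmware-version":
            parts = value.split()
            return parts[0] if parts else None
    return None


def version_tuple(version):
    """Turn a dotted numeric version such as "16.24.1000" into (16, 24, 1000)."""
    return tuple(int(part) for part in version.split("."))


def is_at_least(version, minimum="16.24.1000"):
    """Return True if `version` is the same as or newer than `minimum`."""
    return version_tuple(version) >= version_tuple(minimum)
```

Comparing integer tuples rather than raw strings avoids ordering mistakes when the fields have different digit counts.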

Regards,

Chen

Hi Chen,

I have tried both BIOS version 2.1a (which they came with) and 3.0c. I am running the latest firmware on the cards. I’ve just tried moving one card to the other slot in the server and am still seeing the errors.

–Ken