I am working with a Weka.IO cluster built on Supermicro SYS-2029BT-HNR systems using ConnectX-5 adapters. I’m seeing this:
[root@wnode1 ~]# ethtool -S enp59s0f0 | grep pci
rx_pci_signal_integrity: 0
tx_pci_signal_integrity: 3
outbound_pci_stalled_rd: 0
outbound_pci_stalled_wr: 0
outbound_pci_stalled_rd_events: 0
outbound_pci_stalled_wr_events: 0
[root@wnode1 ~]#
Engineering at Weka.IO is concerned about this. I’ve updated the bios on one of the servers to the latest, but the counters are still showing errors as above. Anyone have any ideas? This is happening on all 20 nodes in the cluster, so I have a hard time believing it’s bad hardware (in bulk).
Thanks!
Ken