Hello
Running hca_self_test.ofed produces the following output:
---- Performing Adapter Device Self Test ----
Number of CAs Detected … 1
PCI Device Check … PASS
Kernel Arch … x86_64
Host Driver Version … MLNX_OFED_LINUX-5.8-1.1.2.1 (OFED-5.8-1.1.2): 4.18.0-425.10.1.el8_7.x86_64
Host Driver RPM Check … PASS
Firmware on CA #0 HCA … v20.35.2000
Host Driver Initialization … PASS
Number of CA Ports Active … 1
Port State of Port #1 on CA #0 (HCA)… UP 4X EDR (InfiniBand)
Error Counter Check on CA #0 (HCA)… FAIL
REASON: found errors in the following counters
Errors in /sys/class/infiniband/mlx5_0/ports/1/counters
port_rcv_remote_physical_errors: 39
Kernel Syslog Check … PASS
Node GUID on CA #0 (HCA) … e8:eb:d3:03:00:a6:10:62
------------------ DONE ---------------------
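For reference, the counter that the self test flags can also be read directly from the sysfs directory printed above. The following is only a minimal sketch: the function name `nonzero_error_counters` and the rule for skipping traffic counters (names ending in `_data`, `_packets`, or `_wait`) are my own assumptions, not part of hca_self_test.ofed.

```python
from pathlib import Path

# Assumption: counters with these suffixes measure traffic, not errors.
TRAFFIC_SUFFIXES = ("_data", "_packets", "_wait")


def nonzero_error_counters(counters_dir):
    """Return {name: value} for every non-traffic counter file with a nonzero value."""
    result = {}
    for entry in sorted(Path(counters_dir).iterdir()):
        if not entry.is_file() or entry.name.endswith(TRAFFIC_SUFFIXES):
            continue
        try:
            value = int(entry.read_text().strip())
        except (OSError, ValueError):
            # Skip files that are unreadable or do not hold a plain integer.
            continue
        if value != 0:
            result[entry.name] = value
    return result


if __name__ == "__main__":
    # Path taken from the self-test output above.
    path = "/sys/class/infiniband/mlx5_0/ports/1/counters"
    if Path(path).is_dir():
        for name, value in nonzero_error_counters(path).items():
            print(f"{name}: {value}")
```

On this host I would expect it to print only port_rcv_remote_physical_errors, matching the self-test report.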
Output of perfquery command:
Port counters: Lid 58 port 1 (CapMask: 0x5A00)
PortSelect:…1
CounterSelect:…0x0000
SymbolErrorCounter:…0
LinkErrorRecoveryCounter:…0
LinkDownedCounter:…0
PortRcvErrors:…0
PortRcvRemotePhysicalErrors:…39
PortRcvSwitchRelayErrors:…0
PortXmitDiscards:…0
PortXmitConstraintErrors:…0
PortRcvConstraintErrors:…0
CounterSelect2:…0x00
LocalLinkIntegrityErrors:…0
ExcessiveBufferOverrunErrors:…0
QP1Dropped:…0
VL15Dropped:…0
PortXmitData:…4294967295
PortRcvData:…4294967295
PortXmitPkts:…4294967295
PortRcvPkts:…4294967295
PortXmitWait:…4294967295
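To make the listing above easier to check, here is a small sketch that parses perfquery-style text, reports the nonzero error counters, and flags counters sitting at 4294967295 (2^32 - 1), a value that suggests the 32-bit data counters have saturated. The function names and the set of fields treated as non-error are assumptions of mine; the parser also accepts the "…" character that this post shows in place of the dots.

```python
import re

SATURATED_32BIT = 2**32 - 1  # 4294967295

# Assumption: these fields are selectors or traffic counters, not error counters.
NON_ERROR_FIELDS = {
    "PortSelect", "CounterSelect", "CounterSelect2",
    "PortXmitData", "PortRcvData", "PortXmitPkts", "PortRcvPkts", "PortXmitWait",
}

# Matches "Name:....value" where the filler is dots or the ellipsis character.
LINE_RE = re.compile(r"^\s*(\w+):[.\u2026]*\s*(0x[0-9a-fA-F]+|\d+)\s*$")


def parse_perfquery(text):
    """Parse 'Name:...value' lines into a dict of integers; other lines are ignored."""
    counters = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            raw = match.group(2)
            base = 16 if raw.lower().startswith("0x") else 10
            counters[match.group(1)] = int(raw, base)
    return counters


def nonzero_errors(counters):
    """Return the error counters whose value is not zero."""
    return {k: v for k, v in counters.items()
            if k not in NON_ERROR_FIELDS and v != 0}


def saturated(counters):
    """Return the names of counters pinned at the 32-bit maximum."""
    return [k for k, v in counters.items() if v == SATURATED_32BIT]
```

Fed the output above, `nonzero_errors` should return only PortRcvRemotePhysicalErrors, and `saturated` should list the five data, packet, and wait counters.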
I tried perfquery -R 4 1, following the thread "How to fix the HCA Self Test Fail (Error Counter Check on CA #0 (HCA))?", but the error remains.
Update:
iblinkinfo reports the following for the device:
9 27[ ] ==( 4X 25.78125 Gbps Active/ LinkUp)==> 58 1[ ] "servcxxxx01 HCA-1" ( )
9 28[ ] ==( Down/ Polling)==> [ ] "" ( )
Any help would be greatly appreciated.
Best regards