How to fix the HCA self test failure (Error Counter Check on CA #0 (HCA))?

When I run hca_self_test.ofed to test my InfiniBand configuration, the Error Counter Check on CA #0 (HCA) fails, as shown below. I tried rebooting the machine, but the error did not go away.

$ sudo /usr/bin/hca_self_test.ofed

---- Performing Adapter Device Self Test ----
Number of CAs Detected ................. 1
PCI Device Check ....................... PASS
Kernel Arch ............................ x86_64
Host Driver Version .................... OFED-internal-5.8-1.0.1: 4.15.0-200-generic
Host Driver RPM Check .................. PASS
Firmware on CA #0 HCA .................. v12.27.1016
Host Driver Initialization ............. PASS
Number of CA Ports Active .............. 1
Port State of Port #1 on CA #0 (HCA)..... UP 4X FDR (InfiniBand)
Error Counter Check on CA #0 (HCA)...... FAIL
    REASON: found errors in the following counters
      Errors in /sys/class/infiniband/mlx5_0/ports/1/counters
         port_rcv_errors: 320
         port_rcv_switch_relay_errors: 320
Kernel Syslog Check .................... PASS
Node GUID on CA #0 (HCA) ............... ec:0d:9a:03:00:c5:db:c0
------------------ DONE ---------------------

The HCA I am using is 88:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]. The detailed information for my HCA is as follows.

$ ibstat

CA 'mlx5_0'
	CA type: MT4115
	Number of ports: 1
	Firmware version: 12.27.1016
	Hardware version: 0
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 56
		Base lid: 4
		LMC: 0
		SM lid: 1
		Link layer: InfiniBand

I am at a bit of a loss and any help would be appreciated.

The errors themselves show that the switch is sending this port traffic that is addressed to the wrong destination LID (local identifier).

Have you tried clearing the counters on the device and then rerunning the self test?
(perfquery -R <lid> <port>; in your case it will be perfquery -R 4 1)
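
If it helps to see whether the counters really went back to zero before rerunning the self test, here is a minimal sketch that prints the non-zero values among a given set of counter files. The function name show_error_counters is made up for illustration, and the two counter names used in the example are simply the ones reported in the output above; the self test may look at others as well, so treat this as a rough check rather than a replacement for it.

```shell
#!/bin/sh
# show_error_counters DIR NAME...   (hypothetical helper, not part of OFED)
# Prints "name: value" for each named counter file under DIR whose value
# is a non-zero number. Prints nothing when all named counters are zero.
show_error_counters() {
    dir="$1"
    shift
    for name in "$@"; do
        file="$dir/$name"
        [ -r "$file" ] || continue          # skip counters that do not exist
        val=$(cat "$file")
        case "$val" in
            ''|*[!0-9]*) continue ;;        # skip empty or non-numeric content
        esac
        if [ "$val" != "0" ]; then
            echo "$name: $val"
        fi
    done
    return 0
}

# Example use on the machine from this thread (path taken from the
# self test output; LID 4 and port 1 taken from ibstat):
#   show_error_counters /sys/class/infiniband/mlx5_0/ports/1/counters \
#       port_rcv_errors port_rcv_switch_relay_errors
#   sudo perfquery -R 4 1
#   show_error_counters /sys/class/infiniband/mlx5_0/ports/1/counters \
#       port_rcv_errors port_rcv_switch_relay_errors
```

If the second call prints nothing, the counters were cleared. If the values start climbing again afterwards, the misaddressed traffic is still arriving and the cause is worth tracking down on the switch side.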

Thank you for your reply. I ran perfquery -R 4 1, and now hca_self_test.ofed passes!

---- Performing Adapter Device Self Test ----
Number of CAs Detected ................. 1
PCI Device Check ....................... PASS
Kernel Arch ............................ x86_64
Host Driver Version .................... OFED-internal-5.8-1.0.1: 4.15.0-200-generic
Host Driver RPM Check .................. PASS
Firmware on CA #0 HCA .................. v12.27.1016
Host Driver Initialization ............. PASS
Number of CA Ports Active .............. 1
Port State of Port #1 on CA #0 (HCA)..... UP 4X FDR (InfiniBand)
Error Counter Check on CA #0 (HCA)...... PASS
Kernel Syslog Check .................... PASS
Node GUID on CA #0 (HCA) ............... ec:0d:9a:03:00:c5:db:c0
------------------ DONE ---------------------

Thanks!
