VL15 Dropped / PortRcvSwitchRelayErrors

kilian.schnelle · March 30, 2023, 3:57pm

Hello

I have an unmanaged HDR switch with several devices attached, and I am seeing errors that I don't know how to interpret or resolve.
Switch: MQM8790-HS2F
Cards in Device: ConnectX-6

I installed everything fresh and still see the problem. The servers are running Ubuntu 22.04 LTS with OFED 5.8-2.0.3.0-LTS.
IPoIB is configured like this, for example:

ibp33s0:
  critical: false
  addresses:
    - 192.168.124.1/22

When I run ibqueryerrors I get the following output, with the counters growing in large steps every second. I had cleared the errors and counters just before this, because they had even overflowed.

root@telly101:/sbdata# ibqueryerrors 
Errors for XXXXXXXXXX66 "Quantum Mellanox Technologies"
   GUID XXXXXXXXXX66 port ALL: [VL15Dropped == 1046 (1.021K)]
   GUID XXXXXXXXXX66 port 41: [VL15Dropped == 1128 (1.102K)]

## Summary: 9 nodes checked, 1 bad nodes found
##          49 ports checked, 1 ports have errors beyond threshold
## Thresholds: 
## Suppressed:

Also, on rare occasions this error shows up for every port that has a device connected. It is missing from the first output only because I had just reset the counters. It is always the same GUID.

   GUID XXXXXXXXXXXXXXXX66 port 11: [PortRcvSwitchRelayErrors == 5 (5.000)]
   GUID XXXXXXXXXXXXXXXX66 port 12: [PortRcvSwitchRelayErrors == 5 (5.000)]
   GUID XXXXXXXXXXXXXXXX66 port 13: [PortRcvSwitchRelayErrors == 5 (5.000)]
   GUID XXXXXXXXXXXXXXXX66 port 14: [PortRcvSwitchRelayErrors == 5 (5.000)]
   GUID XXXXXXXXXXXXXXXX66 port 15: [PortRcvSwitchRelayErrors == 5 (5.000)]
   GUID XXXXXXXXXXXXXXXX66 port 16: [PortRcvSwitchRelayErrors == 5 (5.000)]

OpenSM is running on one of the servers without reporting any errors.

Does anyone have an idea?

Cheers
Kilian

dwaxman · March 30, 2023, 6:12pm

Hi,

Port 41 is a logical port used when SHARP is enabled on the fabric (unless you have split ports, in which case it will be port 81).

If you are not using SHARP – it may be necessary to review the SM configuration and turn it off.

In the SM conf set:

sharp_enabled = 1

As for the errors and their meaning:

VL15 dropped – these errors indicate that SMPs are being dropped for some reason on the relevant ports. VL15 isn’t subject to flow control and isn’t expected to show this behavior, though it may well relate to a misconfiguration in the SM (as mentioned above).
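To judge how fast VL15Dropped is actually growing, two ibqueryerrors snapshots taken a known interval apart can be compared. The following is only a rough sketch: the function names are made up for illustration, the line format is assumed to match the output pasted earlier in this thread, and the second reading (1628) and the 10-second interval are invented sample values.

```python
import re

# Matches one counter group such as "[VL15Dropped == 1128 (1.102K)]".
COUNTER_RE = re.compile(r"\[(\w+) == (\d+)")
# Matches numbered ports only; "port ALL:" summary lines do not match.
PORT_RE = re.compile(r"port (\d+):")


def vl15_dropped(text):
    """Return {port: VL15Dropped} for numbered ports in ibqueryerrors output."""
    result = {}
    for line in text.splitlines():
        port = PORT_RE.search(line)
        if not port:
            continue  # header, summary, or "port ALL" line
        for name, value in COUNTER_RE.findall(line):
            if name == "VL15Dropped":
                result[int(port.group(1))] = int(value)
    return result


def drop_rate(before, after, seconds):
    """Per-port VL15Dropped increase per second between two snapshots."""
    old, new = vl15_dropped(before), vl15_dropped(after)
    return {p: (new[p] - old.get(p, 0)) / seconds for p in new}


# First line is from the thread; the second reading is a hypothetical value.
first = "   GUID XXXXXXXXXX66 port 41: [VL15Dropped == 1128 (1.102K)]"
second = "   GUID XXXXXXXXXX66 port 41: [VL15Dropped == 1628 (1.590K)]"
print(drop_rate(first, second, 10))  # {41: 50.0}
```

A steady rate on port 41 alone would be consistent with the SHARP-related explanation above, but this only measures the symptom and does not confirm the cause.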

PortRcvSwitchRelayErrors – these errors indicate that a port received a packet with a DLID that isn’t registered in the switch forwarding tables, so the switch doesn’t know which egress port to use for it. The numbers seem to be correlated across the ports, which suggests some entity tried sending packets to a LID that doesn’t exist in the fabric.
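The correlation between ports can be checked mechanically by grouping ports by their PortRcvSwitchRelayErrors value. This is a minimal sketch, assuming the same line format as the ibqueryerrors output in the first post; `ports_by_relay_errors` is a hypothetical helper name, not part of any Mellanox or infiniband-diags tool.

```python
import re
from collections import defaultdict

# Captures the port number and the PortRcvSwitchRelayErrors value on one line.
LINE_RE = re.compile(r"port (\d+): .*\[PortRcvSwitchRelayErrors == (\d+)")


def ports_by_relay_errors(text):
    """Group port numbers by their PortRcvSwitchRelayErrors counter value."""
    groups = defaultdict(list)
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match:
            groups[int(match.group(2))].append(int(match.group(1)))
    return dict(groups)


# Three of the lines posted in this thread.
sample = """\
   GUID XXXXXXXXXXXXXXXX66 port 11: [PortRcvSwitchRelayErrors == 5 (5.000)]
   GUID XXXXXXXXXXXXXXXX66 port 12: [PortRcvSwitchRelayErrors == 5 (5.000)]
   GUID XXXXXXXXXXXXXXXX66 port 13: [PortRcvSwitchRelayErrors == 5 (5.000)]
"""
print(ports_by_relay_errors(sample))  # {5: [11, 12, 13]}
```

If nearly all ports land in a single group, that fits the reading that the counters are correlated; many different values would point away from it.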

kilian.schnelle · March 31, 2023, 7:31am

I have everything on default settings: I basically just plugged the cables into the switch and servers, installed Ubuntu 22.04 and the OFED driver, configured netplan as mentioned, and ran systemctl start opensm. I then found this error when checking the IB status.

So I did no configuration on the SM.

kevino2 · March 31, 2023, 2:38pm

Hi,

A more accurate fabric analysis would require a support case in order to be able to share more data. If necessary, please open a case with Nvidia Technical Support.

Thank you
