VL15Dropped / PortRcvSwitchRelayErrors

Hello

I have an unmanaged HDR switch with several devices attached, and I am seeing some errors that I don't know how to interpret or solve.
Switch: MQM8790-HS2F
Cards in Device: ConnectX-6

I installed everything fresh and still see the problem. The servers are running Ubuntu 22.04 LTS with OFED 5.8-2.0.3.0-LTS.
IPoIB is simply configured in netplan like this, for example:

ibp33s0:
      critical: false
      addresses:
        - 192.168.124.1/22

When I run ibqueryerrors I get the following output, with the counter growing by a large amount every second. I had cleared the errors and counters just before this, because the counter had even overflowed.

root@telly101:/sbdata# ibqueryerrors 
Errors for XXXXXXXXXX66 "Quantum Mellanox Technologies"
   GUID XXXXXXXXXX66 port ALL: [VL15Dropped == 1046 (1.021K)]
   GUID XXXXXXXXXX66 port 41: [VL15Dropped == 1128 (1.102K)]

## Summary: 9 nodes checked, 1 bad nodes found
##          49 ports checked, 1 ports have errors beyond threshold
## Thresholds: 
## Suppressed:

Also, on rare occasions the following error comes up for every port that has a device connected. It is missing from the first output only because I had reset the counters. It is always the same GUID.

   GUID XXXXXXXXXXXXXXXX66 port 11: [PortRcvSwitchRelayErrors == 5 (5.000)]
   GUID XXXXXXXXXXXXXXXX66 port 12: [PortRcvSwitchRelayErrors == 5 (5.000)]
   GUID XXXXXXXXXXXXXXXX66 port 13: [PortRcvSwitchRelayErrors == 5 (5.000)]
   GUID XXXXXXXXXXXXXXXX66 port 14: [PortRcvSwitchRelayErrors == 5 (5.000)]
   GUID XXXXXXXXXXXXXXXX66 port 15: [PortRcvSwitchRelayErrors == 5 (5.000)]
   GUID XXXXXXXXXXXXXXXX66 port 16: [PortRcvSwitchRelayErrors == 5 (5.000)]

OpenSM is running on one of the servers without reporting any errors.

Does anyone have an idea?

Cheers
Kilian

Hi,

Port 41 is a logical port used when SHARP is enabled on the fabric (unless you have split ports, in which case it will be port 81).

If you are not using SHARP – it may be necessary to review the SM configuration and turn it off.

In the SM conf set:

sharp_enabled = 1
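
Before changing anything, it can help to check what the SM configuration currently says. The following is only a minimal sketch: the helper name `read_opensm_option` is hypothetical, the path in the comment (/etc/opensm/opensm.conf) is an assumption that may not match your installation, and the parser accepts both "key value" and "key = value" forms because the exact syntax may differ between OpenSM builds.

```python
def read_opensm_option(conf_text, key):
    """Return the value set for `key` in opensm.conf-style text, or None.

    Hypothetical helper: it only reads the text, it does not validate
    what the value means to OpenSM.
    """
    for raw in conf_text.splitlines():
        # Drop comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        # Accept both "key value" and "key = value".
        parts = line.replace("=", " ").split()
        if parts[0] == key:
            return parts[1] if len(parts) > 1 else ""
    return None


# Inline sample; on a real server you would read the file instead, e.g.
#   conf_text = open("/etc/opensm/opensm.conf").read()   # path is an assumption
sample_conf = """\
# SM options
sharp_enabled = 1
"""
print(read_opensm_option(sample_conf, "sharp_enabled"))
```

If the function returns None, the option is not set in that file and the SM is using its built-in default.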

As for the errors and their meaning:

VL15Dropped: these errors indicate that SMPs (subnet management packets) are being dropped for some reason on the relevant ports. VL15 is not subject to flow control and is not expected to show this behavior, though it may well be related to a misconfiguration in the SM (as mentioned above).

PortRcvSwitchRelayErrors: these errors indicate that a port received a packet whose DLID is not registered in the switch forwarding tables, so the switch does not know which egress port to use for it. The numbers appear to be correlated across the ports, which suggests that some entity tried to send packets to a LID that does not exist in the fabric.
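
To check whether the counters really are correlated across ports, the ibqueryerrors output can be parsed and grouped. This is a rough sketch, not a supported tool: the function names are made up for illustration, and the regular expressions assume the line format shown in the output earlier in this thread.

```python
import re
from collections import defaultdict

# Matches lines such as:
#    GUID XXXXXXXXXX66 port 41: [VL15Dropped == 1128 (1.102K)]
PORT_LINE = re.compile(r"^\s*GUID\s+(\S+)\s+port\s+(\S+):\s*(.*)$")
COUNTER = re.compile(r"\[(\w+) == (\d+)")


def parse_ibqueryerrors(text):
    """Return {(guid, port): {counter_name: value}} from ibqueryerrors text."""
    result = {}
    for line in text.splitlines():
        match = PORT_LINE.match(line)
        if not match:
            continue  # summary lines, node headers, blank lines
        guid, port, rest = match.groups()
        counters = {name: int(value) for name, value in COUNTER.findall(rest)}
        if counters:
            result[(guid, port)] = counters
    return result


def ports_by_value(parsed, counter):
    """Group port names by the value they report for one counter."""
    groups = defaultdict(list)
    for (_guid, port), counters in parsed.items():
        if counter in counters:
            groups[counters[counter]].append(port)
    return dict(groups)


sample = """\
   GUID XXXXXXXXXXXXXXXX66 port 11: [PortRcvSwitchRelayErrors == 5 (5.000)]
   GUID XXXXXXXXXXXXXXXX66 port 12: [PortRcvSwitchRelayErrors == 5 (5.000)]
   GUID XXXXXXXXXX66 port 41: [VL15Dropped == 1128 (1.102K)]
"""
parsed = parse_ibqueryerrors(sample)
print(ports_by_value(parsed, "PortRcvSwitchRelayErrors"))
```

If every connected port lands in the same group, that is consistent with one sender hitting an unknown LID. Running the parser on two captures taken a few seconds apart and subtracting the values would also show how fast VL15Dropped is growing.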

I have everything on defaults. I basically just plugged the cables into the switch and servers, installed Ubuntu 22.04 and the OFED driver, configured netplan as mentioned, and ran systemctl start opensm. I then found this error when checking the IB status.

So I did no configuration on the SM.

Hi,

A more accurate fabric analysis would require a support case so that more data can be shared. If necessary, please open a case with NVIDIA Technical Support.

Thank you