OpenSM discovering same port over and over

OpenSM 5.10.0 message state_mgr_report_new_ports: Discovered new port with GUID...
is repeated every few minutes.

The host behind this port has not changed at all in that interval (no reboot, no activity, no errors).

This is a ConnectX-3 HCA connected to an HDR switch. We have many of these combinations, all of them working (this one included), but the others do not show up in the OpenSM log.

What is going wrong here?

Hi,
Can you paste more of the OpenSM log?
Also, I think connecting an FDR card to an HDR switch is not recommended.
It is not tested.

Thanks,
Suo

Well, I think Mellanox must have tested it - otherwise, where does the compatibility matrix come from? ;-)
Also, we have extensively tested it - works.

In this case, the relevant log lines read

Jul 31 02:10:29 083930 [2A3E0700] 0x01 -> log_trap_info: Received Generic Notice type:1 num:128 (Link state change) Producer:2 (Switch) from LID:1 TID:0x000008a500000080
Jul 31 02:10:29 084014 [2A3E0700] 0x02 -> SM class trap 128: Directed Path Dump of 3 hop path: Path = 0,1,1,16
Jul 31 02:10:29 084027 [2A3E0700] 0x02 -> log_notice: Reporting Generic Notice type:1 num:128 (Link state change) from LID:1 GID:fe80::1070:fd03:58:c956
Jul 31 02:10:29 398864 [323F0700] 0x02 -> osm_spst_rcv_process: Switch 0x1070fd030058c956 MF0;hdrleaf-gc5a-28:MQM8700/U1 port 77 changed state from ACTIVE to DOWN 
Jul 31 02:10:29 500813 [1B1C2700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:67 (Mcast group deleted) from LID:762 GID:ff12:601b:ffff::1:ffe8:841
Jul 31 02:10:29 500836 [1B1C2700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:65 (GID out of service) from LID:762 GID:fe80::248a:703:e8:841
Jul 31 02:10:29 500967 [1B1C2700] 0x02 -> drop_mgr_remove_port: Removed port with GUID:0x248a070300e80841 LID range [162, 162] of node:lxmds21 mlx4_0

So, LID 1 is the switch, also named hdrleaf-gc5a-28. LID 162, aka lxmds21, allegedly dropped out, but the machine itself did not notice anything.
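To see which switch ports are reported as going down, and how often, the osm_spst_rcv_process lines can be pulled out of the log with a small script. This is only a sketch: the regular expression is modeled on the single log line quoted above, and the function names (port_state_changes, down_counts) are made up for illustration, not part of OpenSM.

```python
import re
from collections import Counter

# Pattern modeled on the osm_spst_rcv_process line quoted above (assumption:
# other OpenSM versions may format this message differently).
STATE_CHANGE = re.compile(
    r"(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) .*"
    r"osm_spst_rcv_process: Switch (?P<guid>0x[0-9a-fA-F]+) .*?"
    r"port (?P<port>\d+) changed state from (?P<old>\w+) to (?P<new>\w+)"
)


def port_state_changes(lines):
    """Return (timestamp, switch GUID, port, old state, new state) tuples."""
    events = []
    for line in lines:
        m = STATE_CHANGE.search(line)
        if m:
            events.append(
                (m["ts"], m["guid"], int(m["port"]), m["old"], m["new"])
            )
    return events


def down_counts(events):
    """Count transitions to DOWN per (switch GUID, port)."""
    return Counter(
        (guid, port) for _, guid, port, _, new in events if new == "DOWN"
    )
```

Feeding the whole log file through port_state_changes and then down_counts would show whether the link drops are confined to the two switch ports in question or occur on other ports of the same switch as well.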

If I just look for occurrences,

~# grep 'Discovered new port' /var/log/opensm.0x043f720300fe9f00.log
Jul 31 02:11:10 527863 [1B1C2700] 0x02 -> state_mgr_report_new_ports: Discovered new port with GUID:0x248a070300e80841 LID range [162,162] of node: lxmds21 mlx4_0
Jul 31 02:11:44 464605 [1B1C2700] 0x02 -> state_mgr_report_new_ports: Discovered new port with GUID:0x248a070300e80841 LID range [162,162] of node: lxmds21 mlx4_0
Jul 31 02:13:59 011818 [1B1C2700] 0x02 -> state_mgr_report_new_ports: Discovered new port with GUID:0x248a070300e80841 LID range [162,162] of node: lxmds21 mlx4_0
Jul 31 02:16:22 202813 [1B1C2700] 0x02 -> state_mgr_report_new_ports: Discovered new port with GUID:0x248a070300e80841 LID range [162,162] of node: lxmds21 mlx4_0
Jul 31 03:56:11 374611 [1B1C2700] 0x02 -> state_mgr_report_new_ports: Discovered new port with GUID:0x98039b0300d022a1 LID range [163,163] of node: lxmds23 mlx4_0
Jul 31 05:06:12 049500 [1B1C2700] 0x02 -> state_mgr_report_new_ports: Discovered new port with GUID:0x248a070300e80841 LID range [162,162] of node: lxmds21 mlx4_0
Jul 31 05:28:04 027643 [1B1C2700] 0x02 -> state_mgr_report_new_ports: Discovered new port with GUID:0x98039b0300d022a1 LID range [163,163] of node: lxmds23 mlx4_0

there is another machine, lxmds23, that is also completely idle and not rebooting.
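The grep output can also be summarized per node, including the time between consecutive rediscoveries. The following is a rough sketch under a few assumptions: the regular expression is derived only from the lines shown above, the log timestamps carry no year (so a placeholder year is passed in and all entries are assumed to fall in the same year), and the function names are hypothetical.

```python
import re
from collections import defaultdict
from datetime import datetime

# Pattern modeled on the state_mgr_report_new_ports lines quoted above.
NEW_PORT = re.compile(
    r"(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) .*"
    r"state_mgr_report_new_ports: Discovered new port with "
    r"GUID:(?P<guid>0x[0-9a-fA-F]+).*of node:\s*(?P<node>.+?)\s*$"
)


def rediscoveries(lines, year=2000):
    """Map (node description, GUID) -> list of discovery times.

    The year is a placeholder (assumption): the log lines do not contain one.
    """
    seen = defaultdict(list)
    for line in lines:
        m = NEW_PORT.search(line)
        if m:
            ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
            seen[(m["node"], m["guid"])].append(ts)
    return seen


def gaps_in_seconds(times):
    """Seconds between consecutive discovery events."""
    times = sorted(times)
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


def summarize(path, year=2000):
    """Print count and gaps per node for an OpenSM log file."""
    with open(path, errors="replace") as fh:
        for (node, guid), times in rediscoveries(fh, year).items():
            print(node, guid, len(times), gaps_in_seconds(times))
```

Calling summarize on the OpenSM log would print one line per affected port, which makes it easier to tell a port that flaps every couple of minutes from one that drops only once in a few hours.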

If this is perhaps connected to the HCA hardware, it wouldn't matter so much; we could replace it.

If this rather points to an issue with the switch, it is more urgent, because there are more important machines connected to that switch.

Thanks,
Thomas

Hi,
Please check the firmware release notes:

Firmware Interoperability

This is what we tested.
Thanks,
Suo

https://docs.nvidia.com/networking/display/NVIDIAQuantumFirmwarev2720106102/Firmware+Compatible+Products

Thanks, Suo.
I faintly remembered the point about ports #27-34, but did not pay enough attention to it when setting up our current fabric.

We do have ConnectX-3 ↔ Quantum connections on these “wrong” ports - seeing no problems so far. But this can be recabled, of course.

However, the two machines I worried about here are connected to “allowed” ports on the Quantum switch.
According to your firmware compatibility matrix, these should work.

Anyhow, I will have the problematic machines relocated, which should at least work around this issue.

Regards,
Thomas