Well, I think Mellanox must have tested it - otherwise, where does the compatibility matrix come from? ;-)
Also, we have tested it extensively ourselves - it works.
In this case, the relevant log lines read
Jul 31 02:10:29 083930 [2A3E0700] 0x01 -> log_trap_info: Received Generic Notice type:1 num:128 (Link state change) Producer:2 (Switch) from LID:1 TID:0x000008a500000080
Jul 31 02:10:29 084014 [2A3E0700] 0x02 -> SM class trap 128: Directed Path Dump of 3 hop path: Path = 0,1,1,16
Jul 31 02:10:29 084027 [2A3E0700] 0x02 -> log_notice: Reporting Generic Notice type:1 num:128 (Link state change) from LID:1 GID:fe80::1070:fd03:58:c956
Jul 31 02:10:29 398864 [323F0700] 0x02 -> osm_spst_rcv_process: Switch 0x1070fd030058c956 MF0;hdrleaf-gc5a-28:MQM8700/U1 port 77 changed state from ACTIVE to DOWN
Jul 31 02:10:29 500813 [1B1C2700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:67 (Mcast group deleted) from LID:762 GID:ff12:601b:ffff::1:ffe8:841
Jul 31 02:10:29 500836 [1B1C2700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:65 (GID out of service) from LID:762 GID:fe80::248a:703:e8:841
Jul 31 02:10:29 500967 [1B1C2700] 0x02 -> drop_mgr_remove_port: Removed port with GUID:0x248a070300e80841 LID range [162, 162] of node:lxmds21 mlx4_0
So, LID 1 is the switch, also named hdrleaf-gc5a-28. LID 162, aka lxmds21, allegedly dropped out, but the machine itself did not notice anything.
If I just look for occurrences,
~# grep 'Discovered new port' /var/log/opensm.0x043f720300fe9f00.log
Jul 31 02:11:10 527863 [1B1C2700] 0x02 -> state_mgr_report_new_ports: Discovered new port with GUID:0x248a070300e80841 LID range [162,162] of node: lxmds21 mlx4_0
Jul 31 02:11:44 464605 [1B1C2700] 0x02 -> state_mgr_report_new_ports: Discovered new port with GUID:0x248a070300e80841 LID range [162,162] of node: lxmds21 mlx4_0
Jul 31 02:13:59 011818 [1B1C2700] 0x02 -> state_mgr_report_new_ports: Discovered new port with GUID:0x248a070300e80841 LID range [162,162] of node: lxmds21 mlx4_0
Jul 31 02:16:22 202813 [1B1C2700] 0x02 -> state_mgr_report_new_ports: Discovered new port with GUID:0x248a070300e80841 LID range [162,162] of node: lxmds21 mlx4_0
Jul 31 03:56:11 374611 [1B1C2700] 0x02 -> state_mgr_report_new_ports: Discovered new port with GUID:0x98039b0300d022a1 LID range [163,163] of node: lxmds23 mlx4_0
Jul 31 05:06:12 049500 [1B1C2700] 0x02 -> state_mgr_report_new_ports: Discovered new port with GUID:0x248a070300e80841 LID range [162,162] of node: lxmds21 mlx4_0
Jul 31 05:28:04 027643 [1B1C2700] 0x02 -> state_mgr_report_new_ports: Discovered new port with GUID:0x98039b0300d022a1 LID range [163,163] of node: lxmds23 mlx4_0
there is another machine, lxmds23, that is also completely idle and not rebooting.
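To see at a glance how often each port comes back, the grep output above can be tallied per node. This is only a minimal sketch, assuming the opensm log line format shown above; the function name count_rediscoveries and the regex are my own illustration, not anything that ships with opensm.

```python
import re
from collections import Counter

# Matches the "Discovered new port" lines as they appear in the log above.
# The optional whitespace after "node:" is tolerated on purpose, since the
# "Removed port" line in the same log prints the node name without a space.
PATTERN = re.compile(
    r"Discovered new port with GUID:(0x[0-9a-fA-F]+) "
    r"LID range \[(\d+),\s*(\d+)\] of node:\s*(\S+)"
)


def count_rediscoveries(lines):
    """Count 'Discovered new port' events per (node, GUID, LID)."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            guid, lid_low, _lid_high, node = match.groups()
            counts[(node, guid, int(lid_low))] += 1
    return counts


# The seven lines returned by the grep above, copied verbatim.
SAMPLE = """\
Jul 31 02:11:10 527863 [1B1C2700] 0x02 -> state_mgr_report_new_ports: Discovered new port with GUID:0x248a070300e80841 LID range [162,162] of node: lxmds21 mlx4_0
Jul 31 02:11:44 464605 [1B1C2700] 0x02 -> state_mgr_report_new_ports: Discovered new port with GUID:0x248a070300e80841 LID range [162,162] of node: lxmds21 mlx4_0
Jul 31 02:13:59 011818 [1B1C2700] 0x02 -> state_mgr_report_new_ports: Discovered new port with GUID:0x248a070300e80841 LID range [162,162] of node: lxmds21 mlx4_0
Jul 31 02:16:22 202813 [1B1C2700] 0x02 -> state_mgr_report_new_ports: Discovered new port with GUID:0x248a070300e80841 LID range [162,162] of node: lxmds21 mlx4_0
Jul 31 03:56:11 374611 [1B1C2700] 0x02 -> state_mgr_report_new_ports: Discovered new port with GUID:0x98039b0300d022a1 LID range [163,163] of node: lxmds23 mlx4_0
Jul 31 05:06:12 049500 [1B1C2700] 0x02 -> state_mgr_report_new_ports: Discovered new port with GUID:0x248a070300e80841 LID range [162,162] of node: lxmds21 mlx4_0
Jul 31 05:28:04 027643 [1B1C2700] 0x02 -> state_mgr_report_new_ports: Discovered new port with GUID:0x98039b0300d022a1 LID range [163,163] of node: lxmds23 mlx4_0
""".splitlines()

if __name__ == "__main__":
    for (node, guid, lid), n in count_rediscoveries(SAMPLE).most_common():
        print(f"{node} LID {lid} GUID {guid}: rediscovered {n} times")
```

On the real log one would pass the open file object instead of SAMPLE, e.g. count_rediscoveries(open(path)), since the function only iterates over lines.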
If this is perhaps connected to the HCA hardware, it wouldn't matter so much; we could replace it.
If this rather points to an issue with the switch, it is more urgent, because there are more important machines connected to that switch.
Thanks,
Thomas