Hi all,
We have an InfiniBand network built on unmanaged switches ranging from EDR to NDR. The subnet manager is MLNX OpenSM, running on an EDR node connected to an EDR switch. The OpenSM log currently shows:
******************************************************************
*********************** HEAVY SWEEP START ************************
******************************************************************
Aug 19 10:24:50 910089 [2A2A76C0] 0x02 -> do_sweep: Entering heavy sweep with flags: force_heavy_sweep 1, coming out of standby 0, subnet initialization error 0, sm port change 0
Aug 19 10:24:50 940459 [2A2A76C0] 0x02 -> updn_lid_matrices: disabling UPDN algorithm, no root nodes were found
Aug 19 10:24:50 940501 [2A2A76C0] 0x01 -> ucast_mgr_route: ar_updn: cannot build lid matrices.
Aug 19 10:24:50 951336 [2A2A76C0] 0x02 -> osm_ucast_mgr_process: minhop tables configured on all switches
Aug 19 10:24:51 019910 [2A2A76C0] 0x02 -> SUBNET UP
Aug 19 10:24:51 389593 [709326C0] 0x01 -> log_trap_info: Received Generic Notice type:1 num:128 (Link state change) Producer:2 (Switch) from LID:101 TID:0x0000388b00000080
Aug 19 10:24:52 493189 [639186C0] 0x01 -> log_trap_info: Received Generic Notice type:1 num:128 (Link state change) Producer:2 (Switch) from LID:101 TID:0x0000388c00000080
Aug 19 10:24:53 596876 [7693E6C0] 0x01 -> log_trap_info: Received Generic Notice type:1 num:128 (Link state change) Producer:2 (Switch) from LID:101 TID:0x0000388d00000080
Aug 19 10:24:54 700521 [699246C0] 0x01 -> log_trap_info: Received Generic Notice type:1 num:128 (Link state change) Producer:2 (Switch) from LID:101 TID:0x0000388e00000080
Aug 19 10:24:57 297814 [6B9286C0] 0x01 -> log_trap_info: Received Generic Notice type:1 num:128 (Link state change) Producer:2 (Switch) from LID:101 TID:0x0000388f00000080
Aug 19 10:24:58 401113 [6F9306C0] 0x01 -> log_trap_info: Received Generic Notice type:1 num:128 (Link state change) Producer:2 (Switch) from LID:101 TID:0x0000389000000080
Aug 19 10:24:59 569785 [689226C0] 0x01 -> log_trap_info: Received Generic Notice type:1 num:128 (Link state change) Producer:2 (Switch) from LID:101 TID:0x0000389100000080
Aug 19 10:25:01 025736 [2A2A76C0] 0x02 -> do_sweep:
This repeats constantly. As a result, if the node running OpenSM goes down, our InfiniBand network crashes and the backup OpenSM does not take over. LID 101 is one of our NDR switches; its details are:
mlxlink -d lid-101
Operational Info
----------------
State : Active
Physical state : LinkUp
Speed : IB-NDR
Width : 4x
FEC : Ethernet_Consortium_LL_50G_RS_FEC_PLR -(272,257+1)
Loopback Mode : No Loopback
Auto Negotiation : ON
Supported Info
--------------
Enabled Link Speed : 0x000000f1 (NDR,HDR,EDR,FDR,SDR)
Supported Cable Speed : 0x000000f1 (NDR,HDR,EDR,FDR,SDR)
Troubleshooting Info
--------------------
Status Opcode : 0
Group Opcode : N/A
Recommendation : No issue was observed
Tool Information
----------------
Firmware Version : 31.2012.4036
amBER Version : 3.2
MFT Version : mft 4.28.0-92
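To put a number on how often each LID raises these link state change traps, here is a minimal Python sketch that counts trap 128 notices per LID. It assumes the log lines follow exactly the `log_trap_info` format pasted above. The function name `count_traps`, the `SAMPLE` lines, and passing the log path as a command-line argument are my own choices for illustration, not part of OpenSM.

```python
import re
import sys
from collections import Counter

# Matches log_trap_info lines in the format shown in the log excerpt above.
# Groups: notice type, trap number, source LID.
TRAP_RE = re.compile(
    r"log_trap_info: Received Generic Notice type:(\d+) num:(\d+) .*? from LID:(\d+)"
)


def count_traps(lines, trap_num=128):
    """Count notices with the given trap number, keyed by source LID."""
    counts = Counter()
    for line in lines:
        match = TRAP_RE.search(line)
        if match and int(match.group(2)) == trap_num:
            counts[int(match.group(3))] += 1
    return counts


# Two trap lines and one unrelated line copied from the excerpt above.
SAMPLE = [
    "Aug 19 10:24:51 019910 [2A2A76C0] 0x02 -> SUBNET UP",
    "Aug 19 10:24:51 389593 [709326C0] 0x01 -> log_trap_info: Received Generic Notice "
    "type:1 num:128 (Link state change) Producer:2 (Switch) from LID:101 "
    "TID:0x0000388b00000080",
    "Aug 19 10:24:52 493189 [639186C0] 0x01 -> log_trap_info: Received Generic Notice "
    "type:1 num:128 (Link state change) Producer:2 (Switch) from LID:101 "
    "TID:0x0000388c00000080",
]

if __name__ == "__main__":
    if len(sys.argv) > 1:
        # Assumption: the first argument is the path to the OpenSM log file.
        with open(sys.argv[1], errors="replace") as handle:
            result = count_traps(handle)
    else:
        result = count_traps(SAMPLE)
    # Print LIDs ordered by how many traps they sent, most frequent first.
    for lid, total in result.most_common():
        print(f"LID {lid}: {total} link state change traps")
```

Run against the full log, this should show whether LID 101 is the only switch reporting link state changes or whether other LIDs are flapping too.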
Can someone help me troubleshoot this? It is causing quite a few issues in our cluster.