Understanding Mellanox OpenSM config

Hi,
I want to understand the default configuration of Mellanox OpenSM. We had an issue where opensm, run out of the box as a daemon without a config file, was performing heavy sweeps far more often than the default interval of 10 seconds because of a faulty cable. Is this a bug, or do we need to provide a config for opensm in /etc/opensm/? I double-checked by generating a configuration from opensm, and it did show the default value of 10 seconds per sweep. Below is an example of two heavy sweeps that opensm started about one second apart, each immediately after a link state change trap. The sweeps were so frequent that they caused the node and the InfiniBand network to crash.


******************************************************************
*********************** HEAVY SWEEP START ************************
******************************************************************


Aug 19 13:34:28 934212 [2A2A76C0] 0x02 -> do_sweep: Entering heavy sweep with flags: force_heavy_sweep 1, coming out of standby 0, subnet initialization error 0, sm port change 0
Aug 19 13:34:28 969254 [2A2A76C0] 0x02 -> updn_lid_matrices: disabling UPDN algorithm, no root nodes were found
Aug 19 13:34:28 969294 [2A2A76C0] 0x01 -> ucast_mgr_route: ar_updn: cannot build lid matrices.
Aug 19 13:34:28 979739 [2A2A76C0] 0x02 -> osm_ucast_mgr_process: minhop tables configured on all switches
Aug 19 13:34:29 049107 [2A2A76C0] 0x02 -> SUBNET UP
Aug 19 13:34:30 037782 [611136C0] 0x01 -> log_trap_info: Received Generic Notice type:1 num:128 (Link state change) Producer:2 (Switch) from LID:101 TID:0x000059be00000080
Aug 19 13:34:30 037898 [611136C0] 0x02 -> SM class trap 128: Directed Path Dump of 4 hop path: Path = 0,1,4,18,27
Aug 19 13:34:30 037921 [611136C0] 0x02 -> log_notice: Reporting Generic Notice type:1 num:128 (Link state change) from LID:101 GID:fe80::fc6a:1c03:c8:dc00
Aug 19 13:34:30 038021 [2A2A76C0] 0x02 -> do_sweep:


******************************************************************
*********************** HEAVY SWEEP START ************************
******************************************************************


Aug 19 13:34:30 038070 [2A2A76C0] 0x02 -> do_sweep: Entering heavy sweep with flags: force_heavy_sweep 1, coming out of standby 0, subnet initialization error 0, sm port change 0
Aug 19 13:34:30 077499 [2A2A76C0] 0x02 -> updn_lid_matrices: disabling UPDN algorithm, no root nodes were found
Aug 19 13:34:30 077539 [2A2A76C0] 0x01 -> ucast_mgr_route: ar_updn: cannot build lid matrices.
Aug 19 13:34:30 089267 [2A2A76C0] 0x02 -> osm_ucast_mgr_process: minhop tables configured on all switches
Aug 19 13:34:30 160099 [2A2A76C0] 0x02 -> SUBNET UP
Aug 19 13:34:31 206263 [639186C0] 0x01 -> log_trap_info: Received Generic Notice type:1 num:128 (Link state change) Producer:2 (Switch) from LID:101 TID:0x000059bf00000080
Aug 19 13:34:31 206387 [639186C0] 0x02 -> SM class trap 128: Directed Path Dump of 4 hop path: Path = 0,1,4,18,27
Aug 19 13:34:31 206410 [639186C0] 0x02 -> log_notice: Reporting Generic Notice type:1 num:128 (Link state change) from LID:101 GID:fe80::fc6a:1c03:c8:dc00
Aug 19 13:34:31 206541 [2A2A76C0] 0x02 -> do_sweep:
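
To put a number on how often heavy sweeps are being triggered, a minimal Python sketch along these lines could be run over the log. It assumes the timestamp layout seen in the excerpt above (month, day, time, then a microsecond field) and that every heavy sweep is announced by a line containing "Entering heavy sweep". The function names and the placeholder year are illustrative, not part of opensm.

```python
import re
from datetime import datetime

# Matches the timestamp at the start of a log line, e.g.
# "Aug 19 13:34:28 934212" (assumed: the last field is microseconds).
TS_RE = re.compile(r"^(\w{3} +\d+ \d\d:\d\d:\d\d \d{6}) ")


def heavy_sweep_times(lines, year=2000):
    """Return the timestamps of lines that announce a heavy sweep.

    The log carries no year, so `year` is an arbitrary placeholder.
    """
    times = []
    for line in lines:
        if "Entering heavy sweep" not in line:
            continue
        match = TS_RE.match(line)
        if match:
            stamp = f"{year} {match.group(1)}"
            times.append(datetime.strptime(stamp, "%Y %b %d %H:%M:%S %f"))
    return times


def sweep_intervals(lines):
    """Return the seconds elapsed between consecutive heavy sweeps."""
    times = heavy_sweep_times(lines)
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


# Two lines shortened from the excerpt above.
sample = [
    "Aug 19 13:34:28 934212 [2A2A76C0] 0x02 -> do_sweep: Entering heavy sweep with flags: force_heavy_sweep 1",
    "Aug 19 13:34:30 038070 [2A2A76C0] 0x02 -> do_sweep: Entering heavy sweep with flags: force_heavy_sweep 1",
]
print(sweep_intervals(sample))
```

For the two sweeps shown here the interval comes out at a little over one second, well below the 10 second default.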

Related Ticket: MLNX OpenSM force sweeping every minute - #4 by marlon1

Hi, this is because sweep_on_trap is enabled.
Please see the explanation of the option below: when a link state changes, it triggers a heavy sweep.

If TRUE, every trap 128 and 144 will cause a heavy sweep.

NOTE: successive identical traps (>10) are suppressed

NOTE: Debug option. Changing the value is not recommended.

sweep_on_trap TRUE
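
To confirm what a generated config actually contains, a small Python sketch like the following could read the option values. It assumes the file uses whitespace-separated "name value" lines, as in the sweep_on_trap line above, and that comment lines start with "#". The helper name and the sample text are hypothetical. It only reads the setting, since the note above advises against changing it.

```python
def read_options(conf_text):
    """Parse 'name value' lines into a dict, skipping blanks and # comments."""
    options = {}
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        options[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return options


# Sample text modelled on the excerpt above; a real file has many more options.
sample_conf = """\
# If TRUE, every trap 128 and 144 will cause a heavy sweep.
sweep_on_trap TRUE
"""

opts = read_options(sample_conf)
print(opts.get("sweep_on_trap"))
```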