Hi,
We have a mixed InfiniBand network ranging from FDR to NDR. We run it on unmanaged switches, with the master OpenSM running on an NDR node (MLNX 5.20.0.MLNX20240804) under Debian 12. This configuration has worked for us for the past year with few issues (once, a faulty cable caused a link to flap and brought the network down, as OpenSM triggered heavy sweep after heavy sweep). We have no special OpenSM configuration; everything is out of the box.
Yesterday we ran into an issue where the network got into a state in which nodes kept alternating between unresponsive and responsive. In the OpenSM log I saw the following errors:
Sep 21 13:38:04 466969 [A060B6C0] 0x01 -> osm_vendor_send: ERR 5430: Send p_madw = 0x7f20540fa890 of size 120, Class 0x3, Method 0x81, Attr 0x35, TID 0x3956be6e1, port_idx 0 failed -5 (Invalid argument)
Sep 21 13:38:04 466982 [A060B6C0] 0x01 -> osm_sa_send: ERR 4C04: osm_vendor_send failed, status = IB_UNKNOWN_ERROR
Sep 21 13:38:04 466986 [59D7E6C0] 0x01 -> osm_sa_send: ERR 4C04: osm_vendor_send failed, status = IB_UNKNOWN_ERROR
Sep 21 13:38:04 466999 [805CB6C0] 0x01 -> osm_sa_send: ERR 4C04: osm_vendor_send failed, status = IB_UNKNOWN_ERROR
Sep 21 13:38:04 470576 [BE6476C0] 0x01 -> osm_vendor_send: ERR 5430: Send p_madw = 0x7f2054154430 of size 120, Class 0x3, Method 0x81, Attr 0x35, TID 0x35763cbf1, port_idx 0 failed -5 (Invalid argument)
Sep 21 13:38:04 470597 [BE6476C0] 0x01 -> osm_sa_send: ERR 4C04: osm_vendor_send failed, status = IB_UNKNOWN_ERROR
Sep 21 13:38:04 470601 [60D8C6C0] 0x01 -> osm_sa_send: ERR 4C04: osm_vendor_send failed, status = IB_UNKNOWN_ERROR
Sep 21 13:38:04 470776 [BB6416C0] 0x01 -> osm_vendor_send: ERR 5430: Send p_madw = 0x7f205412c240 of size 120, Class 0x3, Method 0x81, Attr 0x35, TID 0x34ad8649f, port_idx 0 failed -5 (Invalid argument)
Sep 21 13:38:04 470784 [BB6416C0] 0x01 -> osm_sa_send: ERR 4C04: osm_vendor_send failed, status = IB_UNKNOWN_ERROR
Sep 21 13:38:04 470880 [A961D6C0] 0x01 -> osm_vendor_send: ERR 5430: Send p_madw = 0x7f205410d170 of size 120, Class 0x3, Method 0x81, Attr 0x35, TID 0x3136e0108, port_idx 0 failed -5 (Invalid argument)
Sep 21 13:38:04 470897 [A961D6C0] 0x01 -> osm_sa_send: ERR 4C04: osm_vendor_send failed, status = IB_UNKNOWN_ERROR
Sep 21 13:38:04 471191 [4B5616C0] 0x01 -> osm_vendor_send: ERR 5430: Send p_madw = 0x7f2054133900 of size 120, Class 0x3, Method 0x81, Attr 0x35, TID 0x3ca9d60ee, port_idx 0 failed -5 (Invalid argument)
Sep 21 13:38:04 471200 [4B5616C0] 0x01 -> osm_sa_send: ERR 4C04: osm_vendor_send failed, status = IB_UNKNOWN_ERROR
Sep 21 13:38:04 474823 [735B16C0] 0x01 -> osm_vendor_send: ERR 5430: Send p_madw = 0x7f205413bd70 of size 120, Class 0x3, Method 0x81, Attr 0x35, TID 0x35763cbf0, port_idx 0 failed -5 (Invalid argument)
Sep 21 13:38:04 474835 [735B16C0] 0x01 -> osm_sa_send: ERR 4C04: osm_vendor_send failed, status = IB_UNKNOWN_ERROR
Sep 21 13:38:04 478709 [375396C0] 0x01 -> osm_vendor_send: ERR 5430: Send p_madw = 0x7f2054157080 of size 120, Class 0x3, Method 0x81, Attr 0x35, TID 0x3ca9d60f1, port_idx 0 failed -5 (Invalid argument)
Sep 21 13:38:04 478722 [375396C0] 0x01 -> osm_sa_send: ERR 4C04: osm_vendor_send failed, status = IB_UNKNOWN_ERROR
Sep 21 13:38:04 478918 [545736C0] 0x01 -> osm_vendor_send: ERR 5430: Send p_madw = 0x7f2054187b20 of size 120, Class 0x3, Method 0x81, Attr 0x35, TID 0x32e473af1, port_idx 0 failed -5 (Invalid argument)
Sep 21 13:38:04 478930 [545736C0] 0x01 -> osm_sa_send: ERR 4C04: osm_vendor_send failed, status = IB_UNKNOWN_ERROR
Sep 21 13:38:04 479078 [A3E126C0] 0x01 -> osm_vendor_send: ERR 5430: Send p_madw = 0x7f20540f8f80 of size 120, Class 0x3, Method 0x81, Attr 0x35, TID 0x34ad86483, port_idx 0 failed -5 (Invalid argument)
Sep 21 13:38:04 479091 [A3E126C0] 0x01 -> osm_sa_send: ERR 4C04: osm_vendor_send failed, status = IB_UNKNOWN_ERROR
Sep 21 13:38:04 479802 [BE6476C0] 0x01 -> osm_vendor_send: ERR 5430: Send p_madw = 0x7f2054154430 of size 120, Class 0x3, Method 0x81, Attr 0x35, TID 0x30f29eaeb, port_idx 0 failed -5 (Invalid argument)
Sep 21 13:38:04 479809 [BE6476C0] 0x01 -> osm_sa_send: ERR 4C04: osm_vendor_send failed, status = IB_UNKNOWN_ERROR
Sep 21 13:38:04 479900 [60D8C6C0] 0x01 -> osm_vendor_send: ERR 5430: Send p_madw = 0x7f2054120dc0 of size 120, Class 0x3, Method 0x81, Attr 0x35, TID 0x30f29eaea, port_idx 0 failed -5 (Invalid argument)
Sep 21 13:38:04 479929 [60D8C6C0] 0x01 -> osm_sa_send: ERR 4C04: osm_vendor_send failed, status = IB_UNKNOWN_ERROR
Sep 21 13:38:04 482301 [BB6416C0] 0x01 -> osm_vendor_send: ERR 5430: Send p_madw = 0x7f205412c240 of size 120, Class 0x3, Method 0x81, Attr 0x35, TID 0x30f29eadb, port_idx 0 failed -5 (Invalid argument)
Sep 21 13:38:04 482316 [BB6416C0] 0x01 -> osm_sa_send: ERR 4C04: osm_vendor_send failed, status = IB_UNKNOWN_ERROR
Sep 21 13:38:04 483026 [8EDE86C0] 0x01 -> osm_vendor_send: ERR 5430: Send p_madw = 0x7f205410e4b0 of size 120, Class 0x3, Method 0x81, Attr 0x35, TID 0x3956be6c2, port_idx 0 failed -5 (Invalid argument)
Sep 21 13:38:04 483047 [8EDE86C0] 0x01 -> osm_sa_send: ERR 4C04: osm_vendor_send failed, status = IB_UNKNOWN_ERROR
Sep 21 13:38:04 547048 [915ED6C0] 0x01 -> osm_vendor_send: ERR 5430: Send p_madw = 0x7f20541034f0 of size 120, Class 0x3, Method 0x81, Attr 0x35, TID 0x3176b0288, port_idx 0 failed -5 (Invalid argument)
Sep 21 13:38:04 547069 [915ED6C0] 0x01 -> osm_sa_send: ERR 4C04: osm_vendor_send failed, status = IB_UNKNOWN_ERROR
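To make the excerpt above easier to read, here is a minimal Python sketch that tallies the error codes and decodes the MAD fields. The lookup tables reflect my reading of the InfiniBand management values (class 0x3 as SubnAdm, method 0x81 as GetResp, attribute 0x35 as PathRecord), so please correct me if that is wrong. The function name `summarize` and the two-line sample are only for illustration. If the decoding is right, the failing sends in this excerpt all look like SA PathRecord responses.

```python
import re
from collections import Counter

# Assumed decodings of the InfiniBand MAD header values seen in the log.
MGMT_CLASSES = {0x03: "SubnAdm"}
METHODS = {0x81: "GetResp"}
SA_ATTRIBUTES = {0x35: "PathRecord"}

# Matches e.g. "-> osm_vendor_send: ERR 5430:"
ERR_RE = re.compile(r"-> (?P<func>\w+): ERR (?P<code>[0-9A-F]{4}):")
# Matches e.g. "Class 0x3, Method 0x81, Attr 0x35"
MAD_RE = re.compile(
    r"Class (0x[0-9a-fA-F]+), Method (0x[0-9a-fA-F]+), Attr (0x[0-9a-fA-F]+)"
)


def summarize(lines):
    """Count (function, error code) pairs and decoded (class, method, attr) triples."""
    errors = Counter()
    mads = Counter()
    for line in lines:
        err = ERR_RE.search(line)
        if err:
            errors[(err["func"], err["code"])] += 1
        mad = MAD_RE.search(line)
        if mad:
            cls, method, attr = (int(value, 16) for value in mad.groups())
            mads[(
                MGMT_CLASSES.get(cls, hex(cls)),
                METHODS.get(method, hex(method)),
                SA_ATTRIBUTES.get(attr, hex(attr)),
            )] += 1
    return errors, mads


# Two lines copied from the log excerpt above.
SAMPLE = [
    "Sep 21 13:38:04 466969 [A060B6C0] 0x01 -> osm_vendor_send: ERR 5430: "
    "Send p_madw = 0x7f20540fa890 of size 120, Class 0x3, Method 0x81, "
    "Attr 0x35, TID 0x3956be6e1, port_idx 0 failed -5 (Invalid argument)",
    "Sep 21 13:38:04 466982 [A060B6C0] 0x01 -> osm_sa_send: ERR 4C04: "
    "osm_vendor_send failed, status = IB_UNKNOWN_ERROR",
]

errors, mads = summarize(SAMPLE)
for key, count in errors.items():
    print(key, count)
for key, count in mads.items():
    print(key, count)
```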
The issue was finally resolved by making the OpenSM instance running on a ConnectX-4 node the master. Admittedly, the OpenSM version on that node is a bit older (OpenSM 5.19.0.MLNX20240421.b7c161a9), but this brought the network back to a working state, and everything works now. I would like to understand what went wrong and why, because everything had worked for months and no network libraries of any kind were changed in the past few months. The last change was 2 months ago, when perftest from Debian stable was installed on the NDR master node for some testing. Is there a guide that lists these errors so they can be looked up? Thank you