Hi guys, I have a small cluster connected by ten 100G Mellanox ConnectX-4 InfiniBand adapters and an SB7890 switch. Most of the time it runs smoothly, but sometimes one of the nodes disconnects. It is not always the same node; it happens to a random one. Once a node is disconnected, the light on its adapter turns yellow, which means it's offline.
I don't know why. Any hints, or how can I debug this? The following is the opensm.log

Yesterday another node went offline. Please see the image.

Hello Kevin,

Thank you for posting your inquiry on the NVIDIA Developer Forum - Infrastructure and Networking - Section.

Based on the information provided, the issue could be related to various components in your fabric, e.g. an unsupported cable or a mismatch between the HCA firmware and the switch firmware.

For this kind of issue, we recommend making sure all software and firmware in your fabric is aligned to the supported versions. You can find the latest release notes (RN) for all software and firmware at the following URL → Site Home - NVIDIA Networking Docs
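
As a starting point for checking alignment, here is a minimal Python sketch (not an official NVIDIA tool) that reads the firmware version and port state of each local HCA from the Linux sysfs tree. It assumes the standard `/sys/class/infiniband` layout exposed by the kernel InfiniBand drivers; the function name `read_hca_info` and its `sysfs_root` parameter are made up for this illustration.

```python
from pathlib import Path


def _read(path):
    """Return the stripped contents of a sysfs file, or 'unknown' if unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return "unknown"


def read_hca_info(sysfs_root="/sys/class/infiniband"):
    """Collect firmware version and per-port link state for each local HCA.

    Assumption: devices appear as <sysfs_root>/<device>/fw_ver and
    <sysfs_root>/<device>/ports/<n>/{state,phys_state}.
    """
    root = Path(sysfs_root)
    info = {}
    if not root.is_dir():
        return info  # no InfiniBand devices visible on this host
    for dev in sorted(root.iterdir()):
        ports = {}
        ports_dir = dev / "ports"
        if ports_dir.is_dir():
            for port in sorted(ports_dir.iterdir()):
                ports[port.name] = {
                    "state": _read(port / "state"),
                    "phys_state": _read(port / "phys_state"),
                }
        info[dev.name] = {"fw_ver": _read(dev / "fw_ver"), "ports": ports}
    return info


if __name__ == "__main__":
    # Print one summary per adapter so the output can be compared across nodes.
    for dev, data in read_hca_info().items():
        print(f"{dev}: firmware {data['fw_ver']}")
        for port, st in data["ports"].items():
            print(f"  port {port}: state={st['state']} phys_state={st['phys_state']}")
```

Running this on every node and comparing the output should show whether all adapters report the same firmware version, which can then be checked against the release notes. On a node that has dropped off the fabric, the port state fields are worth including in a support case together with the opensm.log.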

If everything is aligned and you are still experiencing node disconnects (logical/physical), please open an NVIDIA Networking Support case by sending an email to the following address → networking-support@nvidia.com

Thank you and regards,
~NVIDIA Networking Technical Support

