Hi guys, I have a small cluster connected by ten 100G Mellanox ConnectX-4 InfiniBand adapters and an SB7890 switch. Most of the time it runs smoothly, but sometimes one of the nodes disconnects. It is not always the same node; it happens to a random one. Once a node is disconnected, the light on its adapter turns yellow, which means it is offline.
I don't know why this happens. Any hints, or how can I debug it? The following is the opensm.log
Hello Kevin,
Thank you for posting your inquiry in the Infrastructure and Networking section of the NVIDIA Developer Forum.
Based on the information provided, the issue can be related to various components in your fabric, e.g. an unsupported cable, or HCA firmware that is not aligned with the switch firmware.
For this kind of issue, we recommend making sure all code in your fabric is aligned to the supported versions. You can find the latest release notes (RN) for all software and firmware through the following URL → Site Home - NVIDIA Networking Docs
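As a rough starting point for checking that alignment, the sketch below parses the text printed by `ibstat` (from the infiniband-diags package) and lists each port's firmware version and link state. It is only an illustration, not an official NVIDIA tool: the helper names `parse_ibstat`, `unhealthy_ports`, and `local_ibstat` are hypothetical, and the field labels ("Firmware version", "State", "Physical state", "Rate") are assumed to match the usual `ibstat` layout on your nodes.

```python
import re
import subprocess


def parse_ibstat(text):
    """Parse `ibstat` text output into a list of per-port dicts.

    Assumes the usual layout: a "CA '<name>'" header, CA-level fields
    (including "Firmware version"), then one "Port <n>:" block per port.
    """
    ports = []
    ca = None
    firmware = None
    port = None
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"CA '(.+)'", line)
        if m:
            # New adapter: reset the per-adapter state.
            ca, firmware, port = m.group(1), None, None
            continue
        m = re.match(r"Port (\d+):", line)
        if m:
            port = {"ca": ca, "port": int(m.group(1)), "firmware": firmware}
            ports.append(port)
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Firmware version":
            firmware = value
        elif port is not None and key in ("State", "Physical state", "Rate"):
            port[key.lower().replace(" ", "_")] = value
    return ports


def unhealthy_ports(ports):
    """Return ports whose logical or physical link state is not up."""
    return [
        p for p in ports
        if p.get("state") != "Active" or p.get("physical_state") != "LinkUp"
    ]


def local_ibstat():
    """Run `ibstat` on this node and return its text output."""
    return subprocess.run(
        ["ibstat"], capture_output=True, text=True, check=True
    ).stdout
```

Running `parse_ibstat(local_ibstat())` on every node and comparing the `firmware` values gives a quick view of whether the HCAs are on the same firmware level, and `unhealthy_ports(...)` shows which port has dropped when a node goes offline.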
If everything is aligned and you are still experiencing node disconnects (logical or physical), please open an NVIDIA Networking Support case by sending an email to the following address → networking-support@nvidia.com
Thank you and regards,
~NVIDIA Networking Technical Support