Hi,
We’re running a data center with multiple NVIDIA Quantum QM9790 InfiniBand switches in a fabric, and we’ve run into instability with a subset of newly added switches.
Here’s the situation:
- Some of the newer switches occasionally become unresponsive (appear “hung”).
- Prior to this, we performed partial GPU server relocations: some servers were moved and reconnected to different IB switches.
- The switches that were not involved in these changes remain stable.
From the logs and observations:
- We’re seeing repeated errors indicating that the same LID is being learned from multiple ports.
- When running `iblinkinfo`, the command hangs or times out when it tries to query the problematic switches, and only proceeds after skipping them.
- This behavior seems isolated to the switches connected to the relocated GPU nodes.
Our current suspicion is:
- The IB HCAs on the GPU servers may still be using stale LIDs after the relocation.
- This could be causing LID conflicts or inconsistent forwarding behavior in the fabric.
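To make the suspicion concrete, here is a rough sketch of the kind of check we have in mind for spotting a LID claimed by more than one endpoint. The input format is an assumption: a list of (LID, port GUID, port number) tuples, one per end port, that we would have to build ourselves from a fabric dump. It is not the output format of any specific tool, the GUIDs below are made up, and the helper name `find_duplicate_lids` is ours. It also ignores LMC ranges, so treat it as illustrative only.

```python
from collections import defaultdict


def find_duplicate_lids(records):
    """Group end ports by LID and return only LIDs seen on more than one end port.

    records: iterable of (lid, port_guid, port_num) tuples (assumed format).
    Returns {lid: sorted list of (port_guid, port_num)}.
    """
    owners = defaultdict(set)
    for lid, port_guid, port_num in records:
        owners[lid].add((port_guid, port_num))
    # A LID is suspicious only if two or more distinct end ports claim it.
    return {lid: sorted(eps) for lid, eps in owners.items() if len(eps) > 1}


# Hypothetical sample data, not taken from our fabric.
sample = [
    (12, "0xaaaa000000000001", 1),
    (13, "0xaaaa000000000002", 1),
    (12, "0xaaaa000000000003", 1),  # same LID as the first entry
]

for lid, endpoints in sorted(find_duplicate_lids(sample).items()):
    print(f"LID {lid} is claimed by {len(endpoints)} end ports: {endpoints}")
```

If a check like this flagged the relocated HCAs specifically, that would support the stale-LID theory; if it came back empty, we would look elsewhere.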
Questions:
- Have you seen similar issues where LID conflicts occur after moving nodes within the fabric?
- What is the recommended way to ensure LIDs are properly re-assigned after topology changes? (e.g., SM restart, node reboot, or manual cleanup)
- Could this lead to switches becoming unresponsive, or is there likely another root cause we should investigate?
- Are there specific tools or commands you recommend to validate LID consistency across the fabric?
Any insights would be greatly appreciated.