QM9790 Fabric Issues: Duplicate LID Learned on Multiple Ports and Switch Unresponsiveness

Hi,

We’re running a data center with multiple NVIDIA Quantum QM9790 InfiniBand switches in a fabric, and we’ve encountered instability on a subset of newly added switches.

Here’s the situation:

  • Some of the newer switches occasionally become unresponsive (appear “hung”).

  • Prior to this, we performed partial GPU server relocations — specifically, some servers were moved and reconnected to different IB switches.

  • The switches that were not involved in these changes remain stable.

From the logs and observations:

  • We’re seeing repeated errors indicating that the same LID is being learned from multiple ports.

  • When running iblinkinfo, the command hangs or times out when it tries to query the problematic switches, and only proceeds after skipping them.

  • This behavior seems isolated to the switches connected to the relocated GPU nodes.

Our current suspicion is:

  • The IB HCAs on the GPU servers may still be using stale LIDs after the relocation.

  • This could be causing LID conflicts or inconsistent forwarding behavior in the fabric.

Questions:

  1. Have you seen similar issues where LID conflicts occur after moving nodes within the fabric?

  2. What is the recommended way to ensure LIDs are properly re-assigned after topology changes? (e.g., SM restart, node reboot, or manual cleanup)

  3. Could this lead to switches becoming unresponsive, or is there likely another root cause we should investigate?

  4. Are there specific tools or commands you recommend to validate LID consistency across the fabric?

Any insights would be greatly appreciated.

Hi

You need to open a case and review this with a technical support engineer.

When you open the case, please collect and upload the UFM system dump and the ibdiagnet output files.

# ibdiagnet -r --pc --pm_pause_time 300 -P all=1 --extended_speeds all --pm_per_lane --reset_phy_info --get_phy_info --get_cable_info

# tar cvf ibdiagnet.tar /var/tmp/ibdiagnet2/
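
While the case is being processed, a rough local check for the duplicate-LID symptom described above could look like the sketch below. It is only an illustration under stated assumptions: the function name find_duplicate_lids and the two-column input file (one "LID GUID" pair per line) are hypothetical, and the file is something you would prepare yourself from your own discovery output; it is not a format produced by ibdiagnet. The function prints every LID that appears with more than one distinct GUID.

```shell
#!/bin/sh
# find_duplicate_lids FILE
# FILE is a HYPOTHETICAL two-column text file, "<LID> <GUID>" per line,
# prepared by hand from whatever fabric discovery output is available.
# Prints each LID that is claimed by more than one distinct GUID.
find_duplicate_lids() {
    awk '
        NF < 2 || $1 ~ /^#/ { next }        # skip blank, short, and comment lines
        !seen[$1 SUBSEP $2]++ {             # handle each (LID, GUID) pair only once
            count[$1]++                     # number of distinct GUIDs for this LID
            guids[$1] = guids[$1] " " $2    # remember them in input order
        }
        END {
            for (lid in count)
                if (count[lid] > 1)
                    print "LID " lid " claimed by:" guids[lid]
        }
    ' "$1" | sort
}
```

For example, running find_duplicate_lids lids.txt on a file where LID 12 is listed with two different GUIDs prints one line naming that LID and both GUIDs. A repeated line with the same LID and the same GUID is not reported, since that is not a conflict.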

/HyungKwang

Thanks, hyungkwangc.

Your suggestion will help me.

I’ll do that soon.