Hello,
We are utilizing an HDR 40-port QSFP56 InfiniBand Switch (Mellanox/NVIDIA OEM) and have experienced a significant performance regression following an upgrade to Firmware v3.11.4002: the SM Failover time has increased substantially from 5 seconds to approximately 8 seconds.
Our analysis suggests that this delay is highly correlated with a specific implementation detail within the Multicast Join MAD (Management Datagram) retransmission logic in the InfiniBand Core driver. We require a detailed technical confirmation.
1. Technical Assertion: Unresolved Destination LID in Retransmission
During an SM Failover event, when a client node (HCA) attempts to re-join a multicast group via rdma_join_multicast(), the underlying MAD retransmission mechanism appears to lead to a guaranteed failure scenario:
- Observed Behavior: The function chain (ib_sa_mcmember_rec_query()) sets an overall 3000 ms timeout, which is split into 10 retries with a 300 ms timeout each at the lower driver layer.
- The Problem: When Failover occurs, the retransmitted MAD packets are not updated with the new Master SM’s LID. They are repeatedly sent to the LID of the previous (now failed) Master SM.
- Hypothesized Result: All retries to the old LID fail, causing the operation to hit the 3-second upper-layer timeout before exhausting the 10 retries, resulting in a premature RDMA_CM_EVENT_MULTICAST_ERROR.
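To make the timing behind this assertion concrete, here is a simplified Python model of the retry budget. It is only a sketch of our understanding, not the kernel code: the function name and the LID values are hypothetical, and response latency and scheduling are ignored, so in this model the retry budget and the overall timeout expire together at 3000 ms.

```python
# Simplified model of the retry budget described above; not the kernel code.
TOTAL_TIMEOUT_MS = 3000  # overall join timeout quoted in this report
RETRIES = 10             # number of retransmissions quoted in this report


def join_with_fixed_lid(dest_lid, live_sm_lid,
                        total_timeout_ms=TOTAL_TIMEOUT_MS, retries=RETRIES):
    """Return (joined, elapsed_ms) when every attempt targets the same dest_lid."""
    per_attempt_ms = total_timeout_ms // retries  # 300 ms with the values above
    elapsed_ms = 0
    for _ in range(retries):
        if dest_lid == live_sm_lid:
            # The SM answers this attempt; response latency is ignored.
            return True, elapsed_ms
        # No response from the dead SM: this attempt times out.
        elapsed_ms += per_attempt_ms
    return False, elapsed_ms


OLD_SM_LID, NEW_SM_LID = 1, 2  # hypothetical LIDs for illustration

print(join_with_fixed_lid(OLD_SM_LID, live_sm_lid=NEW_SM_LID))  # (False, 3000)
print(join_with_fixed_lid(NEW_SM_LID, live_sm_lid=NEW_SM_LID))  # (True, 0)
```

Under these assumptions, a join that keeps targeting the old LID can never succeed and always costs the full 3000 ms.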
We assert that this logic, the failure to re-resolve the Master SM’s LID during retransmission, is inefficient and a potential bug that undermines the purpose of retransmissions in a Failover scenario. We believe this behavior is the direct cause of the Failover time extending from 5 seconds to 8 seconds: the 3-second increase matches the 3000 ms join timeout.
2. Specific Questions Regarding NVIDIA Implementation
- Confirmation of Logic: Please confirm explicitly whether the behavior of retransmitting Multicast Join MADs to the old Master SM LID during an SM Failover is the intended design (Working as Designed) of the NVIDIA/Mellanox InfiniBand stack.
- Design Rationale and Firmware Impact: If this is the intended design, please provide the technical justification for its adoption over an optimal approach. Crucially, was there any change in the behavior of this ‘guaranteed failure logic’ in Firmware v3.11.4002 that has directly led to the observed increase in Failover time?
- Roadmap for Improvement: To ensure stability and low-latency characteristics during Failover, is there any planned firmware or driver modification to incorporate at least one Master SM LID Re-resolution attempt during the retransmission cycle?
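To illustrate the kind of re-resolution we are asking about, here is a hypothetical Python sketch. It is not an existing driver or firmware interface: `join_with_reresolve`, the `resolve_sm_lid` callback, and the `reresolve_after` threshold are names we made up. It also assumes that the new Master SM is already reachable when the lookup is made.

```python
def join_with_reresolve(cached_sm_lid, resolve_sm_lid, live_sm_lid,
                        reresolve_after=1, total_timeout_ms=3000, retries=10):
    """Return (joined, elapsed_ms, attempts_used).

    resolve_sm_lid is a hypothetical callback standing in for whatever
    mechanism reports the current Master SM LID.
    """
    per_attempt_ms = total_timeout_ms // retries
    dest_lid = cached_sm_lid
    elapsed_ms = 0
    for attempt in range(1, retries + 1):
        if dest_lid == live_sm_lid:
            # The attempt reaches the live SM; response latency is ignored.
            return True, elapsed_ms, attempt
        elapsed_ms += per_attempt_ms      # this attempt timed out
        if attempt >= reresolve_after:
            dest_lid = resolve_sm_lid()   # refresh the destination LID
    return False, elapsed_ms, retries


# Hypothetical LIDs: old Master SM at 1, new Master SM at 2.
print(join_with_reresolve(1, lambda: 2, live_sm_lid=2, reresolve_after=1))  # (True, 300, 2)
print(join_with_reresolve(1, lambda: 2, live_sm_lid=2, reresolve_after=3))  # (True, 900, 4)
```

In this model a single re-resolution after the first timed-out attempt brings the join delay down from 3000 ms to 300 ms, which is the improvement we would hope to see.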
We require a clear, technical explanation of the stack’s underlying implementation and its correlation with the performance regression on Firmware v3.11.4002. Thank you for your detailed input.