Understanding ConnectX HCA Behavior with Adaptive Routing: Message Ordering & Performance Questions

Hello NVIDIA Community,

We are running ConnectX-series NICs (e.g., ConnectX-5/6) in our InfiniBand fabric with Adaptive Routing (AR) enabled. We’ve observed some performance characteristics that we’d like to understand better, specifically how the HCA internally maintains message order for RC (Reliable Connection) QPs when AR is active.

We have a few specific questions:

  1. Ordering of RDMA Write and Atomic Operations with AR: When we post an RDMA Write followed immediately by an Atomic FAA (Fetch-and-Add) to the same RC QP, without waiting for the Write’s completion, the FAA appears to take one additional round-trip time (RTT). Our understanding is that the sender HCA transmits the packets for these operations in the order they were posted, and that the receiver HCA ensures they are executed in that order on the target memory. Question: With Adaptive Routing enabled, does the sender HCA (e.g., ConnectX) behave differently, for example by waiting for a hardware-level acknowledgment (ACK) of the RDMA Write packets before it dispatches the packets for the subsequent Atomic FAA? That would explain the extra RTT we observe for the FAA.
  2. Handling Out-of-Order Packets for Large RDMA Writes with AR: Adaptive Routing can lead to packets arriving out-of-order at the receiver HCA. For a large RDMA Write operation that is segmented into multiple packets, these packets might traverse different paths and arrive non-sequentially. Question: How does the receiver ConnectX HCA handle this situation for an RC QP?
     • Does it employ an internal reorder buffer to collect all packets belonging to the RDMA Write message, reassemble them in the correct sequence, and then perform the DMA transfer of the complete, ordered message into the target host memory?
     • Alternatively, is it possible that the HCA might DMA individual, out-of-order packets to different parts of the target memory buffer as they arrive? If this were the case, how would it work, given our understanding that typically only the first packet of an RDMA Write request contains the full addressing metadata (like virtual address and R_Key)?
  3. Support for Write with Immediate and SEND/RECV with AR: We would like to confirm our understanding here. Question: Are RDMA Write with Immediate operations and traditional SEND/RECV message passing on RC QPs fully supported, and expected to function correctly (maintaining all reliability and in-order delivery guarantees), when Adaptive Routing is enabled on ConnectX NICs? We assume so, but given our performance investigation, we want to be certain.
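
To make the second alternative in the question about large RDMA Writes concrete, here is a toy Python model of the out-of-order placement scheme we have in mind. It is purely hypothetical and is not a description of actual ConnectX behavior. It assumes a fixed 4096-byte MTU, no PSN wraparound, and that only the first packet of the Write carries the base virtual address. The names Packet, segment_write, and placement_addresses are our own. In this model a packet's target address follows from its PSN offset, but only once the first packet has arrived, so packets that arrive earlier would still have to be held somewhere.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

MTU = 4096  # assumed path MTU in bytes (hypothetical value for this sketch)


@dataclass
class Packet:
    psn: int                        # packet sequence number
    payload_len: int                # payload bytes carried by this packet
    base_va: Optional[int] = None   # base virtual address, set only on the first packet


def segment_write(base_va: int, length: int, first_psn: int, mtu: int = MTU) -> List[Packet]:
    """Split one RDMA Write of `length` bytes into MTU-sized packets.

    Only the first packet carries the base virtual address.
    """
    packets = []
    offset = 0
    while offset < length:
        chunk = min(mtu, length - offset)
        packets.append(Packet(
            psn=first_psn + offset // mtu,
            payload_len=chunk,
            base_va=base_va if offset == 0 else None,
        ))
        offset += chunk
    return packets


def placement_addresses(arrivals: List[Packet], mtu: int = MTU) -> Dict[int, int]:
    """Map each PSN to the address it could be written to, in placement order.

    Packets that arrive before the first packet are held, because the
    base address and first PSN are not yet known to the receiver.
    """
    base_va = None
    first_psn = None
    held: List[Packet] = []
    placed: Dict[int, int] = {}
    for pkt in arrivals:
        if pkt.base_va is not None:
            base_va, first_psn = pkt.base_va, pkt.psn
        if base_va is None:
            held.append(pkt)  # cannot compute an address yet
            continue
        for p in held + [pkt]:
            placed[p.psn] = base_va + (p.psn - first_psn) * mtu
        held = []
    return placed


# Example: a 10000-byte Write whose last packet arrives first.
pkts = segment_write(base_va=0x10000, length=10000, first_psn=100)
for psn, addr in placement_addresses([pkts[2], pkts[0], pkts[1]]).items():
    print(f"PSN {psn} -> {hex(addr)}")
```

If the real hardware does something along these lines, we would like to know where early-arriving packets are buffered and how that affects latency.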

Any insights into these internal HCA behaviors, potential performance implications of AR on these specific operation sequences, or pointers to relevant documentation would be greatly appreciated. Our goal is to better understand these interactions to optimize our application performance.

Thank you for your time and assistance!