I am running various perftest benchmarks, such as ib_send_bw, on two servers with ConnectX-7 adapters over an InfiniBand network. The LMC is set to 3.
The following are the commands I use:
Receiver: ./perftest/ib_send_bw -d mlx5_0
Sender: ./perftest/ib_send_bw -d mlx5_0 --dlid 48 [RECEIVER_IP]
(48 is the base LID for mlx5_0 on the receiver)
This works perfectly. However, if I keep the receiver command the same but change the sender command to
Sender: ./perftest/ib_send_bw -d mlx5_0 --dlid 49 [RECEIVER_IP]
it fails, despite 49 also being a valid LID for the receiver.
Why is this the case, and how can I fix this issue?
If you’re running ib_send_bw performance tests on servers with ConnectX-7 adapters over an InfiniBand network and using lmc=3, you might encounter issues when trying to use non-base LIDs. Here’s a breakdown of why this happens and how to fix it.
Why Does It Work with Base LID (48) but Fail with LID 49?
When lmc=3 is set, the Subnet Manager (SM) assigns multiple LIDs to a single port (up to 2^lmc, or 8 in this case). These LIDs are used for multipath routing, enabling traffic distribution across different paths in the network. However, not all LIDs may be valid for communication unless the routing tables are properly configured.
Base LID (48): This is explicitly configured by the SM and is guaranteed to work.
Non-Base LIDs (e.g., 49): These may fail if the SM hasn’t configured valid routes for them or if intermediate switches don’t have proper forwarding table entries.
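To make the LID range concrete: with lmc=3 a port answers to 2^3 = 8 consecutive LIDs starting at its base LID, and the low 3 bits of a LID act as the path bits. The following is a minimal Python sketch of that arithmetic only; the function names are illustrative and not part of any InfiniBand tool.

```python
def lid_range(base_lid: int, lmc: int) -> list[int]:
    """Return every LID a port answers to, given its base LID and LMC."""
    count = 1 << lmc                      # 2^lmc LIDs per port
    if base_lid % count != 0:
        # The base LID has its low lmc bits set to zero.
        raise ValueError("base LID must be aligned to 2^lmc")
    return list(range(base_lid, base_lid + count))


def split_lid(lid: int, lmc: int) -> tuple[int, int]:
    """Split a LID into (base LID, path bits) for the given LMC."""
    mask = (1 << lmc) - 1                 # low lmc bits select the path
    return lid & ~mask, lid & mask


print(lid_range(48, 3))   # [48, 49, 50, 51, 52, 53, 54, 55]
print(split_lid(49, 3))   # (48, 1)
```

For the receiver in the question, this means LIDs 48 through 55 all address the same port, and LID 49 is base LID 48 with path bits set to 1.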
How to Fix the Issue
Here are some steps you can take to resolve this problem:
Verify Routing Configuration:
Use ibroute to inspect the switches' forwarding tables, and ibnetdiscover to check the topology, to confirm that every LID assigned to the receiver's port has a valid path.
Check if LID 49 specifically has a route to the destination.
Force Multipath Usage:
Confirm that your Subnet Manager supports and is configured for multipath routing with multiple LIDs.
If needed, restart the Subnet Manager (opensm) to regenerate routing tables.
Test Connectivity for All LIDs:
Use diagnostic tools like ibping or ib_send_bw to test each LID individually.
Example: Run ib_send_bw -d mlx5_0 --dlid [LID] [RECEIVER_IP] for each LID to confirm connectivity.
Check SM Configuration:
Ensure that lmc=3 is consistently set across all nodes in your network.
Review your SM logs for any errors related to path computation or table updates.
Consider Using GIDs:
If your network spans multiple subnets or requires more robust addressing, consider switching to GID-based communication instead of relying solely on LIDs.
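The "Test Connectivity for All LIDs" step can be scripted. The sketch below only builds and prints the sender command for each LID in the receiver's range, reusing the flags from the question; it does not run anything. The function name is hypothetical, RECEIVER_IP is a placeholder, and it assumes the receiver side is started again before each attempt.

```python
import shlex


def sender_commands(base_lid: int, lmc: int, device: str, receiver: str) -> list[str]:
    """Build one ib_send_bw sender command per LID assigned to the receiver."""
    commands = []
    for lid in range(base_lid, base_lid + (1 << lmc)):
        argv = ["./perftest/ib_send_bw", "-d", device, "--dlid", str(lid), receiver]
        commands.append(shlex.join(argv))   # shell-safe, copy-pasteable string
    return commands


# Base LID 48 and lmc=3 are the values from the question.
for cmd in sender_commands(48, 3, "mlx5_0", "RECEIVER_IP"):
    print(cmd)
```

Printing the commands rather than executing them keeps the sketch independent of the fabric; each line can be pasted on the sender once the receiver is listening.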
By following these steps, you can ensure that all assigned LIDs, including non-base ones like 49, are properly configured and usable in your tests. If you have further questions, feel free to ask!
ibnetdiscover does show that all LIDs have valid paths.
opensm is running.
ibping works with non-base LIDs. Only ib_send_bw fails, with this error:
Completion with error at client
Failed status 12: wr_id 0 syndrom 0x81
scnt=128, ccnt=0
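For reference, assuming perftest prints the raw work-completion status here, status 12 corresponds to IBV_WC_RETRY_EXC_ERR in the libibverbs ibv_wc_status enum (transport retry counter exceeded), which means the sender never received an acknowledgement for its packets. A small Python lookup as a sketch: the table is copied by hand from that enum, covers only a few codes, and does not decode the vendor syndrome 0x81.

```python
# Partial, hand-copied subset of enum ibv_wc_status from libibverbs.
WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    5: "IBV_WC_WR_FLUSH_ERR",
    12: "IBV_WC_RETRY_EXC_ERR",        # transport retry counter exceeded
    13: "IBV_WC_RNR_RETRY_EXC_ERR",    # receiver-not-ready retries exceeded
}


def describe_status(code: int) -> str:
    """Map a numeric completion status to its enum name, if known."""
    return WC_STATUS.get(code, f"unknown status {code}")


print(describe_status(12))
```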
lmc=3 is set everywhere, and there are no significant errors in the logs.
I want to test with different LIDs because different LIDs take different paths through the network.
Do you have any suggestions to resolve the error I mentioned for ib_send_bw, given that the network appears to be configured correctly and I can use the non-base LID with other applications, but not ib_send_bw?