I am trying to run four instances of ib_send_bw over UD between two computers, each with four MCX515A-CCAT NICs installed. Two instances run fine, but the third fails with an error.

Both computers are running RHEL 7.6. All 8 NICs are configured the same way.

On the server, I am running the following commands in three separate windows:

$ ib_send_bw -d mlx5_0 -c UD --report_gbits -i 1 -F --run_infinitely

$ ib_send_bw -d mlx5_1 -c UD -p 18520 --report_gbits -i 1 -F --run_infinitely

$ ib_send_bw -d mlx5_2 -c UD -p 18521 --report_gbits -i 1 -F --run_infinitely

On the client, the commands being run are:

$ ib_send_bw -d mlx5_0 -c UD -i 1 -F --report_gbits --run_infinitely 10.10.10.3

$ ib_send_bw -d mlx5_1 -c UD -i 1 -F -p 18520 --report_gbits --run_infinitely 10.10.10.4

$ ib_send_bw -d mlx5_2 -c UD -i 1 -F -p 18521 --report_gbits --run_infinitely 10.10.10.5

The first two instances run fine. The third fails with the following output on each side.

Server:

$ ib_send_bw -d mlx5_2 -c UD -p 18521 --report_gbits -i 1 -F --run_infinitely


************************************
* Waiting for client to connect... *
************************************


---------------------------------------------------------------------------------------
                    Send BW Test
 Dual-port       : OFF          Device         : mlx5_2
 Number of qps   : 1            Transport type : IB
 Connection type : UD           Using SRQ      : OFF
 RX depth        : 1000
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x010f PSN 0xc8809d
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:10:10:05
 remote address: LID 0000 QPN 0x0111 PSN 0x7a219e
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:10:10:09
---------------------------------------------------------------------------------------
 ethernet_read_keys: Couldn't read remote address
 Unable to read to socket/rdam_cm
 Failed to exchange data between server and clients

Client:

$ ib_send_bw -d mlx5_2 -c UD -i 1 -F -p 18521 --report_gbits --run_infinitely 10.10.10.5


---------------------------------------------------------------------------------------
                    Send BW Test
 Dual-port       : OFF          Device         : mlx5_2
 Number of qps   : 1            Transport type : IB
 Connection type : UD           Using SRQ      : OFF
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x0111 PSN 0x7a219e
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:10:10:09
 remote address: LID 0000 QPN 0x010f PSN 0xc8809d
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:10:10:05
---------------------------------------------------------------------------------------
libibverbs: resolver: Neighbour doesn't have a hw addr
libibverbs: resolver: Unspecific failure
libibverbs: Neigh resolution process failed
 Failed to create AH for UD
 Unable to Connect the HCA's through the link

Any thoughts or suggestions would be most appreciated.

Thanks,

Terry

Please configure your devices to be on different IP subnets (e.g., 10.x.x.x, 11.x.x.x, 12.x.x.x) and see if that helps. With several interfaces on the same subnet, Linux neighbour (ARP) resolution can go out the wrong interface, which is consistent with the "Neighbour doesn't have a hw addr" error above.
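A minimal sketch of that kind of re-addressing, assuming the third NIC pair appears as ens3f0 on both machines (the interface name and addresses are placeholders; adjust to your setup):

# Server side: give the third NIC its own subnet (hypothetical interface name)
$ sudo ip addr flush dev ens3f0
$ sudo ip addr add 12.12.12.5/24 dev ens3f0

# Client side: same new subnet, different host address
$ sudo ip addr flush dev ens3f0
$ sudo ip addr add 12.12.12.9/24 dev ens3f0

# Re-run the third instance against the server NIC's new address
$ ib_send_bw -d mlx5_2 -c UD -i 1 -F -p 18521 --report_gbits --run_infinitely 12.12.12.5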

Be sure to use the same software stack, Mellanox OFED, on both sides. If the issue happens with the inbox version of the driver, the question needs to be addressed to Red Hat; Mellanox support is limited to the Mellanox OFED stack.

Hi Aleksey,

I am using Mellanox OFED on both sides (OFED-4.7-1.0.0). Changing the NICs to be on different subnets fixed the problem. Now all four NIC pairs are running ib_send_bw simultaneously. Thanks.
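For anyone hitting this later, a minimal sketch of launching all four pairs at once, assuming one distinct port per instance and placeholder server addresses (10.10.10.5, 11.11.11.5, 12.12.12.5, 13.13.13.5) for the re-addressed NICs:

# Server side: one listener per device, each on its own port
$ for i in 0 1 2 3; do ib_send_bw -d mlx5_$i -c UD -p $((18515+i)) --report_gbits -i 1 -F --run_infinitely & done

# Client side: connect each device to its peer's address and port
$ addrs=(10.10.10.5 11.11.11.5 12.12.12.5 13.13.13.5)
$ for i in 0 1 2 3; do ib_send_bw -d mlx5_$i -c UD -p $((18515+i)) --report_gbits -i 1 -F --run_infinitely ${addrs[$i]} & done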

I have a follow-up question: ib_send_bw reports the bandwidth being achieved, but it does not report losses. Would you happen to know of a test program that reports both bandwidth and any packet/data losses encountered, for RDMA UD?

Thanks again,

Terry

Nothing comes to mind. However, if you are using Mellanox hardware, you might be interested in looking into the hardware counters to see whether the amount of data sent matches the amount received, and whether there are any drops/errors/discards: https://community.mellanox.com/s/article/understanding-mlx5-linux-counters-and-status-parameters#jive_content_id_HW_Counters_RDMA_diagnostics
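A minimal sketch of that kind of check, using the sysfs counter paths described in the linked article (counter names vary with driver/firmware version, so treat these as examples):

# Compare data counters on both ends; port_xmit_data/port_rcv_data count in
# units of 4 bytes (32-bit words), per the InfiniBand spec
$ cat /sys/class/infiniband/mlx5_2/ports/1/counters/port_xmit_data   # run on the client
$ cat /sys/class/infiniband/mlx5_2/ports/1/counters/port_rcv_data    # run on the server

# Look for receive-side drops, e.g. out_of_buffer (no receive WQE posted in time)
$ grep -H . /sys/class/infiniband/mlx5_2/ports/1/hw_counters/out_of_buffer
$ grep -H . /sys/class/infiniband/mlx5_2/ports/1/counters/port_rcv_errors

# Ethernet-level drops/discards on the corresponding netdev (hypothetical name)
$ ethtool -S ens3f0 | grep -Ei 'drop|discard|error'

Sampling these before and after a run and comparing the deltas on both sides gives a rough sent-versus-received check.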