Own libibverbs application similar to ib_send_bw with low throughput

Dear Mellanox Community,

I created a small application that mimics the concept of ib_send_bw to get familiar with libibverbs and programming for InfiniBand hardware. With the source code of ib_send_bw and resources like rdmamojo.com, it was straightforward to get something up and running. However, my test application achieves much lower throughput with IBV_WR_SEND work requests than ib_send_bw does.

The code is available here:

GitHub - stnot/ib_test https://github.com/stnot/ib_test
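
The send path of the benchmark loop boils down to roughly the following (a simplified sketch, not a verbatim excerpt from the repository; qp, cq, mr and the message buffer stand for whatever my setup code created earlier):

#include <stdio.h>
#include <stdint.h>
#include <infiniband/verbs.h>

/* Simplified sketch of one send + completion round trip; not the exact
 * code from the repository. qp, cq, mr and buf are assumed to have been
 * created by the usual setup (device, PD, CQ, QP in RTS, registered MR). */
static int send_one(struct ibv_qp *qp, struct ibv_cq *cq,
                    struct ibv_mr *mr, void *buf, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED,  /* every send asks for a completion */
    };
    struct ibv_send_wr *bad_wr = NULL;

    if (ibv_post_send(qp, &wr, &bad_wr)) {
        fprintf(stderr, "ibv_post_send failed\n");
        return -1;
    }

    /* Busy-poll the CQ until the completion for this one send arrives. */
    struct ibv_wc wc;
    int n;
    do {
        n = ibv_poll_cq(cq, 1, &wc);
    } while (n == 0);

    if (n < 0 || wc.status != IBV_WC_SUCCESS) {
        fprintf(stderr, "send completion failed\n");
        return -1;
    }
    return 0;
}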

I run this on a cluster of two nodes connected through an 18-port Mellanox 56 Gbit/s switch, with 56 Gbit/s HCAs installed in the nodes. Please let me know if you need more information about my setup to analyze this issue.

Running ib_send_bw -a prints the following output:


                    Send BW Test
 Dual-port       : OFF          Device         : mlx4_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 RX depth        : 512
 CQ Moderation   : 100
 Mtu             : 2048[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet

 local address:  LID 0x04 QPN 0x02bd PSN 0x8f0e46
 remote address: LID 0x08 QPN 0x0341 PSN 0xd9ee2e

 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 2          1000           0.00               11.00                5.765330
 4          1000           0.00               35.07                9.192796
 8          1000           0.00               72.67                9.524639
 16         1000           0.00               145.20               9.516145
 32         1000           0.00               276.78               9.069401
 64         1000           0.00               584.29               9.573034
 128        1000           0.00               1173.28              9.611550
 256        1000           0.00               2108.89              8.637993
 512        1000           0.00               3693.42              7.564126
 1024       1000           0.00               4143.79              4.243243
 2048       1000           0.00               4385.30              2.245272
 4096       1000           0.00               4457.75              1.141185
 8192       1000           0.00               4486.35              0.574253
 16384      1000           0.00               4509.63              0.288616
 32768      1000           0.00               4514.77              0.144473
 65536      1000           0.00               4517.63              0.072282
 131072     1000           0.00               4518.87              0.036151
 262144     1000           0.00               4519.43              0.018078
 524288     1000           0.00               4519.53              0.009039
 1048576    1000           0.00               4519.82              0.004520
 2097152    1000           0.00               4519.94              0.002260
 4194304    1000           0.00               4519.97              0.001130
 8388608    1000           0.00               4519.97              0.000565


I then changed some settings to get values that should be (I assume) comparable to the current settings of my own implementation:

ib_send_bw --rx-depth=100 --tx-depth=100 --size=1024 --iters=100000


                    Send BW Test
 Dual-port       : OFF          Device         : mlx4_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 RX depth        : 100
 CQ Moderation   : 100
 Mtu             : 2048[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet

 local address:  LID 0x04 QPN 0x02be PSN 0xd3bf9c
 remote address: LID 0x08 QPN 0x0342 PSN 0x13b557

 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 1024       100000         0.00               2842.02              2.910229


Running my own code linked above, I get much lower throughput than the ib_send_bw results:

./ib_server 1024 10000 msg
*** 10000 MESSAGE_SEND resulted in an average latency of 8.50us ***
114.918097 MB/sec

I profiled various sections of my code. The ibv_poll_cq loops consume over 99% of the execution time and return 0 (no work completion) most of the time. I suspect that something is not configured correctly and adds extra processing time to each send and/or receive request posted to the queues, but I haven't been able to figure out the exact cause so far.
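
If I understand the perftest source correctly, one big difference is that ib_send_bw keeps up to --tx-depth sends outstanding and, because of CQ moderation, only requests a completion for every 100th send, whereas my loop waits for a completion after every single message. Below is a rough sketch of that pipelined pattern as I understand it (my own simplification, not code from perftest; it assumes the QP was created with sq_sig_all = 0, cq_mod <= tx_depth, and iters being a multiple of cq_mod):

#include <stdint.h>
#include <infiniband/verbs.h>

/* Rough sketch of a pipelined send loop with selective signaling, similar
 * in spirit to ib_send_bw: keep up to tx_depth WRs in flight and request a
 * completion only for every cq_mod-th send. Assumes sq_sig_all = 0 on the
 * QP, cq_mod <= tx_depth, and iters % cq_mod == 0. */
static int send_pipelined(struct ibv_qp *qp, struct ibv_cq *cq,
                          struct ibv_mr *mr, void *buf, uint32_t len,
                          int iters, int tx_depth, int cq_mod)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = len,
        .lkey   = mr->lkey,
    };
    int posted = 0, completed = 0;

    while (completed < iters) {
        /* Fill the send queue up to tx_depth outstanding work requests. */
        while (posted < iters && posted - completed < tx_depth) {
            struct ibv_send_wr wr = {
                .wr_id      = (uint64_t)posted,
                .sg_list    = &sge,
                .num_sge    = 1,
                .opcode     = IBV_WR_SEND,
                /* Signal only every cq_mod-th send. */
                .send_flags = ((posted + 1) % cq_mod == 0) ?
                              IBV_SEND_SIGNALED : 0,
            };
            struct ibv_send_wr *bad_wr = NULL;
            if (ibv_post_send(qp, &wr, &bad_wr))
                return -1;
            posted++;
        }

        /* Reap completions in batches; each one covers cq_mod sends. */
        struct ibv_wc wc[16];
        int n = ibv_poll_cq(cq, 16, wc);
        if (n < 0)
            return -1;
        for (int i = 0; i < n; i++) {
            if (wc[i].status != IBV_WC_SUCCESS)
                return -1;
            completed += cq_mod;
        }
    }
    return 0;
}

The receiver obviously still has to keep enough receives posted (the --rx-depth side of things), but the point is that the sender never spins on ibv_poll_cq for every individual message. I am not sure whether this fully explains the gap I am seeing, which is why I would like a second pair of eyes on my code.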

I would appreciate it if someone from the community could take a look at my code and point out anything I am doing incorrectly or inefficiently with libibverbs, or any improperly configured parameters that cause this performance loss. If you need more data about my setup or any other information that would help analyze this issue, please let me know and I will gladly provide it.