Dear Mellanox Community,
I created a small application that mimics the concept of ib_send_bw to get familiar with libibverbs and programming for InfiniBand hardware. With the source code of ib_send_bw and resources like rdmamojo.com it was fairly straightforward to get something up and running. However, my own test application performs much worse on IBV_WR_SEND work requests than ib_send_bw does.
The code is available here:
GitHub - stnot/ib_test https://github.com/stnot/ib_test
I run this on a cluster of two nodes connected through an 18-port Mellanox 56 Gbit/s switch, with 56 Gbit/s HCAs installed in the nodes. Please let me know if you need more information about my setup to analyze this issue.
Running ib_send_bw -a prints the following output:
Send BW Test
Dual-port : OFF Device : mlx4_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
RX depth : 512
CQ Moderation : 100
Mtu : 2048[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
local address: LID 0x04 QPN 0x02bd PSN 0x8f0e46
remote address: LID 0x08 QPN 0x0341 PSN 0xd9ee2e
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
2 1000 0.00 11.00 5.765330
4 1000 0.00 35.07 9.192796
8 1000 0.00 72.67 9.524639
16 1000 0.00 145.20 9.516145
32 1000 0.00 276.78 9.069401
64 1000 0.00 584.29 9.573034
128 1000 0.00 1173.28 9.611550
256 1000 0.00 2108.89 8.637993
512 1000 0.00 3693.42 7.564126
1024 1000 0.00 4143.79 4.243243
2048 1000 0.00 4385.30 2.245272
4096 1000 0.00 4457.75 1.141185
8192 1000 0.00 4486.35 0.574253
16384 1000 0.00 4509.63 0.288616
32768 1000 0.00 4514.77 0.144473
65536 1000 0.00 4517.63 0.072282
131072 1000 0.00 4518.87 0.036151
262144 1000 0.00 4519.43 0.018078
524288 1000 0.00 4519.53 0.009039
1048576 1000 0.00 4519.82 0.004520
2097152 1000 0.00 4519.94 0.002260
4194304 1000 0.00 4519.97 0.001130
8388608 1000 0.00 4519.97 0.000565
I also changed some settings so that the values are (as far as I can tell) comparable to the current settings of my own implementation:
ib_send_bw --rx-depth=100 --tx-depth=100 --size=1024 --iters=100000
Send BW Test
Dual-port : OFF Device : mlx4_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
RX depth : 100
CQ Moderation : 100
Mtu : 2048[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
local address: LID 0x04 QPN 0x02be PSN 0xd3bf9c
remote address: LID 0x08 QPN 0x0342 PSN 0x13b557
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
1024 100000 0.00 2842.02 2.910229
Running my own code linked above, I get much lower throughput than the ib_send_bw results:
./ib_server 1024 10000 msg
*** 10000 MESSAGE_SEND resulted in an average latency of 8.50us ***
114.918097 MB/sec
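Doing the math on that result: 1024 bytes / 8.50 us ≈ 120 * 10^6 bytes/s, which is just about the 114.9 MB/sec reported above. So the throughput appears to be completely determined by the per-message latency, as if only one message is in flight at a time.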
I profiled the various sections of my code. The ibv_poll_cq loops consume over 99% of the execution time and return 0 (no work completion) most of the time. I suspect that something is misconfigured and adds extra processing time to each send and/or receive request posted to the queue, but I haven't been able to figure out the exact cause so far.
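If I understand ib_send_bw correctly, it keeps many sends outstanding (up to tx-depth) and drains the CQ in batches, whereas I suspect my loop effectively waits for each completion before posting the next send. A simplified sketch of the pipelined pattern I mean (variable names, batch size and setup are just my assumptions, not the actual perftest code; qp, cq and sge are created elsewhere and the remote side is assumed to keep enough receives posted):

#include <infiniband/verbs.h>

/* Sketch of a pipelined send loop: keep up to tx_depth sends in flight
 * and poll completions in batches. Requires max_send_wr >= tx_depth. */
static int run_sends(struct ibv_qp *qp, struct ibv_cq *cq,
                     struct ibv_sge *sge, int iters, int tx_depth)
{
    struct ibv_send_wr wr = {0}, *bad_wr = NULL;
    struct ibv_wc wc[16];
    int posted = 0, completed = 0;

    wr.sg_list    = sge;
    wr.num_sge    = 1;
    wr.opcode     = IBV_WR_SEND;
    wr.send_flags = IBV_SEND_SIGNALED;  /* every WR signaled here; signaling only every Nth WR would reduce completions further */

    while (completed < iters) {
        /* keep up to tx_depth sends outstanding instead of one at a time */
        while (posted < iters && posted - completed < tx_depth) {
            if (ibv_post_send(qp, &wr, &bad_wr))
                return -1;
            posted++;
        }

        /* drain completions in batches so ibv_poll_cq rarely returns 0 */
        int n = ibv_poll_cq(cq, 16, wc);
        if (n < 0)
            return -1;
        for (int i = 0; i < n; i++) {
            if (wc[i].status != IBV_WC_SUCCESS)
                return -1;
            completed++;
        }
    }
    return 0;
}

Even with every WR signaled as in this sketch, keeping tx_depth sends in flight should hide most of the per-message latency, and, if I read the output above correctly, ib_send_bw additionally signals only every 100th send (CQ Moderation : 100) to reduce completion overhead. Is that the kind of difference that would explain my numbers, or is something else wrong in my code?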
I would appreciate it if someone from the community could take a look at my code and point out anywhere I am using libibverbs incorrectly or inefficiently, or any improperly configured parameters that cause this performance loss. If you need more data about my setup or any other information that would help analyze this issue, please let me know and I will gladly provide it.