The SoftRoCE versions: rdma-core version is 26.0 and perftest version is 4.8, both the latest versions at https://github.com/linux-rdma/. The MLNX_OFED version on the hard RoCE side is MLNX_OFED_LINUX-4.5-1.0.1.0-ubuntu16.04-x86_64.tgz. We also tried MLNX_OFED 4.7; it does not work either.
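For reference, a sketch of how the rxe device can be created and the installed versions checked on the SoftRoCE host (the netdev name `enp0s8` is a placeholder; with rdma-core 26 the older `rxe_cfg` script may be what is available instead of `rdma link`):

```shell
# Create a SoftRoCE (rxe) device on top of an Ethernet NIC.
# "enp0s8" is a placeholder netdev name; substitute your own.
sudo rdma link add rxe0 type rxe netdev enp0s8   # iproute2 with rdma support
# Older rdma-core releases ship rxe_cfg instead:
# sudo rxe_cfg add enp0s8

# Confirm the device is visible to userspace.
ibv_devices
ibv_devinfo -d rxe0
```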
rping result:
On the hard roce client:
root@train-gpu9:~ # rping -cd -a 192.168.0.20 -p 1234 -C 1
created cm_id 0xdd8d90
cma_event type RDMA_CM_EVENT_ADDR_RESOLVED cma_id 0xdd8d90 (parent)
cma_event type RDMA_CM_EVENT_ROUTE_RESOLVED cma_id 0xdd8d90 (parent)
rdma_resolve_addr - rdma_resolve_route successful
created pd 0xdd84a0
created channel 0xdd8540
created cq 0xddaff0
created qp 0xddd308
rping_setup_buffers called on cb 0xdd53c0
allocated & registered buffers…
cq_thread started.
cma_event type RDMA_CM_EVENT_ESTABLISHED cma_id 0xdd8d90 (parent)
ESTABLISHED
rmda_connect successful
RDMA addr dd60f0 rkey 24fb2 len 64
send completion
recv completion
RDMA addr dd8560 rkey 242a5 len 64
send completion
recv completion
cma_event type RDMA_CM_EVENT_DISCONNECTED cma_id 0xdd8d90 (parent)
client DISCONNECT EVENT…
rping_free_buffers called on cb 0xdd53c0
destroy cm_id 0xdd8d90
11:18 root@train-gpu9:~ #
On the softroce server:
root@train-gpu10:~# rping -sd -a 192.168.0.20 -p 1234 -C 1
port 1234
count 1
created cm_id 0x1760820
rdma_bind_addr successful
rdma_listen
cma_event type RDMA_CM_EVENT_CONNECT_REQUEST cma_id 0x7fe504000a30 (child)
child cma 0x7fe504000a30
created pd 0x17580b0
created channel 0x17580d0
created cq 0x1763070
created qp 0x1763120
rping_setup_buffers called on cb 0x17573c0
allocated & registered buffers…
accepting client connection request
cq_thread started.
cma_event type RDMA_CM_EVENT_ESTABLISHED cma_id 0x7fe504000a30 (child)
ESTABLISHED
recv completion
Received rkey 24fb2 addr dd60f0 len 64 from peer
server received sink adv
server posted rdma read req
rdma read completion
server received read complete
server posted go ahead
send completion
recv completion
Received rkey 242a5 addr dd8560 len 64 from peer
server received sink adv
rdma write from lkey 1258 laddr 17632a0 len 64
rdma write completion
server rdma write complete
server posted go ahead
send completion
cma_event type RDMA_CM_EVENT_DISCONNECTED cma_id 0x7fe504000a30 (child)
server DISCONNECT EVENT…
wait for RDMA_READ_ADV state 10
rping_free_buffers called on cb 0x17573c0
destroy cm_id 0x1760820
root@train-gpu10:~#
ib_send_bw result:
On the hard roce client:
root@train-gpu9:/etc/libibverbs.d # ib_send_bw -a -d mlx5_0 192.168.0.20
Requested SQ size might be too big. Try reducing TX depth and/or inline size.
Current TX depth is 128 and inline size is 0 .
Send BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
TX depth : 128
CQ Moderation : 100
Mtu : 1024[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
local address: LID 0000 QPN 0x0106 PSN 0x7511e0
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:00:19
remote address: LID 0000 QPN 0x0011 PSN 0xaaa186
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:00:20
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
Completion with error at client
Failed status 12: wr_id 0 syndrom 0x81
scnt=128, ccnt=0
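Status 12 here corresponds to IBV_WC_RETRY_EXC_ERR (transport retry counter exceeded), i.e. the client's sends were never acknowledged by the SoftRoCE side. One thing worth trying (a sketch, not a confirmed fix) is letting rdma_cm negotiate the connection on both ends instead of the raw Ethernet exchange, with a fixed message size and modest queue depth rather than `-a`:

```shell
# Sketch only: run the test through rdma_cm (-R) so both ends pick
# matching GIDs/RoCE versions automatically, with a small TX depth.
# On the SoftRoCE server:
ib_send_bw -d rxe0 -R --tx-depth=16 -s 1024
# On the hard RoCE client:
ib_send_bw -d mlx5_0 -R --tx-depth=16 -s 1024 192.168.0.20
```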
On the softroce server:
root@train-gpu10:~# ib_send_bw -a
- Waiting for client to connect… *
Send BW Test
Dual-port : OFF Device : rxe0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
RX depth : 512
CQ Moderation : 100
Mtu : 1024[B]
Link type : Ethernet
GID index : 1
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
local address: LID 0000 QPN 0x0011 PSN 0xaaa186
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:00:20
remote address: LID 0000 QPN 0x0106 PSN 0x7511e0
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:00:19
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
^C
root@train-gpu10:~#
Why can't they work together? Can you tell us the correct versions of SoftRoCE and hard RoCE that can work together?
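In case it helps diagnose, here is a sketch for checking that both NICs expose a RoCE v2 GID for the 192.168.0.x addresses and that the MTUs match end to end (rxe supports RoCE v2 only, so a v1/v2 mismatch between the chosen GID indexes is one plausible culprit; paths below are the standard sysfs layout and assume port 1):

```shell
# List non-zero GIDs and their RoCE types for every RDMA device, port 1.
for d in /sys/class/infiniband/*; do
  dev=$(basename "$d")
  for g in "$d"/ports/1/gids/*; do
    idx=$(basename "$g")
    gid=$(cat "$g")
    type=$(cat "$d/ports/1/gid_attrs/types/$idx" 2>/dev/null)
    [ "$gid" != "0000:0000:0000:0000:0000:0000:0000:0000" ] && \
      echo "$dev gid[$idx]=$gid ($type)"
  done
done

# MTU mismatch between the two hosts can also break RC traffic.
ip link show | grep mtu
```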