Hi all,
I read that Selective Repeat (SR) is supported for RoCE traffic starting with ConnectX-6 Dx NICs. However, in my experiments it does not seem to take effect, even though I have enabled it with mlxconfig.
Topology and Settings
RoCE Client machine (C) <---> [Port 0] Middlebox machine (M) [Port 1] <---> RoCE server machine (S)
Client setting
$ lspci | grep -i ether
...Mellanox Technologies MT2910 Family [ConnectX-7]...
$ ibv_devinfo
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 28.43.3608
...
$ sudo mlxconfig -d /dev/mst/mt4129_pciconf0 q | grep -i -e rdma -e roce
ROCE_CC_RTT_TIMESTAMP_FORMAT DEVICE_DEFAULT(0)
ROCE_CC_DCQCN_COMPATIBILITY_MODE DEVICE_DEFAULT(0)
ROCE_CC_LEGACY_DCQCN_SW False(0)
ROCE_CONTROL ROCE_ENABLE(2)
ROCE_NEXT_PROTOCOL 254
ROCE_CC_PRIO_MASK_P1 255
ROCE_CC_CNP_MODERATION_P1 DEVICE_DEFAULT(0)
ROCE_CC_SHAPER_COALESCE_P1 DEVICE_DEFAULT(0)
ROCE_RTT_RESP_DSCP_P1 0
ROCE_RTT_RESP_DSCP_MODE_P1 DEVICE_DEFAULT(0)
**RDMA_SELECTIVE_REPEAT_EN True(1)**
ROCE_ADAPTIVE_ROUTING_EN True(1)
Using command: ib_write_bw 10.2.94.2 -d mlx5_0 -R -c RC -m 1024 --report_gbits -F -s 32768 -n 100
Note that RDMA_SELECTIVE_REPEAT_EN is enabled via sudo mlxconfig -d /dev/mst/mt4129_pciconf0 set RDMA_SELECTIVE_REPEAT_EN=1 and then sudo mlxfwreset -d /dev/mst/mt4129_pciconf0 -l 3 reset to apply it.
Server setting
$ lspci | grep -i ether
...Mellanox Technologies MT2910 Family [ConnectX-7]...
$ ibv_devinfo
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 28.43.2566
...
$ sudo mlxconfig -d /dev/mst/mt4129_pciconf0 q | grep -i -e roce -e rdma
ROCE_NEXT_PROTOCOL 254
**RDMA_SELECTIVE_REPEAT_EN True(1)**
ROCE_ADAPTIVE_ROUTING_EN True(1)
ROCE_CC_DCQCN_COMPATIBILITY_MODE DEVICE_DEFAULT(0)
ROCE_CC_LEGACY_DCQCN_SW False(0)
ROCE_CC_PRIO_MASK_P1 255
ROCE_CC_CNP_MODERATION_P1 DEVICE_DEFAULT(0)
ROCE_CC_SHAPER_COALESCE_P1 DEVICE_DEFAULT(0)
ROCE_RTT_RESP_DSCP_P1 0
ROCE_RTT_RESP_DSCP_MODE_P1 DEVICE_DEFAULT(0)
ROCE_CONTROL ROCE_ENABLE(2)
Using command: ib_write_bw -d mlx5_0 -R -c RC -m 1024 --report_gbits -F -s 32768 -n 100
Middlebox setting
71:00.0 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
71:00.2 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function (rev 01)
71:00.3 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function (rev 01)
Port 0 and Port 1 are VFs on the BF-3 NIC. The RoCE feature is disabled on the middlebox so that RDMA packets are seen by the CPU. The machine runs a simple Click-based network function that forwards packets in both directions and drops C to S packets with a fixed probability (S to C traffic is not affected).
It also dumps the packets that are not dropped to pcap files.
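For anyone who wants to check the PSNs in the pcap files without a GUI, here is a minimal sketch of how the PSN can be pulled out of a RoCEv2 packet. It assumes you already have the UDP payload (destination port 4791) as bytes, so that the 12-byte Base Transport Header (BTH) starts at offset 0. The function name parse_bth and the returned dictionary keys are my own, not part of any library.

```python
def parse_bth(udp_payload: bytes) -> dict:
    """Decode the fields of the 12-byte InfiniBand BTH that matter for PSN analysis.

    Assumption: udp_payload is the payload of a RoCEv2 UDP datagram,
    so the BTH begins at byte 0.
    """
    if len(udp_payload) < 12:
        raise ValueError("payload shorter than a 12-byte BTH")
    return {
        "opcode": udp_payload[0],                              # e.g. RC RDMA WRITE First/Middle/Last
        "dest_qp": int.from_bytes(udp_payload[5:8], "big"),    # 24-bit destination QP number
        "ack_req": bool(udp_payload[8] & 0x80),                # AckReq bit (top bit of byte 8)
        "psn": int.from_bytes(udp_payload[9:12], "big"),       # 24-bit packet sequence number
    }
```

Reading the pcap itself (for example with a packet library of your choice) is left out on purpose, since that part depends on how the capture was written.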
Results
The packet with PSN 955760 is the first one dropped. This is visible in the capture because the PSNs of packets 41 and 42 are not consecutive.
It is then retransmitted as packet 310. At that point the RNIC has just finished sending a message (packet 309) and goes back to an earlier PSN (956060 → 955760).
However, after retransmitting 955760, the NIC also resends 955761 and 955762, which were not dropped earlier. This is typical Go-Back-N (GBN) behavior rather than SR, since the sender does not just "fill the hole".
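The check I am doing by eye can be sketched as follows. Because the middlebox only dumps packets that were not dropped, any PSN that shows up twice in the C to S capture was delivered the first time and resent anyway. Under pure SR I would expect no such PSNs; under GBN there should be some after every loss. The function name is hypothetical, and the sketch assumes a single QP and no 24-bit PSN wraparound within the trace.

```python
def redundant_retransmissions(psns):
    """Return PSNs that appear again after having already been forwarded once.

    Assumptions: psns lists the PSNs of forwarded C-to-S packets for one QP,
    in capture order, and the PSN space does not wrap within the trace.
    """
    seen = set()
    redundant = []
    for psn in psns:
        if psn in seen:
            redundant.append(psn)   # delivered earlier, yet sent again
        else:
            seen.add(psn)
    return redundant


# Shortened trace modelled on the capture above: 955760 is missing the first
# time, then resent together with its already-delivered successors.
trace = [955759, 955761, 955762, 956060, 955760, 955761, 955762]
print(redundant_retransmissions(trace))  # [955761, 955762]
```

A non-empty result is only a hint, not proof, since a retransmission that is itself dropped by the middlebox, or a timeout-driven resend, can also change the picture.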
Questions
- Is my method of enabling SR correct? If not, what is the idiomatic way to do that?
- Do I need additional software configuration to enable SR? Currently I am using ib_write_bw with RDMA CM (the -R option). Does that automatically use SR? If not, how can I enable it in software?
Thanks,
Fengkai

