Selective repeat does not seem to work on CX-7 NICs even when enabled in mlxconfig

Hi all,

I read that, starting from CX-6 DX NICs, Selective Repeat (SR) is supported for RoCE traffic. However, according to my experiments, it does not seem to take effect even though I have enabled it in mlxconfig.

Topology and Settings

RoCE Client machine (C) <---> [Port 0] Middlebox machine (M) [Port 1] <---> RoCE server machine (S)

Client setting

$ lspci | grep -i ether
...Mellanox Technologies MT2910 Family [ConnectX-7]...

$ ibv_devinfo
hca_id: mlx5_0
        transport:                      InfiniBand (0)
        fw_ver:                         28.43.3608
        ...

$ sudo mlxconfig -d /dev/mst/mt4129_pciconf0 q | grep -i -e rdma -e roce
        ROCE_CC_RTT_TIMESTAMP_FORMAT                DEVICE_DEFAULT(0)   
        ROCE_CC_DCQCN_COMPATIBILITY_MODE            DEVICE_DEFAULT(0)   
        ROCE_CC_LEGACY_DCQCN_SW                     False(0)            
        ROCE_CONTROL                                ROCE_ENABLE(2)      
        ROCE_NEXT_PROTOCOL                          254                 
        ROCE_CC_PRIO_MASK_P1                        255                 
        ROCE_CC_CNP_MODERATION_P1                   DEVICE_DEFAULT(0)   
        ROCE_CC_SHAPER_COALESCE_P1                  DEVICE_DEFAULT(0)   
        ROCE_RTT_RESP_DSCP_P1                       0                   
        ROCE_RTT_RESP_DSCP_MODE_P1                  DEVICE_DEFAULT(0)   
        RDMA_SELECTIVE_REPEAT_EN                    True(1)
        ROCE_ADAPTIVE_ROUTING_EN                    True(1)

Using command: ib_write_bw 10.2.94.2 -d mlx5_0 -R -c RC -m 1024 --report_gbits -F -s 32768 -n 100

Note that RDMA_SELECTIVE_REPEAT_EN is enabled via sudo mlxconfig -d /dev/mst/mt4129_pciconf0 set RDMA_SELECTIVE_REPEAT_EN=1 and then sudo mlxfwreset -d /dev/mst/mt4129_pciconf0 -l 3 reset to apply it.

Server setting

$ lspci | grep -i ether
...Mellanox Technologies MT2910 Family [ConnectX-7]...

$ ibv_devinfo
hca_id: mlx5_0
        transport:                      InfiniBand (0)
        fw_ver:                         28.43.2566
        ...

$ sudo mlxconfig -d /dev/mst/mt4129_pciconf0 q | grep -i -e roce -e rdma
        ROCE_NEXT_PROTOCOL                          254
        RDMA_SELECTIVE_REPEAT_EN                    True(1)
        ROCE_ADAPTIVE_ROUTING_EN                    True(1)
        ROCE_CC_DCQCN_COMPATIBILITY_MODE            DEVICE_DEFAULT(0)
        ROCE_CC_LEGACY_DCQCN_SW                     False(0)
        ROCE_CC_PRIO_MASK_P1                        255
        ROCE_CC_CNP_MODERATION_P1                   DEVICE_DEFAULT(0)
        ROCE_CC_SHAPER_COALESCE_P1                  DEVICE_DEFAULT(0)
        ROCE_RTT_RESP_DSCP_P1                       0
        ROCE_RTT_RESP_DSCP_MODE_P1                  DEVICE_DEFAULT(0)
        ROCE_CONTROL                                ROCE_ENABLE(2)

Using command: ib_write_bw -d mlx5_0 -R -c RC -m 1024 --report_gbits -F -s 32768 -n 100
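The client and server query outputs above list the same parameters in a different order, which makes them hard to compare by eye. Below is a minimal sketch of a comparison helper. It assumes the two-column "NAME VALUE" layout shown in this post; the function names and the shortened sample strings are only for illustration and are not part of any NVIDIA tool.

```python
import re


def parse_mlxconfig_query(text):
    """Parse 'NAME VALUE' lines (layout assumed from the post) into a dict."""
    settings = {}
    for line in text.splitlines():
        # Tolerate '**' emphasis markers around a highlighted line.
        parts = line.replace("*", " ").split()
        if len(parts) == 2 and re.fullmatch(r"[A-Z0-9_]+", parts[0]):
            settings[parts[0]] = parts[1]
    return settings


def diff_settings(a, b):
    """Return {name: (value_a, value_b)} for every parameter that differs."""
    names = sorted(set(a) | set(b))
    return {n: (a.get(n), b.get(n)) for n in names if a.get(n) != b.get(n)}


# Shortened samples copied from the client and server outputs in this post.
client_text = """
        ROCE_CONTROL                                ROCE_ENABLE(2)
        ROCE_NEXT_PROTOCOL                          254
        RDMA_SELECTIVE_REPEAT_EN                    True(1)
        ROCE_ADAPTIVE_ROUTING_EN                    True(1)
"""
server_text = """
        ROCE_NEXT_PROTOCOL                          254
        RDMA_SELECTIVE_REPEAT_EN                    True(1)
        ROCE_ADAPTIVE_ROUTING_EN                    True(1)
        ROCE_CONTROL                                ROCE_ENABLE(2)
"""

client = parse_mlxconfig_query(client_text)
server = parse_mlxconfig_query(server_text)
print(diff_settings(client, server))
```

An empty result means the queried parameters match on both hosts. It says nothing about the firmware versions, which differ between the two machines (28.43.3608 vs 28.43.2566).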

Middlebox setting

71:00.0 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
71:00.2 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function (rev 01)
71:00.3 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function (rev 01)

Port 0 and Port 1 are VFs on the BF-3 NIC. The RoCE feature is disabled on them so that RDMA packets are seen by the CPU. The machine runs a simple Click-based network function that forwards packets in both directions and drops C to S packets with a fixed probability (S to C traffic is not affected).
It also dumps the packets that are not dropped to pcap files.

Results

The packet with PSN 955760 is dropped first (the PSNs of packets 41 and 42 in the capture are not consecutive).

It is then retransmitted at packet 310. At that point, the RNIC has just finished sending a message (packet 309) and goes back to an earlier PSN (956060 → 955760).

However, when retransmitting 955760, the NIC also resends 955761 and 955762, which were not dropped earlier. This is typical go-back-N (GBN) behavior rather than SR, since it does not only “fix the hole”.
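The check described above can be automated. The sketch below walks a list of PSNs in capture order (one direction, one QP) and separates retransmissions that fill a hole from retransmissions of PSNs the middlebox had already forwarded. It assumes the capture contains only packets that were not dropped, and that the trace is short enough that the 24-bit PSN space is not reused. The function names and the sample PSN values are made up for illustration; extracting PSNs from the pcap is not shown.

```python
PSN_MOD = 1 << 24  # BTH PSNs are 24 bits wide and wrap around


def analyze_psn_trace(psns):
    """Classify PSNs from one direction of one QP, in capture order."""
    seen = set()       # PSNs captured at least once
    holes = set()      # PSNs skipped over and not captured yet
    filled = []        # holes that were retransmitted later
    resent_seen = []   # PSNs sent again although already captured
    expected = None
    for psn in psns:
        if psn in seen:
            resent_seen.append(psn)
        else:
            if expected is not None:
                gap = (psn - expected) % PSN_MOD
                if 0 < gap < PSN_MOD // 2:
                    # Forward jump: unseen PSNs in between are holes.
                    for i in range(gap):
                        p = (expected + i) % PSN_MOD
                        if p not in seen:
                            holes.add(p)
            if psn in holes:
                holes.remove(psn)
                filled.append(psn)
            seen.add(psn)
        expected = (psn + 1) % PSN_MOD
    return {"filled": filled, "resent_seen": resent_seen,
            "open_holes": sorted(holes)}


def verdict(result):
    """Rough label: any resend of an already captured PSN counts as GBN-like."""
    if result["resent_seen"]:
        return "go-back-N-like"
    if result["filled"]:
        return "selective-repeat-like"
    return "no retransmission observed"


# Hypothetical traces: PSN 102 is dropped once in both.
gbn_trace = [100, 101, 103, 104, 105, 102, 103, 104, 105, 106]
sr_trace = [100, 101, 103, 104, 105, 102, 106]
print(verdict(analyze_psn_trace(gbn_trace)))
print(verdict(analyze_psn_trace(sr_trace)))
```

The label is only a heuristic: a PSN can also be resent for other reasons, such as a timeout, so a "go-back-N-like" result should be confirmed against the capture itself.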

Question

  1. Is my method of enabling SR correct? If not, what is the idiomatic way to do that?
  2. Do I need additional software configuration to enable SR? Currently I am using ib_write_bw with RDMA CM (the -R option). Does that automatically use SR? If not, how do I enable it in software?

Thanks,
Fengkai

Hi Fengkai,

You are doing the right thing.

I suggest upgrading the FW, as it is quite old (we had some issues with previous FW versions; I am not sure whether they affected this one).

If it still does not work, I suggest opening a support case.

Regards,

Yaniv