RoCEv2 PFC/ECN Issues

We have two servers with ConnectX-4 100GbE cards and two Cisco C3232C switches with routing between them. We are trying to get RoCEv2 working across the routed link with PFC/ECN to provide the best performance during periods of congestion.

The funny thing is that with the base configuration and no other servers on the switches, we get terrible performance (1.6 Gbps) across the routed link using iSER when we are only using about 20 Gbps (one iSER connection with our test workload configuration). By using multiple iSER connections and PFC, we can get about 95 Gbps, so we know that the hardware is capable of that performance in routing mode. We can’t understand why the performance is so bad in the default case. The fio test shows a burst of IO, then none at all, and it keeps cycling back and forth between the two.

We would like to use both PFC and ECN in our configuration, but first we are trying to validate that ECN works without PFC. When we disable PFC, we can’t test ECN, most likely because of the above issue.

On the Cisco switches, we have policy maps that place our traffic with the DSCP markings into a group that has ECN enabled (I’m not a Cisco person, so I may not be getting the terminology quite right), and we can see the group counters on the Cisco incrementing. We never see any packets marked with congestion, probably because the switch never sees any congestion due to the above problem.

When we have the client set to 40 Gbps and do a read test with PFC, we get pause frames and great performance. We have the Cisco switches match the DSCP value and remark the COS for packets that traverse the router. Interestingly enough, Cisco sends PFC pause frames on the routed link even though there are no VLANs configured. We captured this in Wireshark, but with the adapters set to --trust=pcp the performance is terrible, while --trust=dscp works well. The Cisco switches also show pause frame counters incrementing when we are 100g end to end. I’m not sure why they would be incrementing when there is no congestion.

We have done so many permutations of tests that I may be getting fuzzy on some details. Here is a matrix of the tests that I can be sure of. This is all 100g end to end.

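switch PFC mode (ports) | trust mode         | pfc prio 3 enabled             | skprio → cos mapping                   | Result
static on/off           | mlnx_qos --trust=X | mlnx_qos --pfc=0,0,0,X,0,0,0,0 | ip link set rsY.Z type vlan egress 2:3 |
------------------------+--------------------+--------------------------------+----------------------------------------+-------
on                      | pcp                | yes                            | yes                                    | Good
on                      | pcp                | yes                            | no                                     | Good
on                      | pcp                | no                             | yes                                    | Bad
on                      | pcp                | no                             | no                                     | Bad
on                      | dscp               | yes                            | yes                                    | Good
on                      | dscp               | yes                            | no                                     | Good
on                      | dscp               | no                             | yes                                    | Bad
on                      | dscp               | no                             | no                                     | Bad
off                     | pcp                | yes                            | yes                                    | Bad
off                     | pcp                | yes                            | no                                     | Bad
off                     | pcp                | no                             | yes                                    | Bad
off                     | pcp                | no                             | no                                     | Bad
off                     | dscp               | yes                            | yes                                    | Bad
off                     | dscp               | yes                            | no                                     | Bad
off                     | dscp               | no                             | yes                                    | Bad
off                     | dscp               | no                             | no                                     | Bad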
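
To make the host-side columns of the matrix concrete, here is a minimal dry-run sketch that only prints the mlnx_qos commands for one row instead of running them. The function name host_row_cmds is made up for illustration, the interface name rs8bp2 is the one used in the echo commands in this post, and the use of -i to select the interface is an assumption about how mlnx_qos is invoked here. The switch-side PFC setting and the VLAN egress mapping are left out.

```shell
#!/bin/sh
# Dry run: print (do not execute) the host-side commands for one matrix row.
# host_row_cmds is a hypothetical helper, not part of OFED.
host_row_cmds() {
    iface=$1   # network interface, e.g. rs8bp2
    trust=$2   # trust mode column: pcp or dscp
    pfc3=$3    # "pfc prio 3 enabled" column: yes or no

    # Priority 3 is the fourth field of the --pfc list.
    if [ "$pfc3" = "yes" ]; then bit=1; else bit=0; fi

    echo "mlnx_qos -i $iface --trust=$trust"
    echo "mlnx_qos -i $iface --pfc=0,0,0,$bit,0,0,0,0"
}

# Row: trust mode dscp, pfc prio 3 enabled
host_row_cmds rs8bp2 dscp yes
```

Printing the commands first makes it easier to record exactly which settings were applied for each row before running them by hand.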

We are using OFED 4.4-1.0.0.0 on both nodes. One is CentOS 7.3 and the other CentOS 7.4, running kernel 4.9.116, and the firmware is 12.23.1000 on one card and 12.23.1020 on the other. In addition to the above matrix, we have only changed:

echo 26 > /sys/class/net/rs8bp2/ecn/roce_np/cnp_dscp

echo 106 > /sys/kernel/config/rdma_cm/mlx5_3/ports/1/default_roce_tos
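
For anyone checking these numbers: the ToS byte carries the DSCP in its upper six bits and the ECN field in its lower two, so a ToS of 106 should correspond to DSCP 26 with ECN bits 10 (ECT(0)), which matches the cnp_dscp value of 26. A minimal sketch of that arithmetic follows; tos_split is a made-up helper name for illustration.

```shell
#!/bin/sh
# Split an IP ToS / traffic-class byte into its DSCP and ECN parts.
# tos_split is a hypothetical helper used only for this illustration.
tos_split() {
    tos=$1
    # DSCP = upper 6 bits, ECN = lower 2 bits
    echo "dscp=$(( tos >> 2 )) ecn=$(( tos & 3 ))"
}

tos_split 106   # prints: dscp=26 ecn=2
```

An ECN value of 2 is binary 10, meaning the packets are marked ECN-capable, which the switch needs before it can mark them as congestion-experienced.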

If you have any ideas that we can try, we would appreciate it.

Thank you.

What happens when you run an ib_read_bw test?

Hi Robert,

Please follow this link, Recommended Network Configuration Examples for RoCE Deployment (https://community.mellanox.com/s/article/recommended-network-configuration-examples-for-roce-deployment), to configure the host and the switch. When using a non-Mellanox switch, check with the switch vendor for the corresponding commands.