ConnectX-6 Dx card sending CNP even when there is no ECN-marked RoCE traffic from switches

I’m seeing a behavior where a ConnectX-6 Dx card sends CNPs even when the switches deliver no ECN-marked RoCE traffic. I have 2 servers, each with a ConnectX-6 Dx card, connected to a switch (1 port from each server). With ib_write_bw, Server2 is doing RDMA writes to Server1 at 91Gbps. Server1 occasionally sends CNPs to Server2, but there is no buffer buildup in the switch that would cause ECN marking. Any idea what needs to be checked and tweaked on the server side? I tried Red Hat 8.6 and Ubuntu 22.04 with the latest MLNX OFED installed. Firmware version is 22.36.10.10.

Hi,
We are using the CNP mechanism (in the NIC HW) to reduce the rate in case we observe packet drop on the receiver NIC.
In your case, I suspect Server1 is not fast enough to accept the data (you may want to performance tune it) or drops are somewhere else in the network and thus we identify drops in the HW and send CNP packet to the sender. Server2 will see the CNP packet and will utilize it to reduce the rate of the transmitter.
This feature can be disabled but it is a sign that you have a lossy network and the feature tries to overcome that (and probably does it pretty well based on the BW you described). I suspect you do not have PFC enabled or it is enabled but not utilized.
Regards,
Yaniv

Thanks for the response, Yaniv. Can you please let me know the following:
Which counters should I look at to confirm HW drops? I don’t see any drops in the ethtool output of that interface.
If the packets are not dropped in the HW but somewhere in the fabric, how does ConnectX identify packet drops in the fabric without ECN from the fabric? Is this based on a missing sequence number in the IB header? Are there any counters I can check to confirm that this is what is happening? I’m pretty sure there are no drops in the fabric, but if ConnectX is identifying drops in the fabric via some mechanism, knowing how would help me a lot.
Are there any commands to disable the feature where the server reacts to CNP and reduces the rate? Can you provide some guidance on how to do that? It would be helpful for testing PAUSE generation.
Also, apart from the receive buffer, DSCP, and priority settings, what other tuning can I do to make this better?
DCBX mode: OS controlled
Priority trust state: dscp
dscp2prio mapping:
prio:0 dscp:07,06,05,04,03,02,01,00,
prio:1 dscp:15,14,13,12,11,10,09,08,
prio:2 dscp:23,22,21,20,19,18,17,16,
prio:3 dscp:31,30,29,28,27,26,25,24,
prio:4 dscp:39,38,37,36,35,34,33,32,
prio:5 dscp:47,46,45,44,43,42,41,40,
prio:6 dscp:55,54,53,52,51,50,49,48,
prio:7 dscp:63,62,61,60,59,58,57,56,
default priority:
Receive buffer size (bytes): 130896,130896,0,0,0,0,0,0,max_buffer_size=1027728
Cable len: 7
PFC configuration:
priority 0 1 2 3 4 5 6 7
enabled 0 0 0 1 0 0 0 0
buffer 0 0 0 1 0 0 0 0
tc: 0 ratelimit: unlimited, tsa: vendor
priority: 1
tc: 1 ratelimit: unlimited, tsa: vendor
priority: 0
tc: 2 ratelimit: unlimited, tsa: vendor
priority: 2
tc: 3 ratelimit: unlimited, tsa: vendor
priority: 3
tc: 4 ratelimit: unlimited, tsa: vendor
priority: 4
tc: 5 ratelimit: unlimited, tsa: vendor
priority: 5
tc: 6 ratelimit: unlimited, tsa: vendor
priority: 6
tc: 7 ratelimit: unlimited, tsa: vendor
priority: 7

I do have PFC enabled, but the servers are reacting to something and reducing the rate, which prevents the switches from even marking ECN, which comes before PAUSE in my configuration.

When packets arrive they have a sequence number. If one (or more) are dropped in the fabric, then we would expect to see these counters increase:
on requestor: /sys/class/infiniband/<mlx5_X>/ports/1/hw_counters/packet_seq_err
on responder: /sys/class/infiniband/<mlx5_X>/ports/1/hw_counters/out_of_sequence
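
One way to check these counters is to snapshot them before and after the ib_write_bw run and look at the difference. The following is a minimal sketch, not an official tool. The device name mlx5_0 is a placeholder, and np_cnp_sent and rp_cnp_handled are additional counters that the mlx5 driver typically exposes in the same hw_counters directory (an assumption about your driver version; files that do not exist are simply skipped).

```python
from pathlib import Path

# Counters named in the reply above, plus two CNP counters that are
# assumed to be present in the same directory (skipped if missing).
COUNTERS = ("packet_seq_err", "out_of_sequence", "np_cnp_sent", "rp_cnp_handled")


def read_counters(device, port=1, root="/sys/class/infiniband", names=COUNTERS):
    """Return {counter_name: value} for the counters that exist on this device."""
    base = Path(root) / device / "ports" / str(port) / "hw_counters"
    values = {}
    for name in names:
        path = base / name
        if path.exists():
            values[name] = int(path.read_text().strip())
    return values


def counter_delta(before, after):
    """Return how much each counter grew between two snapshots."""
    return {name: after[name] - before[name] for name in after if name in before}


# Example use (mlx5_0 is a placeholder device name):
#   before = read_counters("mlx5_0")
#   ... run ib_write_bw ...
#   after = read_counters("mlx5_0")
#   print(counter_delta(before, after))
```

A non-zero delta for out_of_sequence on the responder during a run in which CNPs were sent would support the explanation given here; a delta of zero would point elsewhere.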

If they are dropped in the HW, there are the “ethtool -S <interface>” counters to review (look for any difference before/after the test). There are also more counters internally; for further review I suggest opening a support case.
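
The before/after comparison of the ethtool counters can be scripted as well. This is a rough sketch that assumes the usual "name: value" layout of ethtool -S output. The keyword filter (discard, drop, pause, out_of_buffer) is my own guess at which counter names are relevant, not an official list, so widen it as needed.

```python
import subprocess


def parse_ethtool_stats(text):
    """Parse 'ethtool -S' output into {counter_name: int}; non-numeric lines are ignored."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        value = value.strip()
        if sep and value.lstrip("-").isdigit():
            stats[name.strip()] = int(value)
    return stats


def snapshot(interface):
    """Run 'ethtool -S <interface>' and return the parsed counters."""
    result = subprocess.run(
        ["ethtool", "-S", interface], capture_output=True, text=True, check=True
    )
    return parse_ethtool_stats(result.stdout)


def changed(before, after, keywords=("discard", "drop", "pause", "out_of_buffer")):
    """Return the growth of counters whose name contains one of the keywords."""
    diff = {}
    for name, value in after.items():
        delta = value - before.get(name, 0)
        if delta != 0 and any(word in name for word in keywords):
            diff[name] = delta
    return diff


# Example use (the interface name is a placeholder):
#   before = snapshot("ens1f0")
#   ... run ib_write_bw ...
#   print(changed(before, snapshot("ens1f0")))
```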

To disable the CNP handling please disable all RoCE accelerations (Follow my post Mellanox Interconnect Community).

To generate traffic correctly so that PFC will work, please make sure the command line of the ib_write_bw benchmark includes “--tclass=96”. PFC is enabled on priority 3, which DSCP 24 maps to (per your previous post), and DSCP 24 corresponds to traffic class 96: the DSCP occupies the upper six bits of the traffic class byte, so 24 × 4 = 96. You need to make sure the application marks the packets correctly (with the option I described).
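
For other DSCP values, the same arithmetic can be written down as a small helper. This is only an illustration of the mapping discussed in this thread: the priority rule (eight consecutive DSCP values per priority) is taken from the dscp2prio table posted earlier and would differ if that table were changed.

```python
def dscp_to_tclass(dscp):
    """Traffic class byte for a DSCP value: DSCP sits in the upper 6 bits."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in the range 0..63")
    return dscp << 2


def tclass_to_dscp(tclass):
    """Inverse of dscp_to_tclass; the low 2 bits (ECN field) are discarded."""
    if not 0 <= tclass <= 255:
        raise ValueError("traffic class must be in the range 0..255")
    return tclass >> 2


def dscp_to_priority(dscp):
    """Priority under the dscp2prio table shown in this thread (8 DSCPs per priority)."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in the range 0..63")
    return dscp // 8


print(dscp_to_tclass(24), dscp_to_priority(24))  # prints: 96 3
```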

Regards,
Yaniv

Hello Yanivserlin,
unfortunately, the link that you provided does not seem to work anymore, can you point to an updated link?

Specifically, the link that mentions how to turn off this cnp behavior in cx6:
https://mellanox.my.site.com/mellanoxcommunity/s/article/How-to-Enable-Disable-Lossy-RoCE-Accelerations

Also, just making sure that I got this right: ConnectX-6 will send CNP even if the switch does not set ECN bits, right?
So, CX6 determines whether to send CNP packets based not just on the switch setting ECN bits, but also on some internal HW counter that tracks things like out-of-order sequences?

Thanks in advance

https://nvcrm.my.site.com/ESPCommunity/s/article/How-to-Enable-Disable-Lossy-RoCE-Accelerations
You may need to be registered user and logged in.
Yes in regards to the second question.
Regards,
Yaniv

Hi Yanivserlin,
Thanks so much for your answer.
I was able to access another forum post on enable lossy roce accelerations: Error checking lossy RoCE acceleration state - #2 by MvB

and the guide instructed to set/unset the registers / flags:
roce_adp_retrans_en, roce_tx_window_en, roce_slow_restart_en

However, I was unable to find an explanation of what each of the flags meant and how they interact to enable / disable lossy roce accelerations.

  1. Is there a comprehensive document that outlines what flags / registers are available and what they mean?
  2. My understanding of the three flags is:
  • roce_adp_retrans_en: enable “adaptive” retransmission where adaptive refers to the non ecn triggered cnp packets
  • roce_tx_window_en: some window based congestion control?
  • roce_slow_restart_en: some flag to set the “rate control”
    But due to insufficient documentation that I could find, my understanding is still very limited.

It would be of tremendous help if you could point me to the right documentation or explain what those features are.

Thank you very much

Hi,
There is no comprehensive document on these features. The set of features supported depends on which HW device is used and the FW version.
As for the ones you have asked about, let me give a short description:
roce_adp_retrans_en – How to change ack timeout value in CX5 - #2 by yanivserlin
roce_tx_window_en – Per QP window based control to limit the number of packets in-the-air, thus limit the max number of Go-Back-N retransmissions.
roce_slow_restart_en – We monitor the link and if we observe that there were dropped packets we assume it was due to congestion. We will then have an internal mechanism to reduce the rate of the outgoing traffic and by doing so we will start slow to allow the network to recover and increase the rate gradually.
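
To illustrate why a per-QP window bounds Go-Back-N retransmissions, here is a toy calculation. It is not the firmware algorithm: the function names and the window size of 64 are made up for the example. With Go-Back-N, a lost packet forces a resend of that packet and everything sent after it, so capping the packets in flight caps the worst-case resend.

```python
def go_back_n_resend_count(in_flight, lost_position):
    """Packets resent when the packet at lost_position (0-based) of an
    in-flight burst is lost: that packet and every packet after it."""
    if not 0 <= lost_position < in_flight:
        raise ValueError("lost_position must be inside the in-flight burst")
    return in_flight - lost_position


def worst_case_resend(offered, window=None):
    """Worst case (first in-flight packet lost), with an optional window cap."""
    in_flight = offered if window is None else min(offered, window)
    return go_back_n_resend_count(in_flight, 0)


print(worst_case_resend(1000))             # prints: 1000
print(worst_case_resend(1000, window=64))  # prints: 64
```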
Regards,
Yaniv