I’m seeing a behavior where a ConnectX-6 DX card sends CNPs even when there is no ECN-marked RoCE traffic from the switches. I have 2 servers, each with a ConnectX-6 DX card, connected (one port from each server) to a switch. With ib_write_bw, Server2 does RDMA writes to Server1 at 91 Gbps. Server1 occasionally sends CNPs to Server2, but there is no buffer buildup in the switch that would cause ECN marking. Any idea what needs to be checked and tweaked on the server side? I tried RHEL 8.6 and Ubuntu 22.04 with the latest MLNX_OFED installed. Firmware version is 126.96.36.199.
We use the CNP mechanism (in the NIC HW) to reduce the rate when we observe packet drops on the receiver NIC.
In your case, I suspect either Server1 is not fast enough to accept the data (you may want to performance-tune it) or drops are happening somewhere else in the network; the receiver HW identifies those drops and sends a CNP packet to the sender. Server2 sees the CNP packet and uses it to reduce the transmit rate.
This feature can be disabled, but it is a sign that you have a lossy network, and the feature tries to overcome that (and probably does it pretty well, judging by the BW you described). I suspect you do not have PFC enabled, or it is enabled but not taking effect.
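To see whether CNPs are actually being generated and honored, the mlx5 driver exposes RoCE congestion counters through ethtool -S (e.g. np_cnp_sent on the CNP generator and rp_cnp_handled on the CNP receiver). A minimal sketch, snapshotting the counters around a test run and printing the deltas — the interface name eth2 is an assumption:

```shell
# Snapshot the CNP-related ethtool counters (interface name is an assumption)
snap() { ethtool -S "$1" 2>/dev/null | grep -E 'cnp'; }

# Diff two "name: value" snapshots and print per-counter deltas
diff_snaps() {
    awk -F': *' 'NR==FNR { a[$1] = $2; next } { print $1": "($2 - a[$1]) }' "$1" "$2"
}

snap eth2 > /tmp/cnp.before
# ... run the ib_write_bw test here ...
snap eth2 > /tmp/cnp.after
diff_snaps /tmp/cnp.before /tmp/cnp.after
```

A growing np_cnp_sent on Server1 confirms it is generating CNPs, and a growing rp_cnp_handled on Server2 confirms it is reacting to them.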
Thanks for the response, Yaniv. Can you please let me know the following:
1. Which counters should I look at to confirm HW drops? I don’t see any drops in the ethtool output of that interface.
2. If the packets are not dropped in the HW but somewhere in the fabric, how does the ConnectX identify fabric drops without ECN from the fabric? Is this based on missing sequence numbers in the IB header? Are there any counters I can check to confirm that this is what is happening? I’m pretty sure there are no drops in the fabric, but if I can confirm that the ConnectX is identifying fabric drops via some mechanism, knowing how will help me a lot.
3. Are there any commands to disable the feature where the server reacts to CNPs and reduces the rate? Some guidance on how to do that would help me test PAUSE generation.
4. Apart from the receive buffer, DSCP, and priority, what other tuning can I do to make this better?
DCBX mode: OS controlled
Priority trust state: dscp
Receive buffer size (bytes): 130896,130896,0,0,0,0,0,0,max_buffer_size=1027728
Cable len: 7
priority 0 1 2 3 4 5 6 7
enabled 0 0 0 1 0 0 0 0
buffer 0 0 0 1 0 0 0 0
tc: 0 ratelimit: unlimited, tsa: vendor
tc: 1 ratelimit: unlimited, tsa: vendor
tc: 2 ratelimit: unlimited, tsa: vendor
tc: 3 ratelimit: unlimited, tsa: vendor
tc: 4 ratelimit: unlimited, tsa: vendor
tc: 5 ratelimit: unlimited, tsa: vendor
tc: 6 ratelimit: unlimited, tsa: vendor
tc: 7 ratelimit: unlimited, tsa: vendor
I do have PFC enabled, but the servers are reacting to something and reducing the rate, which prevents the switches from even marking ECN (which comes before PAUSE in my configuration).
When packets arrive they carry a sequence number. If one (or more) is dropped in the fabric, we would expect to see increments in:
on the requestor: /sys/class/infiniband/<mlx5_X>/ports/1/hw_counters/packet_seq_err
on the responder: /sys/class/infiniband/<mlx5_X>/ports/1/hw_counters/out_of_sequence
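A small sketch of checking those two counters before and after a run; nonzero growth means the HCA detected missing PSNs in the fabric. The device name mlx5_0 is an assumption — substitute your HCA:

```shell
# Read RoCE sequence-error counters around a test (device name is an assumption)
DEV=${DEV:-mlx5_0}
HW=/sys/class/infiniband/$DEV/ports/1/hw_counters
read_ctr() { cat "$HW/$1" 2>/dev/null || echo 0; }   # prints 0 if the counter is absent

req_before=$(read_ctr packet_seq_err)    # requestor side (Server2)
rsp_before=$(read_ctr out_of_sequence)   # responder side (Server1)
# ... run the ib_write_bw test here ...
req_after=$(read_ctr packet_seq_err)
rsp_after=$(read_ctr out_of_sequence)
echo "packet_seq_err delta:  $((req_after - req_before))"
echo "out_of_sequence delta: $((rsp_after - rsp_before))"
```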
If they are dropped in the HW, review the "ethtool -S" counters (check for any diff before/after the test). There are more counters available internally; for further review I suggest opening a support case.
To disable the CNP handling, please disable all RoCE accelerations (see my post on the Mellanox Interconnect Community).
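As an alternative approach (not the post referenced above), MLNX_OFED builds of the mlx5 driver expose per-priority RoCE congestion-control knobs under /sys/class/net/<ifname>/ecn/: roce_rp (reaction point, rate reduction on received CNPs) and roce_np (notification point, CNP generation). A hedged sketch, assuming that sysfs layout and an interface named eth2 — verify both against your OFED version; it runs in dry-run mode unless APPLY=1:

```shell
# Disable RoCE CC reaction/notification per priority (paths are assumptions)
IFACE=${IFACE:-eth2}
APPLY=${APPLY:-0}   # set APPLY=1 to actually write to sysfs

set_knob() {
    if [ "$APPLY" = "1" ]; then
        echo "$2" > "$1"                  # real write
    else
        echo "would write $2 -> $1"       # dry run: just show the intent
    fi
}

for prio in 0 1 2 3 4 5 6 7; do
    set_knob "/sys/class/net/$IFACE/ecn/roce_rp/enable/$prio" 0  # stop reacting to CNPs
    set_knob "/sys/class/net/$IFACE/ecn/roce_np/enable/$prio" 0  # stop generating CNPs
done
```

Disabling roce_rp on Server2 is what stops the rate reduction; disabling roce_np on Server1 stops the CNPs at the source.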
In order to generate traffic correctly so that PFC will work, please make sure you pass "--tclass=96" on the ib_write_bw command line. PFC is enabled on priority 3 with DSCP 24 (per your previous post), which maps to traffic class 96. You need to make sure the application marks the packets correctly (with the option I described).
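The traffic-class value is simply the DSCP shifted into the IP ToS byte (24 << 2 = 96). A sketch of the invocation — the device name mlx5_0 and hostname server1 are placeholders:

```shell
# DSCP 24 occupies the upper 6 bits of the ToS byte, so tclass = DSCP << 2
TCLASS=$((24 << 2))
echo "using --tclass=$TCLASS"

# Device and hostname below are placeholders; adjust to your setup:
#   on Server1 (responder):  ib_write_bw -d mlx5_0 --tclass=$TCLASS
#   on Server2 (requestor):  ib_write_bw -d mlx5_0 --tclass=$TCLASS server1
```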