[PFC+CC doesn't work] Enabling PFC disables DCQCN

Hello all

I have several ConnectX-6 Dx NICs (106ANCDAT) connected to a Cisco switch (N5624Q), and I am testing some RoCEv2 congestion scenarios.

At first, I enabled PFC on the Cisco switch but did not enable PFC on the NICs.

As a result, the NICs were never paused, but they controlled their sending rate with CNPs and DCQCN (I could see the CNP count increasing in /sys/class/infiniband/mlx5_0/ports/1/hw_counters/rp_cnp_handled).

However, when I enabled PFC on the NIC with DSCP trust (i.e., mlnx_qos -i ens4f0np0 --pfc=0,0,0,1,0,0,0,0 --trust=dscp), CNPs stopped arriving altogether (the Notification Point does not even send them).

Pause frames do arrive at the NIC, though, and are counted in rx_prio3_pause.

Since the PFC threshold is higher than the threshold that triggers CNPs, I expected multiple CNPs to arrive as well, but that did not happen.

The strange thing is that with mlnx_qos -i ens4f0np0 --pfc=0,0,0,1,0,0,0,0 --trust=pcp, both CNPs and pause frames arrive, but every packet is counted in rx_prio0 and tx_prio0, not prio3 or prio6.

So pause frames are generated for prio3, but they cannot act on the real data flow, which is on prio0.
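
To see at a glance which priority the traffic and pause frames actually land on, the ethtool statistics can be filtered down to the non-zero per-priority counters. This is only a sketch: the helper name nonzero_prio_counters is made up for illustration, and the rx_prioN_*/tx_prioN_* naming pattern is assumed from the rx_prio3_pause counter mentioned above.

```shell
# Print only the per-priority counters that are non-zero.
# Expects `ethtool -S <iface>` output on stdin. The rx_prioN_* / tx_prioN_*
# name pattern is an assumption based on rx_prio3_pause above.
nonzero_prio_counters() {
    awk -F': *' '
        /^ *[rt]x_prio[0-7]_/ && $2 + 0 > 0 {
            gsub(/^ +/, "", $1)   # strip the ethtool indentation
            print $1, $2
        }'
}

# Example (interface name taken from this thread):
#   ethtool -S ens4f0np0 | nonzero_prio_counters
```

Running it once with --trust=dscp and once with --trust=pcp should make the prio0 versus prio3 difference described above easy to compare.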

Is there a known bug or issue with PFC+DCQCN?

Any help will be super appreciated

  • current output of mlnx_qos
$ sudo mlnx_qos -i ens4f0np0 --trust=dscp
DCBX mode: OS controlled
Priority trust state: dscp
dscp2prio mapping:
        prio:0 dscp:07,06,05,04,03,02,01,00,
        prio:1 dscp:15,14,13,12,11,10,09,08,
        prio:2 dscp:23,22,21,20,19,18,17,16,
        prio:3 dscp:31,30,29,28,27,26,25,24,
        prio:4 dscp:39,38,37,36,35,34,33,32,
        prio:5 dscp:47,46,45,44,43,42,41,40,
        prio:6 dscp:55,54,53,52,51,50,49,48,
        prio:7 dscp:63,62,61,60,59,58,57,56,
default priority:
Receive buffer size (bytes): 20016,156096,0,0,0,0,0,0,max_buffer_size=1027728
Cable len: 7
PFC configuration:
        priority    0   1   2   3   4   5   6   7
        enabled     0   0   0   1   0   0   0   0   
        buffer      0   0   0   1   0   0   0   0   
tc: 0 ratelimit: unlimited, tsa: vendor
         priority:  1
tc: 1 ratelimit: unlimited, tsa: vendor
         priority:  0
tc: 2 ratelimit: unlimited, tsa: vendor
         priority:  2
tc: 3 ratelimit: unlimited, tsa: vendor
         priority:  3
tc: 4 ratelimit: unlimited, tsa: vendor
         priority:  4
tc: 5 ratelimit: unlimited, tsa: vendor
         priority:  5
tc: 6 ratelimit: unlimited, tsa: vendor
         priority:  6
tc: 7 ratelimit: unlimited, tsa: vendor
         priority:  7

Best regards
Taekyoung

RoCE is an end-to-end solution, which means it has to be configured on the switch as well as on the NIC.

We are not familiar with other vendors’ switches, but on an NVIDIA switch ECN has to be configured per priority before DCQCN can work.

Hello @xiaofengl

Thanks for your comment.

We configured ECN and PFC on the Cisco switch following the NVIDIA guidance (ESPCommunity).

Every DCQCN operation works well without DSCP-based PFC. Even PCP-based PFC + DCQCN works, with one small malfunction: it pauses prio3, but all data arrives on prio0.

So I suspect this is a NIC problem.

Here is a summary of our experiments:

  • w/o PFC enabled via mlnx_qos: DCQCN works (CNP counter increases); pause frames occur in the fabric (not at the NIC)
  • w/ DSCP-based PFC enabled via mlnx_qos on prio 3: DCQCN does not work (CNP counter does not change); only PFC works at the NIC
  • w/ PCP-based PFC enabled via mlnx_qos on prio 3: DCQCN works (CNP counter increases); PFC works only on prio3, but there are no packets on prio3 because every packet is on prio0

Therefore, DCQCN should also work with the other priority group, but it did not.

We also confirmed that ECN is enabled in roce_np and roce_rp for every priority.

Could the OFED or FW version affect this issue?

Hi, I just found out something new:
PFC+DCQCN (CNP operation) works when I disable PFC on only some of the nodes and enable PFC on the rest (option 3 below).

That is, we tried 3 options that give different results:

  1. When all nodes set with (mlnx_qos -i ens4f0np0 --pfc=0,0,0,1,0,0,0,0 --trust=dscp):
    Actual data is transmitted on prio3, pause is working on prio3, and CNP is not transmitted on prio6

  2. When all nodes set with (mlnx_qos -i ens4f0np0 --pfc=0,0,0,0,0,0,0,0 --trust=dscp):
    Actual data is transmitted on prio3, pause is not working, and CNPs are sent and received on prio6

  3. When some nodes are set with (--pfc=0,0,0,1,0,0,0,0) and the others set with (--pfc=0,0,0,0,0,0,0,0):
    The nodes with pfc=0,0,0,1,0,0,0,0 show both pause frames and CNPs.

I wonder if it’s something related to ECE negotiation. Do you have any idea why?

Hi,
The answer is more complex than it seems.
It might be related to FW version. Which one do you use?
It might be related to RoCE accelerations being enabled (which you may not be aware of). Please check [1] for how to review those (slow_restart). This feature generates artificial CNPs (not due to congestion) to identify packet drops and convey them to the sender. If you have packet drops, these CNPs are sent and handled, so the feature takes effect when PFC is not enabled. If PFC is enabled, I do not expect these CNPs to be present.
#3 is NOT a valid test.
I suggest opening a case with Nvidia technical support if the above does not answer your questions.
Regards,
Yaniv

[1] https://enterprise-support.nvidia.com/s/article/How-to-Enable-Disable-Lossy-RoCE-Accelerations

Thanks so much for your help,
Currently my FW version is 22.35.2000.
It seems that I don’t have access to view the link [1] you posted. Sorry, but could you please provide it in some other format (such as PDF)?

Seems like the server is down.
See below how to read current state and disable slow_restart.
Regards,
Yaniv

mlxreg -d 0c:00.0 --yes --reg_name ROCE_ACCL --get

mlxreg -d 0c:00.0 --yes --reg_name ROCE_ACCL --set "roce_slow_restart_en=0x0"

I set roce_slow_restart_en=0x0, and with the mlnx_qos settings (--pfc=0,0,0,1,0,0,0,0 --trust=dscp) I saw that roce_slow_restart_cnp indeed stopped increasing.

Still, neither np_cnp_sent nor rp_cnp_handled increments, and ethtool doesn’t show any tx on prio6 (which is the priority we set for CNP packets).

That is what I would expect.
You do NOT have congestion in the network. The switch is NOT marking the ECN bits, and that is why CNPs are NOT sent from the receiver (on prio6).
It seems to me that the receiver is slow, so backpressure builds up and propagates back into the network.
Do you really have congestion in your environment (e.g., 2 senders to 1 receiver)?
To identify whether you have congestion in the switch, review this counter [1] on the receiver.
You can also read more explanations in this post: Connectx-6 DX card sending CNP even when there is no ECN marked ROCE traffic from switches
Regards,
Yaniv

/sys/class/infiniband/<mlx5_X>/ports/1/hw_counters/np_ecn_marked_roce_packets
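
One way to check this during a test run is to snapshot the relevant counters before and after the traffic and compare them. The sketch below uses only counter names that appear in this thread; the helper names snapshot and deltas, the before.txt/after.txt file names, and the default mlx5_0 directory are assumptions for illustration.

```shell
# Counter names are taken from this thread; the default directory is an assumption.
HW_DIR="${HW_DIR:-/sys/class/infiniband/mlx5_0/ports/1/hw_counters}"
COUNTERS="np_ecn_marked_roce_packets np_cnp_sent rp_cnp_handled"

# Print "name value" for each counter in directory $1 (0 if unreadable).
snapshot() {
    for c in $COUNTERS; do
        if [ -r "$1/$c" ]; then v=$(cat "$1/$c"); else v=0; fi
        printf '%s %s\n' "$c" "$v"
    done
}

# Print "name delta" given two snapshot files: $1 = before, $2 = after.
deltas() {
    awk 'NR == FNR { before[$1] = $2; next } { print $1, $2 - before[$1] }' "$1" "$2"
}

# Example:
#   snapshot "$HW_DIR" > before.txt
#   ... run the congestion test ...
#   snapshot "$HW_DIR" > after.txt
#   deltas before.txt after.txt
```

If np_ecn_marked_roce_packets stays at a delta of 0 on the receiver while pause counters grow, that points at the switch not marking rather than at the NIC.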

Hi, @yanivserlin Thanks for quick and kind reply.

Yes, we do have congestion scenarios with overlapping traffic (there are also PFC pause frames on the switches and NICs).

Also, our switch is configured with Kmin<Kmax<Xon<Xoff (the latter two are the PFC resume/pause thresholds).

So I believe that if pause frames are present, there must also be ECN marking and CNP generation. What do you think?
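
The reasoning behind that expectation can be sketched with a simple marking model: no marking below Kmin, a linear ramp up to Pmax at Kmax, and every packet marked above Kmax. All numbers below are hypothetical and not taken from our switch config, and the sketch assumes the ECN and PFC thresholds are measured against the same queue. That may not hold on a given switch, since on many switches ECN thresholds apply to the egress queue and PFC thresholds to the ingress buffer.

```shell
# Hypothetical thresholds, all in the same unit (e.g. KB); not from any real config.
KMIN=100; KMAX=400; PMAX=20      # ECN: ramp start, ramp end, marking % at KMAX
XON=600;  XOFF=800               # PFC: resume threshold, pause threshold

# Print the ECN marking percentage at queue depth $1 under this model.
ecn_mark_percent() {
    if   [ "$1" -lt "$KMIN" ]; then echo 0
    elif [ "$1" -gt "$KMAX" ]; then echo 100
    else echo $(( PMAX * ($1 - KMIN) / (KMAX - KMIN) ))
    fi
}

ecn_mark_percent 250        # prints 10 (halfway up the ramp)
ecn_mark_percent "$XOFF"    # prints 100 (Xoff lies above Kmax)
```

Under these assumptions, a queue deep enough to trigger a pause at Xoff has already passed Kmax, so marked packets would be expected before any pause frame.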

I will keep watching np_ecn_marked_roce_packets and will be in touch.

Also, does rp_cnp_handled count both “real” CNPs caused by congestion and “artificial” CNPs generated by ROCE_ACCL?

And is PFC on prio 3 + DCQCN the generally recommended solution, or is it better to use just one of them?

Best Regards
Taekyoung

The switch is responsible for identifying congestion, and that is visible in np_ecn_marked_roce_packets. If you do not see it increasing, either there is no congestion or the switch is not configured correctly.
rp_cnp_handled counts both.
PFC + DCQCN is one way to provision RoCE. Different customers do different things: PFC+CC, PFC only, CC only, PFC on first-layer switches + CC, or nothing in the network with handling in the application (usually used for long-distance RoCE). Each has implications and constraints, so it is hard for me to recommend one without prior knowledge of your network/scale/application/…
The most basic recommendation would be PFC+CC.
Regards,
Yaniv

Hi @yanivserlin @xiaofengl

Here are our observations.

First, we believed ECN was enabled on the NICs because it is enabled for every priority, as shown below:

$ sudo cat /sys/class/net/ens4f0np0/ecn/roce_rp/enable/*
1
1
1
1
1
1
1
1
$ sudo cat /sys/class/net/ens4f0np0/ecn/roce_np/enable/*
1
1
1
1
1
1
1
1
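
Because cat with a glob prints the values without labels, a small loop makes it clearer which priority each line belongs to. This is a sketch only: the helper name show_ecn_enable is made up, and it assumes each file under the enable directory is named after its priority (0 to 7).

```shell
# Print "<priority>: <value>" for every file in an ECN enable directory ($1).
# Assumes each file is named after the priority it controls.
show_ecn_enable() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        printf '%s: %s\n' "$(basename "$f")" "$(cat "$f")"
    done
}

# Example (paths taken from the commands above):
#   show_ecn_enable /sys/class/net/ens4f0np0/ecn/roce_rp/enable
#   show_ecn_enable /sys/class/net/ens4f0np0/ecn/roce_np/enable
```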

Scenario: 2:1 traffic contention on the same link creates congestion (we can observe PFC pause on the switch).

Case 1) Switch: ECN, PFC enabled / NICs: ECN, RoCE_ACCL enabled
Packet drops at the switch port and congestion collapse (bandwidth drops to almost zero) occurred.
No congestion control behavior on the NICs and no np_ecn_marked_roce_packets, but there are some RoCE_ACCL counters such as roce_slow_restart_cnps.

Case 2) Switch: ECN, PFC enabled / NICs: ECN enabled, RoCE_ACCL disabled
Packet drops at the switch port and congestion collapse occurred.
No congestion control behavior on the NICs, no np_ecn_marked_roce_packets, and no RoCE_ACCL counters such as roce_slow_restart_cnps.

Case 3) Switch: ECN, PFC enabled / NICs: ECN, PFC enabled, RoCE_ACCL disabled
No packet drops at the switch port and no congestion collapse (the 2 flows each obtained half of the bandwidth).
Still no congestion control behavior on the NICs, no np_ecn_marked_roce_packets, and no RoCE_ACCL counters such as roce_slow_restart_cnps.

However, we found that the packets from our NICs carry ECT(0), as shown below:

06:29:38.125779 IP (tos 0x62,ECT(0), ttl 64, id 44056, offset 0, flags [DF], proto UDP (17), length 1068)
    10.10.20.22.50508 > nxc-node0.4791: UDP, length 1040
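
For reference, the tos value in that capture can be split into its DSCP and ECN fields by hand. The sketch below follows the RFC 3168 layout (upper six bits DSCP, lower two bits ECN) and maps the DSCP to a priority as dscp / 8, which matches the default dscp2prio table in the mlnx_qos output earlier in the thread; the helper name decode_tos is made up for illustration.

```shell
# Split an IPv4 TOS byte ($1, decimal or 0x-prefixed hex) into DSCP and ECN,
# then map DSCP to a priority using the default dscp2prio table (dscp / 8).
decode_tos() {
    tos=$(( $1 ))
    dscp=$(( tos >> 2 ))
    case $(( tos & 3 )) in
        0) ecn="Not-ECT" ;;
        1) ecn="ECT(1)" ;;
        2) ecn="ECT(0)" ;;
        3) ecn="CE" ;;
    esac
    echo "dscp=$dscp ecn=$ecn prio=$(( dscp / 8 ))"
}

decode_tos 0x62    # prints: dscp=24 ecn=ECT(0) prio=3
```

So tos 0x62 is DSCP 24 on prio3 with ECT(0). Under RFC 3168 both ECT(0) and ECT(1) mean the sender is ECN-capable, and a congested switch signals congestion by rewriting the field to CE (tos 0x63 for this DSCP).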

According to the Cisco switch documentation, it drops packets with CE(1), ECT(0),

and only forwards ECN-marked packets with ECT(1).

So we believe every problem comes from the ECT(0) set by the end NICs.

How can we change ECT(0) to ECT(1)? That should let DCQCN operate without trouble.

Also, some months ago I did see plenty of np_ecn_marked_roce_packets being received (more than the np_cnp_sent count, as expected). I guess some configuration has changed since then and broken the environment.

I have reinstalled the firmware and OFED several times, and the NIC still does not receive np_ecn_marked_roce_packets normally.

We used tc_wrap.py before. Could that matter in this situation?

Best Regards
Taekyoung

Hi,
I’m not aware that CISCO switches are limited to ECT(1). I have multiple customers using CISCO switches with this config.
tc_wrap.py is related to PCP based PFC and is not required here.
I suggest focusing your experiments on DCQCN only, with everything else disabled. Also, double-check that the switch is configured correctly to mark the Congestion Encountered bit.
Are you using any encapsulation? If so, there are other considerations to take care of.
One thing I noticed is that you use a very old FW version. I suggest upgrading although I do not expect it to address this issue.
If you still encounter issues I suggest opening a case with technical support as we will need more debug data collected from the systems to be analyzed.
Regards,
Yaniv

Thanks, I’ll investigate further.

Regards,
Taekyoung