Setting service level in ib_send_bw doesn't increase expected priority counters in ethtool

Hi,

I’m trying to get RoCE going in a small cluster of 10 nodes, each with 2x ConnectX-5 NICs running at 100 GbE through a Cisco switch (sorry, I don’t make the purchasing decisions).

My main problem is that rx_discards_phy increases when I run ib_send_bw with 2 hosts funneling into one host. I think this is because PFC isn’t working.

We’ve configured (we think) TC=2 and TC=3 to be PFC / no drop on the switch. We’ve run mlnx_qos as below.

While debugging the main problem, I think I’ve found a smaller one: if I set the service level to 3 with ib_send_bw (-S 3) on the transmitter, tx_prio0_bytes increases in ethtool -S, as shown below. I expected tx_prio3_bytes to increase, but it stays stubbornly at zero.

What have I missed?

Receiver command:

ib_send_bw -F -d mlx5_0 -p 18516 --report_gbits -D 1 -c UC -s 30000000 -x 0 --run_infinitely -D 1 -S 3

Transmitter:

ib_send_bw -F -d mlx5_0 -p 18516 --report_gbits -c UC -s 30000000 -x 0 10.25.11.31 --rate_limit 5 -D 10 --rate_limit_type=SW -S 3

$ ethtool -S enp175s0 | grep prio
     rx_prio0_bytes: 121058123373117
     rx_prio0_packets: 29066118338
     tx_prio0_bytes: 230110655124606
     tx_prio0_packets: 55254224316
     rx_prio1_bytes: 0
     rx_prio1_packets: 0
     tx_prio1_bytes: 0
     tx_prio1_packets: 0
     rx_prio2_bytes: 0
     rx_prio2_packets: 0
     tx_prio2_bytes: 0
     tx_prio2_packets: 0
     rx_prio3_bytes: 0
     rx_prio3_packets: 0
     tx_prio3_bytes: 0
     tx_prio3_packets: 0
     rx_prio4_bytes: 0
     rx_prio4_packets: 0
     tx_prio4_bytes: 0
     tx_prio4_packets: 0
     rx_prio5_bytes: 0
     rx_prio5_packets: 0
     tx_prio5_bytes: 0
     tx_prio5_packets: 0
     rx_prio6_bytes: 0
     rx_prio6_packets: 0
     tx_prio6_bytes: 0
     tx_prio6_packets: 0
     rx_prio7_bytes: 0
     rx_prio7_packets: 0
     tx_prio7_bytes: 0
     tx_prio7_packets: 0
     rx_prio3_pause: 0
     rx_prio3_pause_duration: 0
     tx_prio3_pause: 23434
     tx_prio3_pause_duration: 2568751
     rx_prio3_pause_transition: 0
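
Since these counters are cumulative, a single dump does not show which priority a given test run actually used. A minimal sketch of one way to check, assuming Python 3 is available on the host: take a snapshot before and after the run and print only the per-priority counters that changed. The helper names (snapshot, parse_prio_counters, counter_deltas) are made up for illustration and are not part of any Mellanox tool.

```python
import re
import subprocess

# Matches lines such as "     tx_prio3_bytes: 0" from `ethtool -S`.
PRIO_RE = re.compile(r"^\s*((?:rx|tx)_prio\d_\w+):\s*(\d+)\s*$")


def snapshot(iface):
    """Run `ethtool -S <iface>` and return its raw text output."""
    result = subprocess.run(
        ["ethtool", "-S", iface], capture_output=True, text=True, check=True
    )
    return result.stdout


def parse_prio_counters(text):
    """Return {counter_name: value} for every per-priority counter line."""
    counters = {}
    for line in text.splitlines():
        match = PRIO_RE.match(line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def counter_deltas(before, after):
    """Return only the counters whose value changed between two snapshots."""
    return {
        name: value - before.get(name, 0)
        for name, value in after.items()
        if value != before.get(name, 0)
    }


if __name__ == "__main__":
    iface = "enp175s0"  # interface name taken from the output above
    before = parse_prio_counters(snapshot(iface))
    input("Run the ib_send_bw test now, then press Enter... ")
    after = parse_prio_counters(snapshot(iface))
    for name, delta in sorted(counter_deltas(before, after).items()):
        print(f"{name}: +{delta}")
```

If the service level were reaching the wire as priority 3, the tx_prio3_* counters would be expected to appear in the printed deltas; here only the prio0 counters move.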

On the receiver, QoS looks like this:

$ mlnx_qos -i enp175s0
/usr/bin/mlnx_qos:545: DeprecationWarning: fromstring() is deprecated. Use frombytes() instead.
  prio2buffer, buffer_size, tot_size = ctrl.get_ieee_dcb_buffer()
/usr/bin/mlnx_qos:637: DeprecationWarning: fromstring() is deprecated. Use frombytes() instead.
  prio_tc, tsa, tc_bw = ctrl.get_ieee_ets()
/usr/bin/mlnx_qos:638: DeprecationWarning: fromstring() is deprecated. Use frombytes() instead.
  pfc_en = ctrl.get_ieee_pfc_en()
/usr/bin/mlnx_qos:639: DeprecationWarning: fromstring() is deprecated. Use frombytes() instead.
  pfc_delay = ctrl.get_ieee_pfc_delay()
DCBX mode: OS controlled
Priority trust state: pcp
default priority:
Receive buffer size (bytes): 262016,262016,0,0,0,0,0,0,total_size=524160
Cable len: 7
PFC configuration:
	priority    0   1   2   3   4   5   6   7
	enabled     0   0   1   1   0   0   0   0
	buffer      0   0   1   1   0   0   0   0
tc: 0 ratelimit: unlimited, tsa: strict
	 priority:  0
tc: 1 ratelimit: unlimited, tsa: strict
	 priority:  1
tc: 2 ratelimit: unlimited, tsa: strict
	 priority:  2
tc: 3 ratelimit: unlimited, tsa: strict
	 priority:  3
tc: 4 ratelimit: unlimited, tsa: strict
	 priority:  4
tc: 5 ratelimit: unlimited, tsa: strict
	 priority:  5
tc: 6 ratelimit: unlimited, tsa: strict
	 priority:  6
tc: 7 ratelimit: unlimited, tsa: strict
	 priority:  7

On the sender, QoS looks like this:

$ mlnx_qos -i enp175s0
/usr/bin/mlnx_qos:545: DeprecationWarning: fromstring() is deprecated. Use frombytes() instead.
  prio2buffer, buffer_size, tot_size = ctrl.get_ieee_dcb_buffer()
/usr/bin/mlnx_qos:637: DeprecationWarning: fromstring() is deprecated. Use frombytes() instead.
  prio_tc, tsa, tc_bw = ctrl.get_ieee_ets()
/usr/bin/mlnx_qos:638: DeprecationWarning: fromstring() is deprecated. Use frombytes() instead.
  pfc_en = ctrl.get_ieee_pfc_en()
/usr/bin/mlnx_qos:639: DeprecationWarning: fromstring() is deprecated. Use frombytes() instead.
  pfc_delay = ctrl.get_ieee_pfc_delay()
DCBX mode: OS controlled
Priority trust state: pcp
default priority:
Receive buffer size (bytes): 262016,262016,0,0,0,0,0,0,total_size=524160
Cable len: 7
PFC configuration:
	priority    0   1   2   3   4   5   6   7
	enabled     0   0   0   1   0   0   0   0
	buffer      0   0   0   1   0   0   0   0
tc: 0 ratelimit: unlimited, tsa: strict
	 priority:  0
tc: 1 ratelimit: unlimited, tsa: strict
	 priority:  1
tc: 2 ratelimit: unlimited, tsa: strict
	 priority:  2
tc: 3 ratelimit: unlimited, tsa: strict
	 priority:  3
tc: 4 ratelimit: unlimited, tsa: strict
	 priority:  4
tc: 5 ratelimit: unlimited, tsa: strict
	 priority:  5
tc: 6 ratelimit: unlimited, tsa: strict
	 priority:  6
tc: 7 ratelimit: unlimited, tsa: strict
	 priority:  7
$ uname -a
Linux seren-01 4.19.0-20-amd64 #1 SMP Debian 4.19.235-1 (2022-03-17) x86_64 GNU/Linux
$ ofed_info
MLNX_OFED_LINUX-5.5-1.0.3.2 (OFED-5.5-1.0.3)
...
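
Note that the two mlnx_qos dumps above are not identical: the receiver has PFC enabled on priorities 2 and 3, the sender only on priority 3. A small sketch for spotting that kind of difference across nodes, assuming the output format shown above (a "priority" row followed by an "enabled" row). The function name pfc_enabled_priorities is hypothetical, and the two sample strings are trimmed copies of the tables above.

```python
def pfc_enabled_priorities(mlnx_qos_output):
    """Return the set of priorities flagged 1 in the 'enabled' row of the PFC table."""
    prios, flags = [], []
    for line in mlnx_qos_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        # The table header row starts with the bare word "priority";
        # the per-TC lines start with "priority:" and are skipped.
        if fields[0] == "priority":
            prios = fields[1:]
        elif fields[0] == "enabled":
            flags = fields[1:]
    return {int(p) for p, f in zip(prios, flags) if f == "1"}


receiver = """PFC configuration:
	priority    0   1   2   3   4   5   6   7
	enabled     0   0   1   1   0   0   0   0
	buffer      0   0   1   1   0   0   0   0
"""

sender = """PFC configuration:
	priority    0   1   2   3   4   5   6   7
	enabled     0   0   0   1   0   0   0   0
	buffer      0   0   0   1   0   0   0   0
"""

rx = pfc_enabled_priorities(receiver)
tx = pfc_enabled_priorities(sender)
print("receiver:", sorted(rx))
print("sender:  ", sorted(tx))
print("mismatch:", sorted(rx ^ tx))
```

Feeding it the full mlnx_qos output from each node works the same way, because the DeprecationWarning lines and the per-TC "priority:" lines are ignored.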

The tc_wrap.py output looks like this:

sudo /usr/bin/python2.7 `which tc_wrap.py` ./tc_wrap.py -i enp175s0 -u 3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3
skprio2up is available only for RoCE in kernels that don't support set_egress_map
Traffic classes are set to 8
UP  0
UP  1
UP  2
UP  3
        skprio: 0
        skprio: 1
        skprio: 2 (tos: 8)
        skprio: 3
        skprio: 4 (tos: 24)
        skprio: 5
        skprio: 6 (tos: 16)
        skprio: 7
        skprio: 8
        skprio: 9
        skprio: 10
        skprio: 11
        skprio: 12
        skprio: 13
        skprio: 14
        skprio: 15
UP  4
UP  5
UP  6
UP  7

Hello,

Have you validated that PFC has been configured point to point? By that I mean that the switch ports are configured for PFC and not for global pause (GP). Have you checked the switch port counters to see on which priority the traffic is being sent and received? Please verify that PFC is properly enabled on the switch, and validate the traffic priority through both the configuration and the counters.

Sophie.