Hi,
I’m trying to get RoCE going in a small cluster of 10 nodes, each with 2x ConnectX-5 NICs running at 100 GbE through a Cisco switch (sorry, I don’t make the purchasing decisions).
My main problem is that rx_discards_phy increases on the receiver when I run ib_send_bw with 2 hosts funneling into one host. I think this is because PFC isn’t working.
We’ve configured (we think) TC=2 and TC=3 to be PFC / no-drop on the switch. We’ve run mlnx_qos on the hosts; the output is below.
While debugging the main problem, I think I’ve found a smaller one: if I set the service level to 3 with ib_send_bw (-S 3) on the transmitter, tx_prio0_bytes increases in ethtool -S, as shown below. I expected the prio3 byte counters to increase, but they stay stubbornly at zero.
What have I missed?
Receiver command:
ib_send_bw -F -d mlx5_0 -p 18516 --report_gbits -D 1 -c UC -s 30000000 -x 0 --run_infinitely -D 1 -S 3
Transmitter:
ib_send_bw -F -d mlx5_0 -p 18516 --report_gbits -c UC -s 30000000 -x 0 10.25.11.31 --rate_limit 5 -D 10 --rate_limit_type=SW -S 3
$ ethtool -S enp175s0 | grep prio
rx_prio0_bytes: 121058123373117
rx_prio0_packets: 29066118338
tx_prio0_bytes: 230110655124606
tx_prio0_packets: 55254224316
rx_prio1_bytes: 0
rx_prio1_packets: 0
tx_prio1_bytes: 0
tx_prio1_packets: 0
rx_prio2_bytes: 0
rx_prio2_packets: 0
tx_prio2_bytes: 0
tx_prio2_packets: 0
rx_prio3_bytes: 0
rx_prio3_packets: 0
tx_prio3_bytes: 0
tx_prio3_packets: 0
rx_prio4_bytes: 0
rx_prio4_packets: 0
tx_prio4_bytes: 0
tx_prio4_packets: 0
rx_prio5_bytes: 0
rx_prio5_packets: 0
tx_prio5_bytes: 0
tx_prio5_packets: 0
rx_prio6_bytes: 0
rx_prio6_packets: 0
tx_prio6_bytes: 0
tx_prio6_packets: 0
rx_prio7_bytes: 0
rx_prio7_packets: 0
tx_prio7_bytes: 0
tx_prio7_packets: 0
rx_prio3_pause: 0
rx_prio3_pause_duration: 0
tx_prio3_pause: 23434
tx_prio3_pause_duration: 2568751
rx_prio3_pause_transition: 0
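To see which per-priority counters actually move during a run, it may be easier to diff two `ethtool -S` snapshots than to read the running totals. Below is a minimal Python 3 sketch, assuming the `name: value` line format shown above. `parse_counters`, `counter_deltas` and `snapshot` are hypothetical helper names for illustration, not part of any Mellanox tool.

```python
import subprocess


def parse_counters(text):
    """Parse `ethtool -S` style output ("name: value" per line) into a dict."""
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(":")
        value = value.strip()
        # Skip headers such as "NIC statistics:" and anything non-numeric.
        if sep and value.isdigit():
            counters[name.strip()] = int(value)
    return counters


def counter_deltas(before, after, substring="prio"):
    """Return the counters whose name contains `substring` and whose value changed."""
    return {
        name: after[name] - before.get(name, 0)
        for name in sorted(after)
        if substring in name and after[name] != before.get(name, 0)
    }


def snapshot(interface):
    """Take a live snapshot; assumes ethtool is installed on the host."""
    out = subprocess.run(
        ["ethtool", "-S", interface], check=True, capture_output=True, text=True
    ).stdout
    return parse_counters(out)
```

A possible usage: call `snapshot("enp175s0")` before and after a 10-second ib_send_bw run, then pass both results to `counter_deltas` to list only the prio counters that changed.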
On the receiver, QoS looks like this:
$ mlnx_qos -i enp175s0
/usr/bin/mlnx_qos:545: DeprecationWarning: fromstring() is deprecated. Use frombytes() instead.
prio2buffer, buffer_size, tot_size = ctrl.get_ieee_dcb_buffer()
/usr/bin/mlnx_qos:637: DeprecationWarning: fromstring() is deprecated. Use frombytes() instead.
prio_tc, tsa, tc_bw = ctrl.get_ieee_ets()
/usr/bin/mlnx_qos:638: DeprecationWarning: fromstring() is deprecated. Use frombytes() instead.
pfc_en = ctrl.get_ieee_pfc_en()
/usr/bin/mlnx_qos:639: DeprecationWarning: fromstring() is deprecated. Use frombytes() instead.
pfc_delay = ctrl.get_ieee_pfc_delay()
DCBX mode: OS controlled
Priority trust state: pcp
default priority:
Receive buffer size (bytes): 262016,262016,0,0,0,0,0,0,total_size=524160
Cable len: 7
PFC configuration:
priority 0 1 2 3 4 5 6 7
enabled 0 0 1 1 0 0 0 0
buffer 0 0 1 1 0 0 0 0
tc: 0 ratelimit: unlimited, tsa: strict
priority: 0
tc: 1 ratelimit: unlimited, tsa: strict
priority: 1
tc: 2 ratelimit: unlimited, tsa: strict
priority: 2
tc: 3 ratelimit: unlimited, tsa: strict
priority: 3
tc: 4 ratelimit: unlimited, tsa: strict
priority: 4
tc: 5 ratelimit: unlimited, tsa: strict
priority: 5
tc: 6 ratelimit: unlimited, tsa: strict
priority: 6
tc: 7 ratelimit: unlimited, tsa: strict
priority: 7
On the sender, QoS looks like this:
$ mlnx_qos -i enp175s0
/usr/bin/mlnx_qos:545: DeprecationWarning: fromstring() is deprecated. Use frombytes() instead.
prio2buffer, buffer_size, tot_size = ctrl.get_ieee_dcb_buffer()
/usr/bin/mlnx_qos:637: DeprecationWarning: fromstring() is deprecated. Use frombytes() instead.
prio_tc, tsa, tc_bw = ctrl.get_ieee_ets()
/usr/bin/mlnx_qos:638: DeprecationWarning: fromstring() is deprecated. Use frombytes() instead.
pfc_en = ctrl.get_ieee_pfc_en()
/usr/bin/mlnx_qos:639: DeprecationWarning: fromstring() is deprecated. Use frombytes() instead.
pfc_delay = ctrl.get_ieee_pfc_delay()
DCBX mode: OS controlled
Priority trust state: pcp
default priority:
Receive buffer size (bytes): 262016,262016,0,0,0,0,0,0,total_size=524160
Cable len: 7
PFC configuration:
priority 0 1 2 3 4 5 6 7
enabled 0 0 0 1 0 0 0 0
buffer 0 0 0 1 0 0 0 0
tc: 0 ratelimit: unlimited, tsa: strict
priority: 0
tc: 1 ratelimit: unlimited, tsa: strict
priority: 1
tc: 2 ratelimit: unlimited, tsa: strict
priority: 2
tc: 3 ratelimit: unlimited, tsa: strict
priority: 3
tc: 4 ratelimit: unlimited, tsa: strict
priority: 4
tc: 5 ratelimit: unlimited, tsa: strict
priority: 5
tc: 6 ratelimit: unlimited, tsa: strict
priority: 6
tc: 7 ratelimit: unlimited, tsa: strict
priority: 7
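Comparing the PFC `enabled` rows by eye is error-prone (in the two outputs above they differ at priority 2: set on the receiver, clear on the sender). Below is a minimal Python sketch for checking this across hosts, assuming the `enabled 0 0 1 1 ...` row format that mlnx_qos prints above. `pfc_enabled` and `pfc_mismatch` are hypothetical helper names.

```python
def pfc_enabled(mlnx_qos_output):
    """Return the set of priorities flagged 1 in the PFC 'enabled' row."""
    for line in mlnx_qos_output.splitlines():
        fields = line.split()
        if fields and fields[0] == "enabled":
            # Column position after the label is the priority number (0..7).
            return {prio for prio, flag in enumerate(fields[1:]) if flag == "1"}
    raise ValueError("no 'enabled' row found in mlnx_qos output")


def pfc_mismatch(output_a, output_b):
    """Priorities where PFC is enabled on exactly one of the two hosts."""
    return pfc_enabled(output_a) ^ pfc_enabled(output_b)
```

Fed the receiver and sender outputs above as strings, `pfc_mismatch` would report priority 2 as the only difference.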
$ uname -a
Linux seren-01 4.19.0-20-amd64 #1 SMP Debian 4.19.235-1 (2022-03-17) x86_64 GNU/Linux
$ ofed_info
MLNX_OFED_LINUX-5.5-1.0.3.2 (OFED-5.5-1.0.3)
...
The tc_wrap.py output looks like this:
sudo /usr/bin/python2.7 `which tc_wrap.py` ./tc_wrap.py -i enp175s0 -u 3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3
skprio2up is available only for RoCE in kernels that don't support set_egress_map
Traffic classes are set to 8
UP 0
UP 1
UP 2
UP 3
skprio: 0
skprio: 1
skprio: 2 (tos: 8)
skprio: 3
skprio: 4 (tos: 24)
skprio: 5
skprio: 6 (tos: 16)
skprio: 7
skprio: 8
skprio: 9
skprio: 10
skprio: 11
skprio: 12
skprio: 13
skprio: 14
skprio: 15
UP 4
UP 5
UP 6
UP 7
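In the listing above every skprio (0 to 15) appears under UP 3 and the other UPs are empty. To check that mapping programmatically, here is a minimal Python sketch, assuming the `UP n` / `skprio: n` layout printed above. `skprio_to_up` is a hypothetical helper name, not part of tc_wrap.py.

```python
import re


def skprio_to_up(tc_wrap_output):
    """Build a {skprio: user_priority} dict from tc_wrap.py style output."""
    mapping = {}
    current_up = None
    for raw in tc_wrap_output.splitlines():
        line = raw.strip()
        up = re.fullmatch(r"UP\s+(\d+)", line)
        if up:
            # Subsequent skprio lines belong to this user priority.
            current_up = int(up.group(1))
            continue
        sk = re.match(r"skprio:\s*(\d+)", line)
        if sk and current_up is not None:
            mapping[int(sk.group(1))] = current_up
    return mapping
```

With the output above as input, every key from 0 to 15 should map to 3.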