PFC with ConnectX-5

I’m trying to get RoCE v1 working with ConnectX-5 100G Ethernet adapters. I have ib_send_bw working with good bandwidth, but things fall apart with OpenMPI jobs running multiple MPI tasks per node, almost certainly because I don’t have flow control working properly yet. These adapters use the mlx5 driver, so the mlx4_en kernel module options (pfctx/pfcrx) don’t appear to be available.
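For comparison, here is the mlx4_en route I was expecting versus the closest mlx5 equivalent I have found, the per-priority flag on mlnx_qos from MLNX_OFED (the pfctx/pfcrx values are quoted from memory, so treat them as an assumption):

# mlx4_en style (NOT available for mlx5/ConnectX-5); pfctx/pfcrx take a
# priority bitmask, e.g. 0x08 = priority 3:
#   modprobe mlx4_en pfctx=0x08 pfcrx=0x08
# mlx5 equivalent, as far as I can tell, is per-priority PFC via mlnx_qos:
mlnx_qos -i eth4 -f 0,0,0,1,0,0,0,0    # enable PFC on priority 3 only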

I’m at a loss how to make progress. If I try to configure things manually, mlnx_qos and ethtool each seem to wipe out the effect of the other:

[me@mine]# mlnx_qos -i eth4 -f 1,1,1,1,1,1,1,1
PFC configuration:
        priority    0   1   2   3   4   5   6   7
        enabled     1   1   1   1   1   1   1   1
tc: 0 ratelimit: unlimited, tsa: vendor
        priority:  1
tc: 1 ratelimit: unlimited, tsa: vendor
        priority:  0
tc: 2 ratelimit: unlimited, tsa: vendor
        priority:  2
tc: 3 ratelimit: unlimited, tsa: vendor
        priority:  3
tc: 4 ratelimit: unlimited, tsa: vendor
        priority:  4
tc: 5 ratelimit: unlimited, tsa: vendor
        priority:  5
tc: 6 ratelimit: unlimited, tsa: vendor
        priority:  6
tc: 7 ratelimit: unlimited, tsa: vendor
        priority:  7

[me@mine]# ethtool -A eth4 rx on
[me@mine]# ethtool -A eth4 tx on
[me@mine]# ethtool -a eth4
Pause parameters for eth4:
Autonegotiate:  off
RX:             on
TX:             on

[me@mine]# mlnx_qos -i eth4
PFC configuration:
        priority    0   1   2   3   4   5   6   7
        enabled     0   0   0   0   0   0   0   0
tc: 0 ratelimit: unlimited, tsa: vendor
        priority:  1
tc: 1 ratelimit: unlimited, tsa: vendor
        priority:  0
tc: 2 ratelimit: unlimited, tsa: vendor
        priority:  2
tc: 3 ratelimit: unlimited, tsa: vendor
        priority:  3
tc: 4 ratelimit: unlimited, tsa: vendor
        priority:  4
tc: 5 ratelimit: unlimited, tsa: vendor
        priority:  5
tc: 6 ratelimit: unlimited, tsa: vendor
        priority:  6
tc: 7 ratelimit: unlimited, tsa: vendor
        priority:  7

[me@mine]# mlnx_qos -i eth4 -f 1,1,1,1,1,1,1,1
PFC configuration:
        priority    0   1   2   3   4   5   6   7
        enabled     1   1   1   1   1   1   1   1
tc: 0 ratelimit: unlimited, tsa: vendor
        priority:  1
tc: 1 ratelimit: unlimited, tsa: vendor
        priority:  0
tc: 2 ratelimit: unlimited, tsa: vendor
        priority:  2
tc: 3 ratelimit: unlimited, tsa: vendor
        priority:  3
tc: 4 ratelimit: unlimited, tsa: vendor
        priority:  4
tc: 5 ratelimit: unlimited, tsa: vendor
        priority:  5
tc: 6 ratelimit: unlimited, tsa: vendor
        priority:  6
tc: 7 ratelimit: unlimited, tsa: vendor
        priority:  7

[me@mine]# ethtool -a eth4
Pause parameters for eth4:
Autonegotiate:  off
RX:             off
TX:             off

Thanks, Sophie.

Yes, I’ve read DOC-2474, and several things are confusing.

First, I never see socket priorities getting mapped to user priorities or traffic classes as shown in the example outputs from mlnx_qos and tc_wrap.py:

mlnx_qos -i eth2 -f 0,0,0,1,0,0,0,0
Priority trust mode: pcp
PFC configuration:
        priority    0   1   2   3   4   5   6   7
        enabled     0   0   0   1   0   0   0   0
tc: 0 ratelimit: unlimited, tsa: vendor
        priority:  0
tc: 1 ratelimit: unlimited, tsa: vendor
        priority:  1
tc: 2 ratelimit: unlimited, tsa: vendor
        priority:  2
tc: 3 ratelimit: unlimited, tsa: vendor
        priority:  3
tc: 4 ratelimit: unlimited, tsa: vendor
        priority:  4
tc: 5 ratelimit: unlimited, tsa: vendor
        priority:  5
tc: 6 ratelimit: unlimited, tsa: vendor
        priority:  6
tc: 7 ratelimit: unlimited, tsa: vendor
        priority:  7

tc_wrap.py -i eth2
Traffic classes are set to 8
UP 0
UP 1
UP 2
UP 3
UP 4
UP 5
UP 6
UP 7

And the “Set Egress Mapping on Kernel Bypass Traffic (RoCE)” section says to use tc_wrap.py to set the RoCE mapping. But doesn’t tc only set mappings for the kernel packet scheduler? RoCE bypasses the kernel, right?
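As far as I can tell, the knobs that do reach kernel-bypass traffic live at the verbs/MPI layer rather than in tc: for RoCE v1 the VLAN PCP is derived from the IB service level (SL) on each queue pair, so the priority has to be chosen by the application itself. A sketch of what I mean (the parameter and variable names are quoted from memory, so treat them as assumptions; ./app is a placeholder):

# OpenMPI openib BTL: put MPI traffic on SL/priority 3
mpirun --mca btl_openib_ib_service_level 3 ./app
# UCX-based runs: same idea via an environment variable
UCX_IB_SL=3 mpirun ./app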

Regardless, if I try to use tc_wrap.py I get the following error message, and my configuration doesn’t appear to take effect:

tc_wrap.py -i eth2 -u 4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4
skprio2up is available only for RoCE in kernels that don’t support set_egress_map
Traffic classes are set to 8
UP 0
UP 1
UP 2
UP 3
UP 4
        skprio: 0
        skprio: 1
        skprio: 2 (tos: 8)
        skprio: 3
        skprio: 4 (tos: 24)
        skprio: 5
        skprio: 6 (tos: 16)
        skprio: 7
        skprio: 8
        skprio: 9
        skprio: 10
        skprio: 11
        skprio: 12
        skprio: 13
        skprio: 14
        skprio: 15
UP 5
UP 6
UP 7

tc_wrap.py -i eth2
Traffic classes are set to 8
UP 0
UP 1
UP 2
UP 3
UP 4
UP 5
UP 6
UP 7

Hi Ricky,

Global Pause is turned on with ethtool -A, while PFC (Priority Flow Control) is configured with mlnx_qos on the host.

You have to choose one or the other, not both at the same time; that is why each tool appears to undo the other’s setting.
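If it helps, here is a minimal sketch of the two mutually exclusive setups on your eth4 (priority 3 is just an example; use whichever priority carries your RoCE traffic):

# Option A: Global Pause only (implicitly disables PFC)
ethtool -A eth4 rx on tx on

# Option B: PFC only, e.g. on priority 3; ethtool -a will then report
# RX/TX pause "off", which is expected rather than a failure
mlnx_qos -i eth4 -f 0,0,0,1,0,0,0,0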

What I would recommend first is to make sure the servers are appropriately tuned (basic startup) according to the community doc:

Getting started with Performance Tuning of Mellanox adapters

https://community.mellanox.com/s/article/getting-started-with-performance-tuning-of-mellanox-adapters

Then I would test again.

Also, you can test first with GP (Global Pause) and then compare the results with PFC.
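To confirm whether pause frames are actually being sent or received during each test, you can watch the port counters; the exact counter names vary with driver version, so the grep below is just a convenient filter:

ethtool -S eth4 | grep -i pause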

You can consult the document below to properly configure PFC on ConnectX adapters (it also applies to ConnectX-5):

https://community.mellanox.com/s/article/howto-configure-pfc-on-connectx-4

Cheers,

Sophie.