SX6036 with VLANs - PFC not working.

I have followed the guide almost to the letter, but somehow I can’t establish a lossless link. Everything works fine with “global flow control”, but PFC doesn’t seem to be working.

I see the receiver host sending pause frames (tx_pause_prio_3: 49732), those frames being received by the switch (66498 pause packets), and pause packets then being generated on the switch’s sender-host interface (24309 pause packets). But when I check the sender’s Linux interface stats, the pause packet counter is zero (rx_pause_prio_3: 0). I’m also seeing 93970 discard packets on the RX side of the sender-host switch interface counters.
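For anyone comparing counters the same way, here is a minimal sketch that pulls the per-priority pause counters out of `ethtool -S <iface>` style text and checks whether pause frames sent by the receiver ever show up on the sender. The counter names and values come from the post above; the function names and the "name: value" line format are assumptions for illustration, not part of any Mellanox tool.

```python
import re

# Matches lines such as "     tx_pause_prio_3: 49732" (assumed ethtool -S layout).
_PAUSE_RE = re.compile(r"\s*((?:rx|tx)_pause_prio_\d+):\s*(\d+)\s*$")


def pause_counters(ethtool_stats: str) -> dict:
    """Return {counter_name: value} for every per-priority pause counter found."""
    counters = {}
    for line in ethtool_stats.splitlines():
        match = _PAUSE_RE.match(line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def pfc_reaches_sender(receiver_stats: str, sender_stats: str, prio: int) -> bool:
    """True only if the receiver sent pause frames AND the sender counted some."""
    tx = pause_counters(receiver_stats).get(f"tx_pause_prio_{prio}", 0)
    rx = pause_counters(sender_stats).get(f"rx_pause_prio_{prio}", 0)
    return tx > 0 and rx > 0


# Values quoted in the post: receiver transmits pauses, sender sees none.
RECEIVER_STATS = "     tx_pause_prio_3: 49732\n"
SENDER_STATS = "     rx_pause_prio_3: 0\n"

if __name__ == "__main__":
    print(pause_counters(RECEIVER_STATS))
    print(pfc_reaches_sender(RECEIVER_STATS, SENDER_STATS, prio=3))
```

In practice the two strings would come from running `ethtool -S` on each host. With the numbers above the check returns False, which matches the symptom described.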

I’m quite perplexed about what is wrong with this setup… I’m using the standard EL7 driver and VLANs with egress-qos-map configured. The pfc*x params are set to 0x08 (and the counters above were all taken for prio 3, so the config should be correct).
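As a sanity check on the 0x08 value, here is a small sketch that decodes a PFC bitmask into priorities and builds the 802.1Q tag value that would carry priority 3 on VLAN 300 (the VLAN ID seen in the tc_wrap output later in this post). It assumes bit N of the mask enables priority N, which is how the per-priority bitmask is usually described. Both helper names are hypothetical.

```python
def pfc_priorities(mask: int) -> list:
    """Return the priorities (0-7) enabled in an 8-bit PFC bitmask.

    Assumption: bit N of the mask corresponds to priority N.
    """
    if not 0 <= mask <= 0xFF:
        raise ValueError("PFC mask must fit in 8 bits")
    return [prio for prio in range(8) if mask & (1 << prio)]


def vlan_tci(pcp: int, vid: int, dei: int = 0) -> int:
    """Build the 16-bit 802.1Q TCI: PCP (3 bits) | DEI (1 bit) | VID (12 bits)."""
    if not (0 <= pcp <= 7 and dei in (0, 1) and 0 <= vid <= 0xFFF):
        raise ValueError("field out of range")
    return (pcp << 13) | (dei << 12) | vid


if __name__ == "__main__":
    print(pfc_priorities(0x08))      # priorities enabled by the mask
    print(hex(vlan_tci(3, 300)))     # tag value for priority 3 on VLAN 300
```

Under that assumption 0x08 enables priority 3 only, so the host-side mask is consistent with the prio 3 counters quoted above.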

It works exactly as expected with global flow control: I see pause frames being sent and received on both Linux hosts.

The adapters are ConnectX-3 Pro.

Another thing: when executing tc_wrap as per the docs, I’m getting:


# /usr/src/tc_wrap.py -i eth5 -u 3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3
skprio2up is availabe only for RoCE in kernels that don't support set_egress_map
Tarrfic classes are set to 8
UP 0
UP 1
UP 2
UP 3
    skprio: 0
    skprio: 1
    skprio: 2 (tos: 8)
    skprio: 3
    skprio: 4 (tos: 24)
    skprio: 5
    skprio: 6 (tos: 16)
    skprio: 7
    skprio: 8
    skprio: 9
    skprio: 10
    skprio: 11
    skprio: 12
    skprio: 13
    skprio: 14
    skprio: 15
    skprio: 0 (vlan 300)
    skprio: 1 (vlan 300)
    skprio: 2 (vlan 300 tos: 8)
    skprio: 3 (vlan 300)
    skprio: 4 (vlan 300 tos: 24)
    skprio: 5 (vlan 300)
    skprio: 6 (vlan 300 tos: 16)
    skprio: 7 (vlan 300)
UP 4
UP 5
UP 6
UP 7

# /usr/src/tc_wrap.py -i eth5
Tarrfic classes are set to 8
UP 0
UP 1
UP 2
UP 3
    skprio: 0 (vlan 300)
    skprio: 1 (vlan 300)
    skprio: 2 (vlan 300 tos: 8)
    skprio: 3 (vlan 300)
    skprio: 4 (vlan 300 tos: 24)
    skprio: 5 (vlan 300)
    skprio: 6 (vlan 300 tos: 16)
    skprio: 7 (vlan 300)
UP 4
UP 5
UP 6
UP 7

Running stock EL7 kernel. Is this expected?
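To make output like the above easier to compare between runs, here is a rough sketch that parses tc_wrap-style text into a mapping from user priority (UP) to the skprio entries listed under it. The parser is based only on the line shapes visible in this post ("UP n" followed by "skprio: n (notes)"). The function names are made up for illustration, and the sample is an abridged excerpt of the second run.

```python
import re


def parse_tc_wrap(output: str) -> dict:
    """Map each UP number to a list of (skprio, notes) tuples listed under it."""
    mapping = {}
    current_up = None
    for raw in output.splitlines():
        line = raw.strip()
        up = re.fullmatch(r"UP\s+(\d+)", line)
        if up:
            current_up = int(up.group(1))
            mapping[current_up] = []
            continue
        sk = re.match(r"skprio:\s*(\d+)(?:\s*\((.*)\))?$", line)
        if sk and current_up is not None:
            mapping[current_up].append((int(sk.group(1)), sk.group(2) or ""))
    return mapping


def ups_with_vlan(mapping: dict, vlan: int) -> list:
    """Return the UPs that have at least one skprio entry tagged with the VLAN."""
    pattern = re.compile(rf"\bvlan {vlan}\b")
    return sorted(
        up for up, entries in mapping.items()
        if any(pattern.search(notes) for _, notes in entries)
    )


# Abridged excerpt of the second run shown above.
SAMPLE = """Tarrfic classes are set to 8
UP 2
UP 3
skprio: 0 (vlan 300)
skprio: 2 (vlan 300 tos: 8)
UP 4
"""

if __name__ == "__main__":
    parsed = parse_tc_wrap(SAMPLE)
    print(parsed)
    print(ups_with_vlan(parsed, 300))
```

For the excerpt, only UP 3 carries VLAN 300 entries, which is the mapping the `-u 3,3,...` invocation was meant to produce.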

I have tested with several adapters, and it seems this happens only on ConnectX-3 Pro (HP_1370110017 model). It works on the regular ConnectX-3 (also an HP model). I have tried several firmware versions, different kernels, and installing Mellanox OFED, yet the rx_pause_prio_* counters are still zero, despite the switch sending pause packets like crazy. Is this some kind of hardware problem? How come the entire 3 Pro line is defective in this regard? Any workarounds? Much appreciated. Thanks.

Hello!

You’ve done some good testing/validation so far and have come across some behavior that is both reproducible and seems like it may be specific to a given hw/sw release of the product.

At this point the recommendation would be to open a support case so that our support engineers can help you triage this further at a detailed level.

Please open up a support ticket here so we can dig more deeply into this:

https://support.mellanox.com/s/

Thanks!

What can be said… After spending several days troubleshooting this, and after updating to the latest available version (3.6.8010), I was able to get it going. But this thing is SO fragile, I can’t even stress enough how fragile it is. You have to execute the config commands in a specific sequence to get the ports going on the switch side.

It’s not about adapters, it’s about the switch!

Thanks everyone.

Well, you also have to reboot the adapters after you have configured PFC on the corresponding switch Ethernet interface. This is weird, but it worked for us.