ConnectX-5 ASAP2 VXLAN offload + bond + OpenStack problem

Hello Mellanox community, I have a strange problem I would like to ask about; maybe someone else has run into it before and can help me.

I have two servers with dual-port ConnectX-5 cards in them. I have set up ASAP2 with VXLAN offloading and OVS using only one of the two interfaces, and it works well with no problems. My OpenStack instances reach close-to-wire performance, and both directions are being offloaded.

For high availability, I would like to set up bonding across the two interfaces while still using offloading.

I was reading this doc: http://www.mellanox.com/related-docs/prod_software/ASAP2_Hardware_Offloading_for_vSwitches_User_Manual_v4.4.pdf

On page 15, it says active-backup, active-active, and LACP all work.

I have an active-backup bond set up on the host; all good, and the VXLAN tunnel comes up using bond0's IP address. I have installed the OFED driver and prepare the cards at boot time, before bond0 comes up:

echo '4' > /sys/class/net/enp129s0f0/device/sriov_numvfs

echo 0000:81:00.2 > /sys/bus/pci/drivers/mlx5_core/unbind

echo 0000:81:00.3 > /sys/bus/pci/drivers/mlx5_core/unbind

echo 0000:81:00.4 > /sys/bus/pci/drivers/mlx5_core/unbind

echo 0000:81:00.5 > /sys/bus/pci/drivers/mlx5_core/unbind

devlink dev eswitch set pci/0000:81:00.0 mode switchdev

ethtool -K enp129s0f0 hw-tc-offload on

echo 0000:81:00.2 > /sys/bus/pci/drivers/mlx5_core/bind

echo 0000:81:00.3 > /sys/bus/pci/drivers/mlx5_core/bind

echo 0000:81:00.4 > /sys/bus/pci/drivers/mlx5_core/bind

echo 0000:81:00.5 > /sys/bus/pci/drivers/mlx5_core/bind

echo '4' > /sys/class/net/enp129s0f1/device/sriov_numvfs

echo 0000:81:01.2 > /sys/bus/pci/drivers/mlx5_core/unbind

echo 0000:81:01.3 > /sys/bus/pci/drivers/mlx5_core/unbind

echo 0000:81:01.4 > /sys/bus/pci/drivers/mlx5_core/unbind

echo 0000:81:01.5 > /sys/bus/pci/drivers/mlx5_core/unbind

devlink dev eswitch set pci/0000:81:00.1 mode switchdev

ethtool -K enp129s0f1 hw-tc-offload on

echo 0000:81:01.2 > /sys/bus/pci/drivers/mlx5_core/bind

echo 0000:81:01.3 > /sys/bus/pci/drivers/mlx5_core/bind

echo 0000:81:01.4 > /sys/bus/pci/drivers/mlx5_core/bind

echo 0000:81:01.5 > /sys/bus/pci/drivers/mlx5_core/bind

So I think at this point the internal eswitch should be ready on both ports.
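For what it's worth, these are the checks I would expect to confirm that state (standard devlink and sysfs queries; PCI addresses as above):

```shell
# Both PFs should now report "mode switchdev" instead of the default "legacy".
devlink dev eswitch show pci/0000:81:00.0
devlink dev eswitch show pci/0000:81:00.1

# List a PF's actual VF PCI addresses from sysfs instead of hard-coding them,
# in case the VF function offset differs on another system.
for vf in /sys/class/net/enp129s0f0/device/virtfn*; do
  basename "$(readlink -f "$vf")"
done
```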

Sadly the doc is not very detailed, but I noticed this example:

ovs-vsctl add-port ovs-sriov enp4s0f0_0

ovs-vsctl add-port ovs-sriov enp4s0f1_0

In this example they add a representor port from each physical port to the OVS bridge.
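My reading, and I may well be wrong here, is that under VF LAG the bond device itself becomes the single OVS uplink, with the representors added alongside it. Adapted to my interface names, that would be roughly:

```shell
# Guess at the VF LAG wiring, adapted from the doc's example:
# bond0 as the uplink, plus one representor per VF in use.
ovs-vsctl add-port ovs-sriov bond0
ovs-vsctl add-port ovs-sriov enp129s0f0_3
```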

Does this mean I have to add both of those representor ports to my VM? If so, what kind of Nova filter rule makes that possible? I honestly don't know.
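For reference, this is how I create the offloaded port today in the single-port setup (network, port, flavor, and image names below are just placeholders from my environment):

```shell
# Create a switchdev-capable direct port and boot the instance with it.
# The binding profile is what lets Neutron schedule onto switchdev VFs.
openstack port create --network private-net \
  --vnic-type direct \
  --binding-profile '{"capabilities": ["switchdev"]}' \
  offload-port1

openstack server create --flavor m1.small --image ubuntu-18.04 \
  --nic port-id=offload-port1 vxlan-test1
```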

Anyway, I proceeded with adding only one port to my VM, and I get duplicated packets when I ping between two offloaded VMs:

root@vxlan-test1:/home/ubuntu# ping 192.168.60.3

PING 192.168.60.3 (192.168.60.3) 56(84) bytes of data.

64 bytes from 192.168.60.3: icmp_seq=1 ttl=64 time=44.6 ms

64 bytes from 192.168.60.3: icmp_seq=2 ttl=64 time=0.227 ms

64 bytes from 192.168.60.3: icmp_seq=2 ttl=64 time=0.266 ms (DUP!)

64 bytes from 192.168.60.3: icmp_seq=3 ttl=64 time=0.157 ms

64 bytes from 192.168.60.3: icmp_seq=3 ttl=64 time=0.219 ms (DUP!)

64 bytes from 192.168.60.3: icmp_seq=4 ttl=64 time=0.199 ms

64 bytes from 192.168.60.3: icmp_seq=5 ttl=64 time=0.212 ms

Also, when I do a packet capture on the representor port, I see:

root@compute-05:/home/ubuntu# tcpdump -nnn -i enp129s0f0_3

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode

listening on enp129s0f0_3, link-type EN10MB (Ethernet), capture size 262144 bytes

07:47:58.405728 IP 192.168.60.15 > 192.168.60.3: ICMP echo request, id 1436, seq 1, length 64

07:47:58.454229 IP 192.168.60.3 > 192.168.60.15: ICMP echo reply, id 1436, seq 1, length 64

07:47:59.407672 IP 192.168.60.3 > 192.168.60.15: ICMP echo reply, id 1436, seq 2, length 64

07:48:01.452416 IP 192.168.60.3 > 192.168.60.15: ICMP echo reply, id 1436, seq 4, length 64

07:48:02.476327 IP 192.168.60.3 > 192.168.60.15: ICMP echo reply, id 1436, seq 5, length 64

07:48:03.468085 ARP, Request who-has 192.168.60.3 tell 192.168.60.15, length 46

07:48:03.479822 ARP, Reply 192.168.60.3 is-at fa:16:3e:ef:73:e1, length 46

07:48:03.491375 ARP, Request who-has 192.168.60.15 tell 192.168.60.3, length 46

07:48:03.491514 ARP, Reply 192.168.60.15 is-at fa:16:3e:d3:c0:7d, length 46

This means only the requests are being offloaded; the replies are not.

At this point, given the lack of documentation, I am pretty much out of ideas about what to do. Has anyone else made bonding with ASAP2 work with OVS and Neutron? What am I missing?

Everything works just fine when not using a bond.

Any help would be appreciated! :)

Thanks,

Zoltan

Hi Zoltan,

Please note that ASAP2 is officially supported on Ubuntu 18.04.2 with the 4.15 kernel and MLNX_OFED 4.6.

There is a known issue where only one direction is offloaded with VF LAG; it has been fixed upstream in OVS v2.11.90.

The latest OVS release is 2.11.1, so to get v2.11.90 you have to clone the OVS source code from the upstream repository and build the binary yourself.
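The standard upstream build steps should work (the install prefix below is just an example; adjust it to match your distribution's packaging):

```shell
# Build and install OVS from the upstream master branch.
git clone https://github.com/openvswitch/ovs.git
cd ovs
./boot.sh
./configure --prefix=/usr --localstatedir=/var --sysconfdir=/etc
make -j"$(nproc)"
make install

# ovs-vswitchd --version should now report 2.11.90.
```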

Regards,

Chen

Hello Chen,

I have done what you suggested and built OVS from the master branch.

ovs_version: "2.11.90" on both of my test machines.

root@compute-05:/home/ubuntu# ovs-dpctl show

system@ovs-system:

lookups: hit:40021 missed:210408 lost:71

flows: 2886

masks: hit:1618069 total:6 hit/pkt:6.46

port 0: ovs-system (internal)

port 1: sg-5f2cc2e6-00 (internal)

port 2: sg-f2217884-44 (internal)

port 3: qg-f59ef8e4-ca (internal)

port 4: enp129s0f0_3

port 5: qr-aff24f76-8d (internal)

port 6: qr-cab3793d-ad (internal)

port 7: qr-a95e91c3-65 (internal)

port 8: fg-c29cccbe-a6 (internal)

port 9: sg-48ab65d3-bf (internal)

port 10: br-int (internal)

port 11: qr-a56adfbd-ab (internal)

port 12: vxlan_sys_4789 (vxlan: packet_type=ptap)

port 13: br-tun (internal)

port 14: bond0.2410

port 15: br-ex (internal)

port 16: qvo00145608-5c

port 17: qvoe15dcb66-73

My representor port (port 4: enp129s0f0_3) is there.

root@compute-05:/home/ubuntu# cat /proc/net/bonding/bond0

Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation

Transmit Hash Policy: layer2 (0)

MII Status: up

MII Polling Interval (ms): 100

Up Delay (ms): 0

Down Delay (ms): 0

My bond0 interface is up.
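To see whether flows are actually landing in hardware, I run these checks (the first two commands are available in OVS 2.11; the tc one assumes the representor name from above):

```shell
# Confirm OVS hardware offload is enabled at all.
ovs-vsctl get Open_vSwitch . other_config:hw-offload

# Dump only the flows that OVS reports as offloaded to hardware.
ovs-appctl dpctl/dump-flows type=offloaded

# Cross-check at the tc layer on the representor port.
tc -s filter show dev enp129s0f0_3 ingress
```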

I still get this result on tcpdump:

root@compute-05:/home/ubuntu# tcpdump -i enp129s0f0_3

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode

listening on enp129s0f0_3, link-type EN10MB (Ethernet), capture size 262144 bytes

12:39:47.271830 IP 192.0.2.3 > 192.0.2.20: ICMP echo request, id 2014, seq 1, length 64

12:39:47.338971 IP 192.0.2.20 > 192.0.2.3: ICMP echo reply, id 2014, seq 1, length 64

12:39:48.273376 IP 192.0.2.20 > 192.0.2.3: ICMP echo reply, id 2014, seq 2, length 64

12:39:49.299865 IP 192.0.2.20 > 192.0.2.3: ICMP echo reply, id 2014, seq 3, length 64

12:39:50.323863 IP 192.0.2.20 > 192.0.2.3: ICMP echo reply, id 2014, seq 4, length 64

12:39:51.347916 IP 192.0.2.20 > 192.0.2.3: ICMP echo reply, id 2014, seq 5, length 64

What am I missing?

Thank you in advance!

Zoltan

Also we get this error from the driver. Is this normal?