Hello dear Mellanox community, I have a weird problem and I'm hoping someone here has run into it before and can help me.
I have 2 servers with dual-port ConnectX-5 cards in them. I’ve set up ASAP2 with VXLAN offloading and OVS using only one of the 2 interfaces, and it works well, no problems. My OpenStack instances reach close-to-wire performance. Both directions are being offloaded and working absolutely fine.
For high availability, I would like to bond the 2 interfaces and still use offloading.
I was reading this doc: http://www.mellanox.com/related-docs/prod_software/ASAP2_Hardware_Offloading_for_vSwitches_User_Manual_v4.4.pdf
On page 15, it says that active-backup, active-active and LACP bonds also work.
I have an active-backup bond set up on the host and all is good: the VXLAN tunnel comes up using bond0’s IP address. I have installed the OFED driver, and I prepare the cards at boot time, before bond0 comes up:
echo 4 > /sys/class/net/enp129s0f0/device/sriov_numvfs
echo 0000:81:00.2 > /sys/bus/pci/drivers/mlx5_core/unbind
echo 0000:81:00.3 > /sys/bus/pci/drivers/mlx5_core/unbind
echo 0000:81:00.4 > /sys/bus/pci/drivers/mlx5_core/unbind
echo 0000:81:00.5 > /sys/bus/pci/drivers/mlx5_core/unbind
devlink dev eswitch set pci/0000:81:00.0 mode switchdev
ethtool -K enp129s0f0 hw-tc-offload on
echo 0000:81:00.2 > /sys/bus/pci/drivers/mlx5_core/bind
echo 0000:81:00.3 > /sys/bus/pci/drivers/mlx5_core/bind
echo 0000:81:00.4 > /sys/bus/pci/drivers/mlx5_core/bind
echo 0000:81:00.5 > /sys/bus/pci/drivers/mlx5_core/bind
echo 4 > /sys/class/net/enp129s0f1/device/sriov_numvfs
echo 0000:81:01.2 > /sys/bus/pci/drivers/mlx5_core/unbind
echo 0000:81:01.3 > /sys/bus/pci/drivers/mlx5_core/unbind
echo 0000:81:01.4 > /sys/bus/pci/drivers/mlx5_core/unbind
echo 0000:81:01.5 > /sys/bus/pci/drivers/mlx5_core/unbind
devlink dev eswitch set pci/0000:81:00.1 mode switchdev
ethtool -K enp129s0f1 hw-tc-offload on
echo 0000:81:01.2 > /sys/bus/pci/drivers/mlx5_core/bind
echo 0000:81:01.3 > /sys/bus/pci/drivers/mlx5_core/bind
echo 0000:81:01.4 > /sys/bus/pci/drivers/mlx5_core/bind
echo 0000:81:01.5 > /sys/bus/pci/drivers/mlx5_core/bind
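The two ports go through the same sequence (create VFs, unbind them, switch the PF to switchdev, enable hw-tc-offload, rebind), so the steps above could be folded into one small helper. This is only a sketch: setup_pf, run and DRY_RUN are made-up names for illustration, the interface names and PCI addresses are the ones from this post, and the VF count is assumed to equal the number of VF addresses passed in. By default it only prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch: the same per-port steps as above, wrapped in a function.
# DRY_RUN=1 (default) prints each command; DRY_RUN=0 executes it (needs root).
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        sh -c "$*"
    fi
}

# setup_pf <pf-interface> <pf-pci-address> <vf-pci-address>...
setup_pf() {
    pf_if="$1"; pf_pci="$2"; shift 2
    # Assumption: create as many VFs as VF addresses were given.
    run "echo $# > /sys/class/net/$pf_if/device/sriov_numvfs"
    for vf in "$@"; do
        run "echo $vf > /sys/bus/pci/drivers/mlx5_core/unbind"
    done
    run "devlink dev eswitch set pci/$pf_pci mode switchdev"
    run "ethtool -K $pf_if hw-tc-offload on"
    for vf in "$@"; do
        run "echo $vf > /sys/bus/pci/drivers/mlx5_core/bind"
    done
}

setup_pf enp129s0f0 0000:81:00.0 0000:81:00.2 0000:81:00.3 0000:81:00.4 0000:81:00.5
setup_pf enp129s0f1 0000:81:00.1 0000:81:01.2 0000:81:01.3 0000:81:01.4 0000:81:01.5
```

Running it as-is prints the 20 commands listed above in the same order, which makes it easy to compare against what actually runs at boot.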
So I think at this point the internal eswitch should be ready on both ports.
Sadly the doc is not super detailed, but I noticed this example:
ovs-vsctl add-port ovs-sriov enp4s0f0_0
ovs-vsctl add-port ovs-sriov enp4s0f1_0
In this example they add one representor port from each physical port to the OVS bridge.
Does this mean I have to add both of those representor ports to my VM? If so, what kind of Nova filter rule makes that possible? I honestly don’t know.
Anyway, I proceeded by adding only one port to my VM, and I get duplicated packets when I ping between 2 offloaded VMs:
root@vxlan-test1:/home/ubuntu# ping 192.168.60.3
PING 192.168.60.3 (192.168.60.3) 56(84) bytes of data.
64 bytes from 192.168.60.3: icmp_seq=1 ttl=64 time=44.6 ms
64 bytes from 192.168.60.3: icmp_seq=2 ttl=64 time=0.227 ms
64 bytes from 192.168.60.3: icmp_seq=2 ttl=64 time=0.266 ms (DUP!)
64 bytes from 192.168.60.3: icmp_seq=3 ttl=64 time=0.157 ms
64 bytes from 192.168.60.3: icmp_seq=3 ttl=64 time=0.219 ms (DUP!)
64 bytes from 192.168.60.3: icmp_seq=4 ttl=64 time=0.199 ms
64 bytes from 192.168.60.3: icmp_seq=5 ttl=64 time=0.212 ms
Also, when I do a packet capture on the representor port, I see:
root@compute-05:/home/ubuntu# tcpdump -nnn -i enp129s0f0_3
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on enp129s0f0_3, link-type EN10MB (Ethernet), capture size 262144 bytes
07:47:58.405728 IP 192.168.60.15 > 192.168.60.3: ICMP echo request, id 1436, seq 1, length 64
07:47:58.454229 IP 192.168.60.3 > 192.168.60.15: ICMP echo reply, id 1436, seq 1, length 64
07:47:59.407672 IP 192.168.60.3 > 192.168.60.15: ICMP echo reply, id 1436, seq 2, length 64
07:48:01.452416 IP 192.168.60.3 > 192.168.60.15: ICMP echo reply, id 1436, seq 4, length 64
07:48:02.476327 IP 192.168.60.3 > 192.168.60.15: ICMP echo reply, id 1436, seq 5, length 64
07:48:03.468085 ARP, Request who-has 192.168.60.3 tell 192.168.60.15, length 46
07:48:03.479822 ARP, Reply 192.168.60.3 is-at fa:16:3e:ef:73:e1, length 46
07:48:03.491375 ARP, Request who-has 192.168.60.15 tell 192.168.60.3, length 46
07:48:03.491514 ARP, Reply 192.168.60.15 is-at fa:16:3e:d3:c0:7d, length 46
Offloaded traffic bypasses the representor, so only the first request shows up there while every reply does. That means only the requests are being offloaded and the replies are not.
At this point, because of the lack of documentation, I am pretty much out of ideas about what to do. Has anyone else made bonding with ASAP2 work with OVS and Neutron? What am I missing?
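To make that comparison less eyeball-driven, the capture text can be tallied by direction. A minimal sketch, assuming tcpdump's default text output as shown above; count_icmp is a made-up helper name, not part of any tool:

```shell
#!/bin/sh
# Sketch: count ICMP echo requests and replies in tcpdump text read from stdin.
# Packets seen on a representor are the ones handled in software, so a
# direction that keeps showing up here is presumably not offloaded.
count_icmp() {
    awk '
        /ICMP echo request/ { req++ }
        /ICMP echo reply/   { rep++ }
        END { printf "requests=%d replies=%d\n", req, rep }
    '
}
```

For example, `tcpdump -nnn -l -c 20 -i enp129s0f0_3 icmp | count_icmp` would summarize 20 packets from the representor used above. As a cross-check, `ovs-appctl dpctl/dump-flows type=offloaded` and `tc -s filter show dev enp129s0f0_3 ingress` should show which flows actually made it into hardware, assuming OVS is running with other_config:hw-offload=true.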
Everything works just fine when not using a bond.
Any help would be appreciated! :)
Thanks,
Zoltan