VXLAN woes

I’ve been having some major problems with VXLANs on switches with Broadcom Trident II+ ASICs and could use some assistance. I’m deploying a bunch of new hypervisors and want to be able to move VMs from one to the other at will. Each hypervisor plugs into a pair of leaf switches operating as an MLAG pair. Here’s an example of what that config might look like on one of my leaf switches.

interface hv04
bond-slaves swp7
bridge-pvid 1200
bridge-vids 1718 1740
clag-id 7
mstpctl-bpduguard yes
mstpctl-portadminedge yes

Then, my VLAN and VXLAN:

interface vlan1740
address-virtual 00:00:5E:00:01:B0 10.1.1.1/28
vlan-id 1740
vlan-raw-device bridge

interface vni1740
bridge-access 1740
bridge-learning off
mstpctl-bpduguard yes
mstpctl-portbpdufilter yes
vxlan-id 1740
# local tunnel IP is the switch’s loopback IP
vxlan-local-tunnelip 10.0.128.2

I then advertise each of my VNIs via BGP:

router bgp 4210000002

address-family ipv4 unicast
redistribute connected

address-family l2vpn evpn
neighbor FABRIC activate
advertise-all-vni

Right off the bat, if I only have this configuration on a single leaf pair and deploy a VM on the hypervisor in VLAN 1740, I start to see some problems. Theoretically, since I’m redistributing all of my connected routes via BGP, everything on my network should be able to get to VLAN 1740. In fact, if I look at my spine switches, I see routes to 10.1.1.0/28 from this leaf pair. In reality, my VM can only talk to very specific devices. As odd as this sounds, I find myself only being able to ping other VMs, either on this hypervisor or on others in other racks. The VLANs those VMs belong to may or may not have a VXLAN attached to them. If I try to ping other devices, I see the ICMP request hit them, but the reply gets lost before it reaches my VM. Now, I imagine this doesn’t have anything to do with whether the device is a physical one or a VM, but rather with whether traffic to/from it is VLAN tagged or not.

NVIDIA has posted some information on some of the caveats of Broadcom Trident II+ switches here (Inter-subnet Routing | Cumulus Linux 4.1) and the workaround of basically turning every VLAN into a VXLAN sorta works. When I tried this on a pair of leaf switches, I found that VMs hanging off of other leaf switches, also in VXLANs, could ping all of the physical servers in the rack where I made the updates. However, this seemed to kill communication to the hypervisor in the rack where I made the update, as well as to the VMs on it. I rolled back my changes and suddenly had communication issues with most devices in the rack. I eventually solved those by ifdown’ing and ifup’ing every VLAN on the switches. This didn’t fix the issue with the hypervisor, though, and I found that traffic leaving the hypervisor toward just one of the leaf switches was getting blackholed. The only way to fix that issue was to delete the LACP bond to the hypervisor on the problematic switch and then re-create it. I actually had a similar issue at one point on another of my hypervisors, and this also solved that issue.
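
To make "turning every VLAN into a VXLAN" concrete, here is a sketch of what that could look like for the other VLAN on the hv04 bond, VLAN 1718. It simply mirrors the vni1740 stanza above. The interface name vni1718 and the choice of reusing the VLAN number as the VNI are assumptions for illustration, not something taken from the NVIDIA page.

```
interface vni1718
bridge-access 1718
bridge-learning off
mstpctl-bpduguard yes
mstpctl-portbpdufilter yes
vxlan-id 1718
vxlan-local-tunnelip 10.0.128.2
```

The same pattern would repeat for each remaining VLAN on the bridge, with the matching stanza on the MLAG peer.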

I’m really hesitant to move forward here. Theoretically, I can turn each of my VLANs into a VXLAN on each of my switches and everything should work, and I just might need to re-create the bond to my hypervisor. However, it worries me that I could suddenly have similar issues with just about any bonded device on my network at some other time. My switches are on version 4.2 and I can only upgrade to 4.3 since they have Broadcom ASICs. I’ve taken a look through the release notes, though, and I doubt the upgrade will make any difference.

I’m thinking about taking a pair of leaf switches and turning every VLAN into a VXLAN again. I kind of hope this breaks my hypervisor so I can troubleshoot things and possibly come up with a solution. When things were broken, I ran some “net show bgp l2vpn evpn route” commands on my switches and the problematic switch was both sending and installing routes. What else could I look at? What could I be missing here?
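
Since the blackholing only showed up toward one switch of the MLAG pair, one thing that might help next time is capturing the same state on both peers and diffing it. Below is a rough sketch of a collection script, not a tested procedure. The `run` and `collect_state` helpers are hypothetical names, the bond and VNI names come from the config above, and it only covers control-plane and kernel state, not what is actually programmed into the ASIC.

```shell
#!/bin/sh
# Hypothetical helper: dump state for one bond and one VNI so the output
# from both MLAG peers can be diffed. Defaults are the names from the post.
BOND="${BOND:-hv04}"
VNI_IF="${VNI_IF:-vni1740}"

run() {
    # Print the command as a header, then its output.
    # Commands that are not installed on this host are noted and skipped.
    echo "### $*"
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1
    else
        echo "skipped: $1 not found"
    fi
}

collect_state() {
    run net show clag                      # MLAG peer status and per-bond clag state
    run net show evpn vni                  # VNIs known to EVPN
    run net show bgp l2vpn evpn summary    # EVPN BGP session state
    run ip -d link show "$VNI_IF"          # VXLAN device details (VNI, local tunnel IP)
    run bridge fdb show dev "$VNI_IF"      # MAC entries the kernel holds for the VNI
    run bridge fdb show dev "$BOND"        # MAC entries on the hypervisor bond
    run cat "/proc/net/bonding/$BOND"      # LACP state of the bond members
}
```

The idea would be to source this on each leaf, run `collect_state` with the output redirected to a file named after the switch, and then diff the two files. Any entry present on the healthy peer but missing on the problematic one would be a starting point.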

Also, FWIW, I’ve tested my setup in GNS3 on version 4.2 and everything works great. It sure seems like all these issues come down to the fact that I’m using Broadcom switches.