I’m hoping that sharing these tricks will save other Linux less-than-masters the (embarrassingly long) time it took us to figure them out. They solved our kernel-loading woes with multiple NICs, provisioning over bonded interfaces, and local disks. Since clusters commonly provision nodes with one disk (if any) for the OS, using a single ethernet port, it’s not trivial to glean all of this from the admin manual. Maybe this will help inform future BCM patches or documentation updates. You may or may not hit the same issues if you’re not running Ubuntu.
Use net.ifnames=1 in kernel parameters for “slot-based” consistent network interface identifiers
The traditional ‘eth[0-9]’ names are allocated in the order the kernel finishes loading the interfaces, which is inconsistent. The ‘altnames’ normally present and usable for the interfaces in the booted OS (eno*, enp*…) don’t seem to resolve in the node-installer context.
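To see what you’re working with, boot a node into a full OS and list the interfaces. A minimal sketch (the interface names and MACs below are hypothetical; yours will differ):

# List interfaces and their current primary names
ip -br link show
#   eno1np0   UP   b4:96:91:xx:xx:01 <BROADCAST,MULTICAST,UP,LOWER_UP>
#   eno2np1   UP   b4:96:91:xx:xx:02 <BROADCAST,MULTICAST,UP,LOWER_UP>

# Any alternative names show up as 'altname' lines in the full output
ip link show eno1np0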
net.ifnames=1 makes those altnames the primary and only identifier for those interfaces. These names can (and must) then be used in the interface configs, and the node-installer will work. If you have Dell hardware, biosdevname=1 instead is supposed to work similarly. You may need to first boot a node into Linux however you can, just to determine these interface names. Then:
- Create both physical slave interfaces using these names, IP 0.0.0.0, network “none”
- Create bond0 with the desired IP on internalnet, ‘mode’ (LACP = 4), and the two slave member interfaces
- Set bond0 as the node’s provisioning interface. Don’t commit changes before this unless you want BCM to complain.
- Fiber interfaces will also get a special ‘slot name’ instead of e.g. ib0
- Make the BMC interface as you normally would
Then you can commit and clone this node object and need only type the IPs.
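For reference, the cmsh flow for one node looked roughly like the sketch below. Treat it as an approximation from memory rather than exact syntax: the node name (node001), interface names (ens1f0/ens1f1), and IP are placeholders, property names may vary by BCM version, and the # comments are annotations rather than cmsh input, so lean on tab completion and the admin manual.

cmsh
device use node001
interfaces
add physical ens1f0            # first slave, slot-based name
set ip 0.0.0.0
set network none
exit
add physical ens1f1            # second slave
set ip 0.0.0.0
set network none
exit
add bond bond0                 # bonded provisioning interface on internalnet
set ip 10.141.0.10
set network internalnet
set mode 4                     # LACP
set interfaces ens1f0 ens1f1
exit
# add the BMC interface as usual, then back at the device level:
exit
set provisioninginterface bond0
commit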
On our head node we also made an alias interface, bond0:i, on dracnet, so the same bonded connection gets used for that traffic as well. Assuming your switches are set up properly, you’ll be good to go.
Use udev rules to solve a similar devname shuffling problem with local disks
Our nodes boot from NVMe drives in M.2 RAID modules, and have additional NVMe scratch drives in PCIe slots. As with network interfaces, if you have multiple drives of the same variety, the traditional block device names in /dev/ get inconsistent numbers based on kernel loading order; in our case /dev/nvme[#]n1.
Since UUIDs are unique per drive and partition labels would have to be preset, the only general solution we found was to copy in a udev rules file, /cm/node-installer/etc/udev/rules.d/99-z.rules (also to /cm/images/[image]/etc/udev/rules.d/99-z.rules, though we’re not sure both are necessary). This adds unique symlinks in /dev/ for the disks based on hardware attributes that are consistent across nodes: the PCIe slot is ideal and works for most cases; we used the parent device model for the M.2 NVMe RAID cards.
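Concretely, placing the file meant something like this ([image] stands for your software image’s directory name, and 99-z.rules is just what we called our rules file):

cp 99-z.rules /cm/node-installer/etc/udev/rules.d/
cp 99-z.rules /cm/images/[image]/etc/udev/rules.d/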
The attributes you can match on are the ones shown in the output of e.g. udevadm info -a /dev/nvme0n1 (ignore any trailing whitespace in quoted string values). You can assert any attributes from the device itself, plus any from one parent device in the chain. The first device that matches all of the device attributes and has a parent device matching all of the parent attributes gets the symlink.
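For illustration, the relevant part of that attribute walk looks something like this (abbreviated, with hypothetical paths and values; note the trailing spaces padding the model string, which you drop when writing the rule):

udevadm info -a /dev/nvme0n1
#   looking at device '/devices/pci0000:c0/.../nvme/nvme0/nvme0n1':
#     KERNEL=="nvme0n1"
#     SUBSYSTEM=="block"
#   looking at parent device '/devices/pci0000:c0/.../nvme/nvme0':
#     KERNELS=="nvme0"
#     SUBSYSTEMS=="nvme"
#     ATTRS{model}=="Dell BOSS-N1            "
#     ATTRS{address}=="0000:c1:00.0"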
In our case, to ensure that the symlinks pointed to the block device itself (and not the parent controller at /dev/nvme#, or any incidental existing partition nvme#n#p#), we found these rules to consistently work:
# boss-n1
SUBSYSTEM=="block", KERNEL=="nvme?n1", ATTRS{model}=="Dell BOSS-N1", SYMLINK+="nvmen1"
# pcie nvme's
SUBSYSTEM=="block", KERNEL=="nvme?n1", ATTRS{address}=="0000:c1:00.0", SYMLINK+="nvmepc1"
SUBSYSTEM=="block", KERNEL=="nvme?n1", ATTRS{address}=="0000:c2:00.0", SYMLINK+="nvmepc2"
SUBSYSTEM=="block", KERNEL=="nvme?n1", ATTRS{address}=="0000:43:00.0", SYMLINK+="nvmep43"
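Once the file is in place, you can exercise the rules on a running node without a reboot to sanity-check that the symlinks appear (assuming the rule file and names above):

udevadm control --reload-rules
udevadm trigger --subsystem-match=block
ls -l /dev/nvmen1 /dev/nvmepc1 /dev/nvmepc2 /dev/nvmep43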
If you always use these symlinks in your disk layouts (e.g. /dev/nvmepc1), the node installer will correctly resolve them as it does its thing. This also worked for defining a software RAID volume in BCM. You could also use them in fsmounts if you mount disks that way; generally you can just use /dev/disk/by-path links there, but not every device gets an appropriate /dev/disk link, and the custom symlinks look neater.
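As an example, a scratch-disk entry in a disk layout can then point at the symlink. A minimal sketch in the spirit of the disksetup XML from the admin manual (element names from memory and the mount point hypothetical, so double-check against your BCM version’s schema):

<diskSetup>
  <device>
    <blockdev>/dev/nvmepc1</blockdev>
    <partition id="scratch">
      <size>max</size>
      <type>linux</type>
      <filesystem>xfs</filesystem>
      <mountPoint>/scratch</mountPoint>
      <mountOptions>defaults,noatime</mountOptions>
    </partition>
  </device>
</diskSetup>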
Shorewall may be a problem
One other snag we hit is that Shorewall seems to misbehave and get in the way of head node externalnet traffic, at least on a “stock” Ubuntu 22 BCM install. Disabling the service autostart and the service itself in BCM, and even the system daemon via systemctl, appears to be insufficient to prevent it from firing at boot and reinstating the problematic iptables rules; you may need to just apt purge shorewall on the system. Just be sure to replace the NAT rule masquerading the externalnet interface for your nodes’ internet access, along with any other firewall policies you need at the head node level. We just used iptables -t nat -A POSTROUTING -o [external interface] -j MASQUERADE, installed and ran iptables-persistent, and get by with the dedicated network hardware doing the firewalling.
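Roughly, the cleanup looked like the sketch below (keep [external interface] as a placeholder for your own device name, and add back any other head node firewall rules you rely on):

systemctl disable --now shorewall        # often not enough by itself, per above
apt purge shorewall
iptables -t nat -A POSTROUTING -o [external interface] -j MASQUERADE
apt install iptables-persistent          # offers to save the current rules
netfilter-persistent save                # re-save after any later changes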