I’m hoping that sharing these tricks will save other Linux less-than-masters the (embarrassingly long) time it took us to figure them out. They solved our kernel-loading woes with multiple NICs, provisioning over bonded interfaces, and local disks. Since clusters commonly provision nodes with one disk (if any) for the OS, using a single ethernet port, it’s not trivial to glean all of this from the admin manual. Maybe this will help inform future BCM patches or documentation updates. You may or may not hit the same issues if you’re not running Ubuntu.
Use net.ifnames=1 in kernel parameters for “slot-based” consistent network interface identifiers
The traditional ‘eth[0-9]’ names are allocated in the order the kernel finishes loading the interfaces, which is inconsistent. The ‘altnames’ normally present and usable for the interfaces in the booted OS (eno*, enp*…) don’t seem to resolve in the node-installer context.
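To see what you’re working with, boot a node into a full OS and list the interfaces. A minimal sketch (the interface names and MACs below are hypothetical; yours will differ):

# List interfaces and their current primary names
ip -br link show
#   eno1np0   UP   b4:96:91:xx:xx:01 <BROADCAST,MULTICAST,UP,LOWER_UP>
#   eno2np1   UP   b4:96:91:xx:xx:02 <BROADCAST,MULTICAST,UP,LOWER_UP>

# Any alternative names show up as 'altname' lines in the full output
ip link show eno1np0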
net.ifnames=1 makes those altnames the primary and only identifier for those interfaces. These names can (and must) then be used in the interface configs, and the node-installer will work. If you have Dell hardware, biosdevname=1 instead is supposed to work similarly. You may need to first boot a node into Linux however you can, just to determine these interface names. Then:
- Create both physical slave interfaces using these names, IP 0.0.0.0, network “none”
- Create bond0 with the desired IP on internalnet, ‘mode’ (LACP = 4), and the two slave member interfaces
- Set bond0 as the node’s provisioning interface. Don’t commit changes before this unless you want BCM to complain.
- Fiber interfaces will also get a special ‘slot name’ instead of e.g. ib0
- Make the BMC interface as you normally would
Then you can commit and clone this node object and need only type the IPs.
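For reference, the cmsh flow for one node looked roughly like the sketch below. Treat it as an approximation from memory rather than exact syntax: the node name (node001), interface names (ens1f0/ens1f1), and IP are placeholders, property names may vary by BCM version, and the # comments are annotations rather than cmsh input, so lean on tab completion and the admin manual.

cmsh
device use node001
interfaces
add physical ens1f0            # first slave, slot-based name
set ip 0.0.0.0
set network none
exit
add physical ens1f1            # second slave
set ip 0.0.0.0
set network none
exit
add bond bond0                 # bonded provisioning interface on internalnet
set ip 10.141.0.10
set network internalnet
set mode 4                     # LACP
set interfaces ens1f0 ens1f1
exit
# add the BMC interface as usual, then back at the device level:
exit
set provisioninginterface bond0
commit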
On our head node we also made an alias interface, bond0:i, on dracnet, so the same bonded connection gets used for that traffic as well. Assuming your switches are set up properly, you’ll be good to go.
Use udev rules to solve a similar devname shuffling problem with local disks
Our nodes boot from NVMe drives in M.2 RAID modules, and have additional NVMe scratch drives in PCIe slots. As with network interfaces, if you have multiple drives of the same variety, the traditional block device names in /dev/ get inconsistent numbers based on kernel loading order; in our case /dev/nvme[#]n1.
Since UUIDs are unique per drive and partition labels would have to be preset, the only general solution we found was to copy in a udev rules file, /cm/node-installer/etc/udev/rules.d/99-z.rules (also to /cm/images/[image]/etc/udev/rules.d/99-z.rules, though we’re not sure both are necessary). This adds unique symlinks in /dev/ for the disks based on hardware attributes that are consistent across nodes: the PCIe slot is ideal and works for most cases; we used the parent device model for the M.2 NVMe RAID cards.
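Concretely, placing the file meant something like this ([image] stands for your software image’s directory name, and 99-z.rules is just what we called our rules file):

cp 99-z.rules /cm/node-installer/etc/udev/rules.d/
cp 99-z.rules /cm/images/[image]/etc/udev/rules.d/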
The attributes you can match on are the ones shown in the output of e.g. udevadm info -a /dev/nvme0n1 (ignore any trailing whitespace in quoted string values). You can assert any attributes from the device itself, plus any from one parent device in the chain. The first device that matches all of the device attributes and has a parent device matching all of the parent attributes gets the symlink.
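For illustration, the relevant part of that attribute walk looks something like this (abbreviated, with hypothetical paths and values; note the trailing spaces padding the model string, which you drop when writing the rule):

udevadm info -a /dev/nvme0n1
#   looking at device '/devices/pci0000:c0/.../nvme/nvme0/nvme0n1':
#     KERNEL=="nvme0n1"
#     SUBSYSTEM=="block"
#   looking at parent device '/devices/pci0000:c0/.../nvme/nvme0':
#     KERNELS=="nvme0"
#     SUBSYSTEMS=="nvme"
#     ATTRS{model}=="Dell BOSS-N1            "
#     ATTRS{address}=="0000:c1:00.0"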
In our case, to ensure that the symlinks pointed to the block device itself (and not the parent controller at /dev/nvme#, or any incidental existing partition nvme#n#p#), we found these rules to consistently work:
# boss-n1
SUBSYSTEM=="block", KERNEL=="nvme?n1", ATTRS{model}=="Dell BOSS-N1", SYMLINK+="nvmen1"
# pcie nvme's
SUBSYSTEM=="block", KERNEL=="nvme?n1", ATTRS{address}=="0000:c1:00.0", SYMLINK+="nvmepc1"
SUBSYSTEM=="block", KERNEL=="nvme?n1", ATTRS{address}=="0000:c2:00.0", SYMLINK+="nvmepc2"
SUBSYSTEM=="block", KERNEL=="nvme?n1", ATTRS{address}=="0000:43:00.0", SYMLINK+="nvmep43"
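Once the file is in place, you can exercise the rules on a running node without a reboot to sanity-check that the symlinks appear (assuming the rule file and names above):

udevadm control --reload-rules
udevadm trigger --subsystem-match=block
ls -l /dev/nvmen1 /dev/nvmepc1 /dev/nvmepc2 /dev/nvmep43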
If you always use these symlinks in your disk layouts (e.g. /dev/nvmepc1), the node installer will correctly resolve them as it does its thing. This also worked for defining a software RAID volume in BCM. You could also use them in fsmounts if you mount disks that way; generally you can just use /dev/disk/by-path links there, but not every device gets an appropriate /dev/disk link, and the custom symlinks look neater.
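As an example, a scratch-disk entry in a disk layout can then point at the symlink. A minimal sketch in the spirit of the disksetup XML from the admin manual (element names from memory and the mount point hypothetical, so double-check against your BCM version’s schema):

<diskSetup>
  <device>
    <blockdev>/dev/nvmepc1</blockdev>
    <partition id="scratch">
      <size>max</size>
      <type>linux</type>
      <filesystem>xfs</filesystem>
      <mountPoint>/scratch</mountPoint>
      <mountOptions>defaults,noatime</mountOptions>
    </partition>
  </device>
</diskSetup>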
Shorewall may be a problem
One other snag we hit is that Shorewall seems to misbehave and get in the way of head node externalnet traffic, at least on a “stock” Ubuntu 22 BCM install. Disabling the service autostart and the service itself in BCM, and even the system daemon via systemctl, appears to be insufficient to prevent it from firing at boot and reinstating the problematic iptables rules; you may need to just apt purge shorewall on the system. Just be sure to replace the NAT rule masquerading the externalnet interface for your nodes’ internet access, along with any other firewall policies you need at the head node level. We just used iptables -t nat -A POSTROUTING -o [external interface] -j MASQUERADE, installed and ran iptables-persistent, and get by with the dedicated network hardware doing the firewalling.
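Roughly, the cleanup looked like the sketch below (keep [external interface] as a placeholder for your own device name, and add back any other head node firewall rules you rely on):

systemctl disable --now shorewall        # often not enough by itself, per above
apt purge shorewall
iptables -t nat -A POSTROUTING -o [external interface] -j MASQUERADE
apt install iptables-persistent          # offers to save the current rules
netfilter-persistent save                # re-save after any later changes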