Enabling a second InfiniBand adapter for IPoIB under CentOS 7.4

Running CentOS 7.4 with MLNX_OFED_LINUX-4.3-1.0.1.0 (OFED-4.3-1.0.1)

with TWO dual-port ConnectX-4 100Gbit EDR adapters for 4 ports total.

ibstat shows all 4 ports:

ibstat -l

mlx5_0

mlx5_1

mlx5_2

mlx5_3

ibdev2netdev -v

0000:17:00.0 mlx5_0 (MT4115 - MT1609X08073) CX456A - ConnectX-4 QSFP28 fw 12.22.1002 port 1 (ACTIVE) ==> ib0 (Up)

0000:17:00.1 mlx5_1 (MT4115 - MT1609X08073) CX456A - ConnectX-4 QSFP28 fw 12.22.1002 port 1 (ACTIVE) ==> ib1 (Up)

0000:65:00.0 mlx5_2 (MT4115 - MT1545X04735) CX456A - ConnectX-4 QSFP fw 12.22.1002 port 1 (ACTIVE) ==> ib2 (Down)

0000:65:00.1 mlx5_3 (MT4115 - MT1545X04735) CX456A - ConnectX-4 QSFP fw 12.22.1002 port 1 (ACTIVE) ==> ib3 (Down)

All 4 ports are active for the SRP protocol for storage.

There are scripts in /etc/sysconfig/network-scripts for “ifcfg-ib0” and “ifcfg-ib1” but not for devices ib2 and ib3.

How do you get the initial “ifcfg” scripts created for ib2 and ib3? The critical setting in these files is the UUID for the IPoIB device, which I don’t know how to query or generate.

I suspect there is a config file setting somewhere that needs to be set, but I can't determine which configuration setting is limiting the IPoIB scan to only 2 devices.

There ARE entries in /sys/class/net for all four devices … ib0, ib1, ib2, and ib3:

ls -al /sys/class/net

total 0

drwxr-xr-x 2 root root 0 Apr 23 09:24 .

drwxr-xr-x 63 root root 0 Apr 23 08:28 ..

lrwxrwxrwx 1 root root 0 Apr 23 09:24 em1 -> ../../devices/pci0000:00/0000:00:1f.6/net/em1

lrwxrwxrwx 1 root root 0 Apr 23 09:24 ib0 -> ../../devices/pci0000:16/0000:16:00.0/0000:17:00.0/net/ib0

lrwxrwxrwx 1 root root 0 Apr 23 09:24 ib1 -> ../../devices/pci0000:16/0000:16:00.0/0000:17:00.1/net/ib1

lrwxrwxrwx 1 root root 0 Apr 23 09:24 ib2 -> ../../devices/pci0000:64/0000:64:00.0/0000:65:00.0/net/ib2

lrwxrwxrwx 1 root root 0 Apr 23 09:24 ib3 -> ../../devices/pci0000:64/0000:64:00.0/0000:65:00.1/net/ib3

lrwxrwxrwx 1 root root 0 Apr 23 09:24 lo -> ../../devices/virtual/net/lo

lrwxrwxrwx 1 root root 0 Apr 23 09:24 p1p1 -> ../../devices/pci0000:b2/0000:b2:00.0/0000:b3:00.0/net/p1p1

lrwxrwxrwx 1 root root 0 Apr 23 09:24 p1p2 -> ../../devices/pci0000:b2/0000:b2:00.0/0000:b3:00.1/net/p1p2

Thanks for your help.


Follow-on … I found that the command "uuidgen" can be used to generate the UUIDs for ib2 and ib3 … and with this information I can create the ifcfg-ib2 and ifcfg-ib3 files.

This should solve my problem, but does not identify why the files were not created in the first place.
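
As a sketch, this is the kind of minimal ifcfg-ib2 that approach produces (placeholders only … the UUID comes from uuidgen, and the address, prefix, and connected-mode choice need to be adjusted for your own IPoIB subnet):

# /etc/sysconfig/network-scripts/ifcfg-ib2 … minimal sketch, values are placeholders
DEVICE=ib2
NAME=ib2
TYPE=InfiniBand
ONBOOT=yes
BOOTPROTO=static
IPADDR=192.168.2.12
PREFIX=24
CONNECTED_MODE=no
# paste the output of uuidgen here
UUID=<output of uuidgen>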

Hello and thank you for taking the time to reply.

I actually was able to solve the problem myself, but it was tricky.

The key piece of information was that there were two different generations of Mellanox IB cards in the system. One was a ConnectX-3-based 56 Gbit IB card, and the other was a newer ConnectX-4-based 100 Gbit card … and the ConnectX-4-based card supports "extended IP over IB".

The second key piece of information is that the Mellanox OFED driver defaults for IPoIB have changed from version 3 to version 4.

With the current version of Mellanox OFED, if you have ONLY ConnectX-4 or newer cards, the driver will by default enable "Extended IP over IB", which enables the full set of hardware-assisted offloads for the IP stack present in the card's ASIC (the card also supports Ethernet). However, enabling "Extended IP over IB" mode disables "Connected mode" … so the IP over IB interface is initialized in datagram mode with a 4K MTU. The Mellanox switch is set up for a 4K MTU.

This same version of Mellanox OFED, running with a ConnectX-3 card (which does NOT support the "Extended IP over IB" mode), will default to Connected mode with the 64K MTU, not datagram mode.

This all happens when the setting in the openib.conf file is "SET_IPOIB_CM=auto", which is the default.

You can verify the operating mode by cat'ing the file /sys/class/net/ib{x}/mode.
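
For example (ib0 here … the file reports either "datagram" or "connected"):

cat /sys/class/net/ib0/mode
connected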

So the next question is … what happens when there is both a ConnectX-3 and ConnectX-4 card in the system?

The defaults applied when you have "SET_IPOIB_CM=auto" appear to be the lowest common denominator … and since the ConnectX-3 card is incapable of supporting "extended IPoIB" mode, "extended IPoIB" mode is disabled, and the default connection mode for ALL of the Mellanox cards is then set to "connected" mode with the 64K MTU … not datagram mode.

Well, all the other nodes on the IB network had (only) ConnectX-4 cards, and had defaulted to “extended IPoIB” enabled and datagram mode, with full offloads … and the 4K MTU.

This new system with the dual Mellanox cards came up in connected mode … and could not establish a connection to any of the other nodes that were running in datagram mode.

The solution was to edit the /etc/infiniband/openib.conf file and change the setting to "SET_IPOIB_CM=no", which initializes the interfaces in datagram mode, NOT connected mode. For the ConnectX-3 card there would be no "extended IPoIB" offloads, but on the ConnectX-4 card's ports "extended IPoIB" would be enabled.
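
A minimal sketch of the change (assuming the stock MLNX_OFED file locations … restarting openibd briefly takes down all of the IB interfaces):

sed -i 's/^SET_IPOIB_CM=.*/SET_IPOIB_CM=no/' /etc/infiniband/openib.conf
/etc/init.d/openibd restart     # or: systemctl restart openibd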

The key observation was that when the ib{x} ports came up (in auto/connected mode), they displayed an MTU of 64K (65520), and if you tried to decrease the MTU using ifconfig or "ip" you could not. If you checked /sys/class/net/ib{x}/mode, you would see the port was in connected mode, which locks you into the 64K MTU.

When I configured SET_IPOIB_CM=no, the ib{x} interfaces came up in datagram mode with the 4K MTU, and IP-based connections could be made with the other nodes on the IB network, which were all running in datagram mode.
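
A quick sanity check after the change (ib0 as an example … with a 4K fabric MTU the IPoIB datagram MTU shows up as 4092, i.e. 4096 minus the 4-byte IPoIB header):

cat /sys/class/net/ib0/mode
datagram
cat /sys/class/net/ib0/mtu
4092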

So … beware of "auto" settings in the openib.conf file if you are running systems with different generations of Mellanox IB controllers. The value you get for "auto" on a ConnectX-4 card can be different from what you get on older generations … resulting in settings that may not interoperate.

If you have multiple generations of Mellanox cards (and perhaps multiple versions of OFED drivers) … I would recommend NOT using any "auto" settings, and explicitly specifying the value you want for each parameter.

Also … from our preliminary testing … IPoIB running in datagram mode with the 4K MTU and "extended IPoIB" mode enabled (full offloads) on ConnectX-4 cards, on 4.5 GHz Intel Skylake i9 CPUs, gave single-stream socket performance of 35-45 Gbit/s with iperf2 after running mlnx_tune. Changing to connected mode with the 64K MTU (which requires disabling Extended IPoIB via the module parameter) resulted in higher performance … typically about 10 Gbit/s higher, or 45-55 Gbit/s.
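
For reference, the numbers above came from simple iperf2 socket tests along these lines (illustrative invocations only, not the exact command lines we used … <server-ib-ip> stands for the IPoIB address of the receiving node):

# on the server
iperf -s
# on the client … one 30-second stream
iperf -c <server-ib-ip> -t 30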

mlnx_tune does NOT set up any MSI-X IRQ affinity (which would help performance further), and it leaves irqbalance running. mlnx_tune DOES set the irq_affinity_hints … which some very new versions of irqbalance will honor … but the irqbalance bundled with RHEL/CentOS 7.x (a Linux 3.10 kernel) does NOT honor irq_affinity_hints. The result is NO IRQ affinity after running mlnx_tune. We have not yet tested setting the IRQ affinity correctly.
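
If you do want to pin the IRQs by hand, the usual approach is something like the following (a sketch only … MLNX_OFED ships helper scripts such as set_irq_affinity.sh, but the script names and locations can vary by OFED version):

systemctl stop irqbalance
systemctl disable irqbalance
set_irq_affinity.sh ib0     # spreads the interface's mlx5 IRQs across CPU cores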

more …

Hi David,

As mentioned in the support case you opened:

Based on the information you provided, we see that the physical link of ib2 and ib3 is down. That means either no cable is connected or the switch port is disabled. For the following instructions to work, you need to make sure that the physical link is up.

If the physical link is up during the installation of the OS, the installer will create the 'ifcfg-*' files. If you want to recreate the 'ifcfg-*' files after installation, you can do so with the NetworkManager CLI commands.

To see which devices currently exist in the OS:

nmcli device

This will provide a list of devices, as in the example below:

nmcli device | awk 'NR==1 || /ib/'

DEVICE TYPE STATE CONNECTION

ib0 infiniband disconnected --

ib1 infiniband unavailable --

To regenerate the configuration file for ‘ib0’ (ifcfg-ib0), execute the following command:

nmcli device connect ib0

ls -l /etc/sysconfig/network-scripts/ifcfg-ib0

-rw-r--r--. 1 root root 363 May 24 17:38 ifcfg-ib0

cat /etc/sysconfig/network-scripts/ifcfg-ib0

HWADDR=80:00:00:85:FE:80:00:00:00:00:00:00:7C:FE:90:03:00:33:69:CC

CONNECTED_MODE=no

TYPE=InfiniBand

PROXY_METHOD=none

BROWSER_ONLY=no

BOOTPROTO=dhcp

DEFROUTE=yes

IPV4_FAILURE_FATAL=no

IPV6INIT=yes

IPV6_AUTOCONF=yes

IPV6_DEFROUTE=yes

IPV6_FAILURE_FATAL=no

IPV6_ADDR_GEN_MODE=stable-privacy

NAME=ib0

UUID=0a9dcfac-a81f-433d-9e4d-c7a42795322d

DEVICE=ib0

ONBOOT=yes

If for some reason the 'ifcfg-*' file is not generated, please remove the connection first with the 'nmtui' UI, and repeat the previous steps.
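
Alternatively, if NetworkManager never creates the profile, you can add one explicitly with nmcli (example for ib2 … on RHEL/CentOS 7 the ifcfg-rh plugin should then write the 'ifcfg-ib2' file, including a generated UUID):

nmcli connection add type infiniband con-name ib2 ifname ib2 transport-mode datagram
ls -l /etc/sysconfig/network-scripts/ifcfg-ib2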

Many thanks.

Cheers,

~Mellanox Technical Support

continued …

So the Mellanox OFED documentation's suggestion that higher performance is achieved by enabling "extended IPoIB" mode (which enables full offloads but forces datagram mode with a maximum MTU of 4K) is NOT correct under RHEL/CentOS 7.x … even after running mlnx_tune.

For RHEL/CentOS 7.x, connected mode with the 64K MTU and with "extended IPoIB" disabled yields roughly 25-35% faster single-stream performance … and can easily saturate 100 Gbit/s with 3 streams from iperf2.

I suspect that if the IRQ affinities were properly set, host I/O stack processing would become more efficient, and the value of the hardware offloads and coalescing (even with the smaller MTU) would be more significant.

mlnx_tune is "broken" for RHEL/CentOS 7.x because mlnx_tune sets up irq_affinity_hints … expecting that irqbalance will read the hints and set the IRQ affinity appropriately. Unfortunately, the in-box irqbalance for RHEL/CentOS 7.4.1708 does NOT support irq_affinity_hints properly.
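
You can see the mismatch directly by comparing the hints mlnx_tune wrote against the affinity actually in effect (a quick sketch … it assumes the mlx5 interrupts are labeled with "mlx5" in /proc/interrupts, as they are with this driver):

awk '/mlx5/ { sub(":", "", $1); print $1 }' /proc/interrupts | while read irq; do
  echo "irq $irq  hint=$(cat /proc/irq/$irq/affinity_hint)  actual=$(cat /proc/irq/$irq/smp_affinity)"
done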

We have not yet tested Extended IPoIB in datagram mode with the 4K MTU and proper IRQ affinity.

This lower performance with Extended IPoIB is unfortunate because the socket-bypass message accelerator software "VMA" requires Extended IPoIB mode, using datagram mode, not connected mode.

Thank you for your replies and trying to help.

Also …

Please note that the "CONNECTED_MODE=" setting in the /etc/sysconfig/network-scripts/ifcfg-ib{x} file ONLY has an effect if "Extended IPoIB" mode is NOT enabled. This means the /etc/modprobe.d/ib_ipoib.conf file should have the line:

options ib_ipoib ipoib_enhanced=0

The default for this setting is "1" (enabled) for Mellanox HCA cards that support the feature … which forces datagram mode on a ConnectX-4-based card (or newer). If the module option ipoib_enhanced is enabled … the ifcfg-ib{x} "CONNECTED_MODE=" setting has NO effect.
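
To disable it explicitly, something along these lines should do it (the ib_ipoib module has to be reloaded for the option to take effect … restarting openibd, or rebooting, does that and drops IPoIB connectivity while it happens):

echo "options ib_ipoib ipoib_enhanced=0" > /etc/modprobe.d/ib_ipoib.conf
/etc/init.d/openibd restart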

According to the documentation, if the hardware (link-layer) address starts with several leading zero fields, enhanced IPoIB mode is enabled. If the address looks typical … such as starting with a0:00:02:08:fe:80:00:00:00:00:00:00 … normal, non-enhanced IPoIB mode is in use … which allows you to switch between connected and datagram mode … and then the ifcfg-ib{x} "CONNECTED_MODE=" setting does affect how the port is initialized.
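
A quick way to check which flavor a port is running (ib0 as an example … the 20-byte link-layer address is on the "link/infiniband" line, and its leading fields can be compared against the two patterns described above):

ip link show ib0 | grep link/infiniband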

Regards,

Dave B