How to configure host chaining for ConnectX-5 VPI

Hello,

I have four nodes, each equipped with a dual-port ConnectX-5 VPI adapter card. If I understand the product brief correctly, these support a switchless cluster setup in a ring topology (aka “host chaining”). Unfortunately, I could not find any documentation on how to configure the cards and Subnet Manager (SM) instances to connect all nodes of the cluster without an intermediate switch.

Could you tell me if there is any documentation on that topic? Or do you have further information regarding this?

Cheers,

Simon

Hi Simon,

Question: do you want to use Storage Spaces Direct in Windows Server 2016 with it? That is at least the problem I am trying to solve.

Cheers Carsten Rachfahl

Microsoft Cloud & Datacenter Management MVP

You’re welcome!

I’m glad I helped someone after all the headache I went through for it.

I have no hands-on experience with VMware, so take all of this with a grain of salt.

My first thought is VLAN tags. I was told that VMware tags traffic by default.

From my (limited) understanding and thoughts, host chaining inside VMware is not a good idea.

If you set up a virtual switch (on the VMware side), put both ports of the card on that switch, and give the switch an IP, that would allow vMotion and the like over the link at close to line speed, letting the switch (analogous to Open vSwitch) do all of the routing and fast-pathing.

Thoughts - If there was host chaining:

VMware still sees both ports (we can’t assign IPs to the raw port interfaces to start with).

It doesn’t really know which port to send out of, so traffic could take an extra hop before it reaches its destination.

With three nodes, traffic intended to go A → B might take the path A → C → B.

What I can speak to is non-chaining speed.

We did try using Open vSwitch and the cards with chaining off. So long as STP is turned on, we got nearly line speed.
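For anyone trying the same setup, the Open vSwitch side amounts to something like this (a sketch; the bridge name br0 and the interface names enp1s0f0/enp1s0f1 are only examples, yours will differ):

ovs-vsctl add-br br0
ovs-vsctl add-port br0 enp1s0f0
ovs-vsctl add-port br0 enp1s0f1
ovs-vsctl set bridge br0 stp_enable=true
ovs-vsctl get bridge br0 stp_enable (should print true)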

We opened a support ticket for our problems with MTU. It took a while, but we found the problem.

They have a nice little utility (sysinfo-snapshot) for dumping the card internals and OS configuration options, and looking through its output helped us.

See my post below. Host_chaining is not supported on ESXi at this time.

Ah, that diagram looks right: all on the same subnet, and all connected in a correct ring.

If I had to take a guess, lower the MTU back to 1500 on all the nodes (both interfaces): ifconfig ib0 mtu 1500 ; ifconfig ib1 mtu 1500
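Or, if you prefer iproute2 over ifconfig, the equivalent would be something like this (ib0/ib1 here only because that’s what you used; substitute your interface names):

ip link set dev ib0 mtu 1500
ip link set dev ib1 mtu 1500
ip link show dev ib0 | grep mtu (verify the change took)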

We had issues with a high MTU throwing host_chaining into a weird packet-drop situation, which looks like what might be happening here. They said it was fixed in a newer FW, but I wasn’t able to fully test and confirm the fix.

If that doesn’t work, I’m out of ideas. Support will give you a script to run on all the nodes, and that’d be my next action. There is a lot of useful information in that report, so it is worth a look before you send it off.

I’ve been disappointed with Mellanox with regard to documentation on any part of this feature.

Disappointment also on my side :-(

But thank you so much for your help.

No, it’s turned on, and I’m not running ESXi; I’m running Debian 9.5. Here’s my setup:

Node PVE1

Port1: 172.31.31.11/24 - connected to PVE2 Port2

Port2: 172.31.31.21/24 - connected to PVE4 Port1

root@pve1:~# mlxconfig q | grep HOST_C

HOST_CHAINING_MODE BASIC(1)

HOST_CHAINING_DESCRIPTORS Array[0…7]

HOST_CHAINING_TOTAL_BUFFER_SIZE Array[0…7]

Node PVE2

Port1: 172.31.31.12/24 - connected to PVE3 Port2

Port2: 172.31.31.22/24 - connected to PVE1 Port1

root@pve2:~# mlxconfig q | grep HOST_C

HOST_CHAINING_MODE BASIC(1)

HOST_CHAINING_DESCRIPTORS Array[0…7]

HOST_CHAINING_TOTAL_BUFFER_SIZE Array[0…7]

Node PVE3

Port1: 172.31.31.13/24 - connected to PVE4 Port2

Port2: 172.31.31.23/24 - connected to PVE2 Port1

root@pve3:~# mlxconfig q | grep HOST_C

HOST_CHAINING_MODE BASIC(1)

HOST_CHAINING_DESCRIPTORS Array[0…7]

HOST_CHAINING_TOTAL_BUFFER_SIZE Array[0…7]

Node PVE4

Port1: 172.31.31.14/24 - connected to PVE1 Port2

Port2: 172.31.31.24/24 - connected to PVE3 Port1

root@pve4:~# mlxconfig q | grep HOST_C

HOST_CHAINING_MODE BASIC(1)

HOST_CHAINING_DESCRIPTORS Array[0…7]

HOST_CHAINING_TOTAL_BUFFER_SIZE Array[0…7]

Any ideas what I can look into?

Putting this out there since we had so many complications getting host chaining to work, and something Google will pick up is infinitely better than nothing.

The idea was that we wanted something with redundancy. With a switched configuration, we’d have to get two switches and a lot more cables, which is very expensive.

HOST_CHAINING_MODE seemed like a great fit: switchless, fewer cables, and less expense.

You do NOT need a subnet manager for this to work!

In order to get it working (a command sketch follows the list):

Aside: There is no solid documentation on this process as of this writing

  1. What Marc said was accurate: set HOST_CHAINING_MODE=1 via the mlxconfig utility.

Aside: Both the VPI and EN type cards will work with host chaining. The VPI type does require you to put it into Ethernet mode.

  2. Restart the servers to set the mode.

  3. Put all of the ports on the same subnet, e.g. 172.19.50.0/24. Restart the networking stack as required.

  4. From there, all ports should be pingable from all other ports.

  5. Set the MTU up to 9000 (see the caveats below for a firmware bug; lower it to 8000 if 9000 doesn’t work).

Aside: The MTU could probably go higher; I have been unable to test higher values due to a bug in the firmware. Around these forums I’ve seen 9000 floated, and it seems like a good standard number.
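To put the steps above in one place, here is roughly the per-node sequence (a sketch, not a copy of our notes: the mst device name mt4119_pciconf0 and the interface names enp1s0f0/enp1s0f1 will differ on your hardware, and the LINK_TYPE line is only needed on VPI cards):

mst start
mst status -v (find your device under /dev/mst/)
mlxconfig -d /dev/mst/mt4119_pciconf0 set LINK_TYPE_P1=2 LINK_TYPE_P2=2 (VPI cards only: force Ethernet)
mlxconfig -d /dev/mst/mt4119_pciconf0 set HOST_CHAINING_MODE=1
shutdown -r now (reboot to apply the new configuration)
ip addr add 172.19.50.11/24 dev enp1s0f0
ip addr add 172.19.50.21/24 dev enp1s0f1
ip link set dev enp1s0f0 mtu 9000
ip link set dev enp1s0f1 mtu 9000
ping 172.19.50.13 (every port on every node should answer, not just the directly attached neighbors)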

If you aren’t getting the throughput you’re expecting, do ALL of the tuning, both BIOS (Performance Tuning for Mellanox Adapters https://community.mellanox.com/s/article/performance-tuning-for-mellanox-adapters , BIOS Performance Tuning Example https://community.mellanox.com/s/article/bios-performance-tuning-example ) and software (Understanding PCIe Configuration for Maximum Performance https://community.mellanox.com/s/article/understanding-pcie-configuration-for-maximum-performance , Linux sysctl Tuning https://community.mellanox.com/s/article/linux-sysctl-tuning ), on all servers. It does make a difference: on our small (under-powered) test boxes, we gained 20 Gbit/s over our starting benchmark.
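As one concrete example of the software side, the Linux sysctl article mostly comes down to raising the network buffer limits, along these lines (the values are illustrative; use the article’s numbers rather than mine):

sysctl -w net.core.rmem_max=4194304
sysctl -w net.core.wmem_max=4194304
sysctl -w net.ipv4.tcp_rmem="4096 87380 4194304"
sysctl -w net.ipv4.tcp_wmem="4096 65536 4194304"
sysctl -w net.core.netdev_max_backlog=250000

Put the same settings in /etc/sysctl.conf if you want them to survive a reboot.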

Another thing to check is that you have enough PCIe bandwidth to support line rate; get the Socket Direct cards if you do not.
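A quick way to sanity-check the PCIe side is lspci; the bus address below is only an example, use whatever address your card shows up at:

lspci | grep -i mellanox (note the card’s bus address, e.g. 81:00.0)
lspci -s 81:00.0 -vvv | grep -E 'LnkCap|LnkSta' (LnkSta should match LnkCap; a 100G port wants something like 8GT/s x16)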

There are a lot of caveats.

  • The bandwidth that is possible IS link speed, but only between two directly connected nodes. In our tests there is a small dip in performance on each hop, and each hop also lowers your maximum theoretical throughput.
  • FW version 16.22.1002 had a few bugs related to host chaining; one of them was that the maximum supported MTU was 8150. A higher MTU means less IP overhead.
  • The ‘ring’ topology is a little funny: it forwards in only one direction. In a cable-cut scenario, it will NOT route around the break properly for certain hosts.

Aside: A cable cut is different from a cable disconnect. The transceiver itself registers whether a cable is attached or not. When there is no cable present on one side but there is on the other, the above scenario applies (routing does not work properly). When both ends of the cable are removed, the ring outright stops and does not work at all. I don’t have any data for an actual cable cut.

The ring works as described in the (scant) documentation; the behavior, taken from the firmware release notes, is as follows:

  • Received packets from the wire with a DMAC equal to the host MAC are forwarded to the local host
  • Received traffic from the physical port with a DMAC different from the current MAC is forwarded to the other port:
  • Traffic can be transmitted by the other physical port
  • Traffic can reach functions on the port’s Physical Function
  • The device allows hosts to transmit traffic only with their permanent MAC
  • To prevent loops, received traffic from the wire with an SMAC equal to the port’s permanent MAC is dropped (the packet cannot start a new loop)

If you run into problems, tcpdump is your friend, and ping is a great little tool to check your sanity.
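For example, to check whether frames are actually being forwarded around the ring, I’d watch ARP and ICMP with the link-level headers shown (the interface names and address are placeholders for your own):

tcpdump -e -n -i enp1s0f0 arp or icmp (run one per port, in separate terminals)
tcpdump -e -n -i enp1s0f1 arp or icmp
ping -c 3 172.19.50.13 (pick a node that is NOT directly attached; with chaining working you should see the request leave one port and the reply come back with the far node’s MAC)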

Hope any of this helps anyone in the future,

Daniel

Just some due diligence here.

We put our ConnectX-5 cards in our three-host VMware 6.5 stack and did not get host_chaining to work. We ended up contacting support about it, and the reply we got wasn’t optimistic.

“Host-chaining is currently not supported as it is not GA for ESXi.”

So my previous post should be taken with a grain of salt, and I’ve marked it accordingly.

I have yet to see any documentation on host_chaining specifically, which is really sad since, as far as I know, my post above is the best available.

Hi Simon,

Please check that you have the latest firmware version installed (16.22.1002).

ibv_devinfo will give you your FW version.

Start MST

mst start

mst status -v (to see your current device)

mlxconfig -d /dev/mst/ set HOST_CHAINING_MODE=1
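For example, if mst status reports your card as mt4119_pciconf0 (device names vary per system), the full sequence would be:

mst start
mst status -v (to see your current device)
mlxconfig -d /dev/mst/mt4119_pciconf0 query | grep HOST_CHAINING (check the current setting)
mlxconfig -d /dev/mst/mt4119_pciconf0 set HOST_CHAINING_MODE=1

Reboot the server afterwards so the new configuration takes effect.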

See the release notes of the FW here:

http://www.mellanox.com/pdf/firmware/ConnectX5-FW-16_22_1002-release_notes.pdf

BR

Marc

Hi Marc,

Thanks for your answer! The document you refer to states that

Both ports should be configured to Ethernet when host chaining is enabled.

Is there any way to connect the four nodes without a switch using native InfiniBand?

Best regards,

Simon

Hi,

Only Ethernet is supported.

Marc

Hi Daniel,

I wanted to thank you for these directions; they were very helpful. I was successful in linking three nodes together, all running Ubuntu 18.04, and was able to get ~96 Gb/s between all the hosts using iperf2. I then took one of the boxes, loaded ESXi 6.7, and configured the same IP addresses on the two interfaces I had before. The VMware box cannot communicate with the others now, while the other Ubuntu boxes can still communicate through the NIC. When I run a tcpdump on the ESXi host I see the ARP request getting created, but there is no response. I am wondering if you have any idea why the chaining feature does not seem to work with ESXi?
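In case it helps anyone reproduce the numbers, the measurement itself is just a basic iperf2 run, something like this (the server address is a placeholder, and you may need several parallel streams to get near line rate):

iperf -s (on the receiving node)
iperf -c 172.31.31.12 -P 8 -t 30 (on the sending node)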

Thanks

Shawn

Hi,

I have a problem pinging between the NICs; this is my configuration:

SERVER 1: PORT1: 192.168.10.10 PORT2: 192.168.10.11

SERVER 2: PORT1: 192.168.10.12 PORT2: 192.168.10.13

SERVER 3: PORT1: 192.168.10.14 PORT2: 192.168.10.15

mlxconfig -d mt4119-pciconf0 set LINK_TYPE_P1=2 LINK_TYPE_P2=2

mlxconfig -d mt4119-pciconf0 set HOST_CHAINING_MODE=1

mlxfwreset --device mt4119_pciconf0 reset

All commands work perfectly, but only directly interconnected ports can ping each other; I need all ports to be able to ping all other ports.

Is my configuration correct?

That config looks correct. I’m being that guy… I’d be tempted to do a full machine restart.

Make sure you’ve issued those commands on the other servers and done a restart to solidify the config.

I haven’t used the mlxfwreset command, but looking at the docs, without a level argument it only performs the lowest reset level the adapter supports.

A physical ‘shutdown -r now’ has always worked for me.
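If you do want to try mlxfwreset rather than a full reboot, I’d start by querying what the tool supports (the device name below is hypothetical, borrowed from the earlier posts):

mlxfwreset -d /dev/mst/mt4119_pciconf0 query (lists the reset levels the adapter/driver combination supports)
mlxfwreset -d /dev/mst/mt4119_pciconf0 --level 3 reset

But again, a plain reboot is the only thing I can say has worked for me.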

It still does not work.

What drivers are you using?

Hi Daniel,

I went through all those steps, but HOST_CHAINING still isn’t working for me. Any additional ideas I can try?

What I noticed is: sending a ping from A to B looks like the following. The ICMP request is sent correctly from A to B, but B’s ARP request (sent before the ICMP answer) moves down the line from B to C, and C discards it.

To me it looks like HOST_CHAINING is still not working. At the same time, I have no clue what to do next.

From what I gather, you might not have host_chaining enabled on C, or you might be using VMware.

Host chaining is all done on-card, and so the host kernels are not aware of it.

Since chaining works based on the destination MAC, if C doesn’t have chaining on, C will see that the packet wasn’t meant for it and not bother replying/rejecting/dropping/forwarding the packet.

With chaining on, the ASIC on C’s card will forward it without sending it to the kernel. The host won’t even know there was a packet to start with.

Something else I might look at is the ARP tables. Could it be possible that, with the other tests, the table got poisoned? I haven’t seen it happen, but host_chaining is something else…
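If you want to rule that out, check and clear the neighbor table on the chained interfaces and retry the ping while tcpdump is running (the interface name is a placeholder):

ip neigh show dev enp1s0f0 (look for stale or wrong MAC entries for the other nodes)
ip neigh flush dev enp1s0f0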