IP over Infiniband @ FreeBSD 11.2: fatal Kernel trap 12 after packet length >2044 bytes in connected mode

Hello all,

We have a problem on a FreeNAS iSCSI server (release is FreeNAS-11.2-U5).

We have a dual-port Mellanox ConnectX-3 InfiniBand card which is connected to an InfiniBand switch (Grid Director 4036).

We have three Proxmox cluster nodes connected to the switch, which are running Proxmox VE 5.4-13.

The Infiniband cards on both ends are configured for connected mode with an MTU of 40950.

We are using a multipath setup with two subnets for IP over Infiniband.

This is working and we get a throughput of ~1-1.1 Gigabyte per second on each cluster node in parallel.

Sporadically we get a kernel trap on the FreeNAS server, which then reboots.

This happens at intervals ranging from hours up to 4 days.

The VMs do not crash; they stall until the FreeNAS server is online again.

Nevertheless we have to fix it.

The root cause is a packet over ipoib with a length >2044 bytes.

[Attached screenshot: FreeBSD-kernel-trap_packetSizeProblem]

We are wondering where it comes from.

In the Linux kernel documentation on IP over Infiniband we found this:

In datagram mode, the IB UD (Unreliable Datagram) transport is used and so the interface MTU is equal to the IB L2 MTU minus the IPoIB encapsulation header (4 bytes). For example, in a typical IB fabric with a 2K MTU, the IPoIB MTU will be 2048 - 4 = 2044 bytes.

In connected mode, the IB RC (Reliable Connected) transport is used. Connected mode takes advantage of the connected nature of the IB transport and allows an MTU up to the maximal IP packet size of 64K, which reduces the number of IP packets needed for handling large UDP datagrams, TCP segments, etc. and increases the performance for large messages.

In connected mode, the interface’s UD QP is still used for multicast and communication with peers that don’t support connected mode. In this case, RX emulation of ICMP PMTU packets is used to cause the networking stack to use the smaller UD MTU for these neighbours.
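The arithmetic in the quoted documentation can be checked with a minimal shell sketch. The helper names `ipoib_ud_mtu` and `exceeds_ud_mtu` are illustrative only and do not belong to any existing tool; the 4-byte header and the 2048-byte IB MTU are taken from the quote, and 4188 is the packet length asked about further down.

```shell
#!/bin/sh
# IPoIB encapsulation header size in bytes, per the quoted documentation.
IPOIB_HEADER=4

# Print the datagram-mode (UD) IPoIB MTU for a given IB L2 MTU.
# ipoib_ud_mtu is an illustrative helper name, not an existing tool.
ipoib_ud_mtu() {
    echo $(( $1 - IPOIB_HEADER ))
}

# Succeed if a packet of length $1 does not fit in the UD MTU
# derived from the IB L2 MTU $2.
exceeds_ud_mtu() {
    [ "$1" -gt "$(ipoib_ud_mtu "$2")" ]
}

ipoib_ud_mtu 2048    # prints 2044

if exceeds_ud_mtu 4188 2048; then
    echo "4188 bytes does not fit in the datagram-mode MTU"
fi
```

This only illustrates why a 4188-byte packet is a problem if it ends up on the UD path; it does not explain how the packet got there.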

Since the overall performance is good, we assume that we have a problem with multicast packets here. Is that plausible?

Does anybody have a hint where we could look for the root cause?

Does anybody have an idea where the 4188-byte packet comes from?

Thank you in advance.

Hello Ralf,

Many thanks for posting your question on the Mellanox Community.

Unfortunately, Mellanox does not provide support for the FreeNAS BSD distribution. Support needs to be obtained through the FreeNAS/FreeBSD community.

Based on the information provided, we recommend setting the MTU to 65520, which is the default MTU size for connected_mode. You mentioned you already set the adapters to connected_mode, so setting this MTU size will not affect anything.

We also recommend updating to the latest release, which is U5, as FreeNAS is rapidly evolving with bug fixes and new features.

Lastly, please make sure to update the adapter with the latest f/w version available.

Many thanks,

~Mellanox Technical Support

Hello all,

Thank you for your answers.

FreeNAS is 11.2-U5, no further updates are available, same on Proxmox side.

All Infiniband cards are on the latest FW version available.

We also tried different IB hardware (ConnectX-2, ConnectX-3) and firmware revisions without any difference.

The behavior is always the same. At the moment we don't believe that we have a firmware or hardware revision issue.

With an MTU of 65520 we got SCSI errors on the client side (VMs) while testing heavy I/O throughput in parallel on different Linux VMs on all three Proxmox test cluster nodes.

With the smaller MTU of 40950 this problem was gone.

It is clear that you cannot support Linux or FreeBSD/FreeNAS, but nevertheless I will share what we found out. Maybe somebody has an idea or any hint.

As I cannot post a bigger message I will share a link to the FreeNAS forum instead:

https://www.ixsystems.com/community/threads/ip-over-infiniband-fatal-kernel-trap-12-after-packet-length-2044-bytes-in-connected-mode.78690/post-546946

The interesting finding is that the IP addresses keep changing between the link-layer addresses on the Proxmox clients, which seems to be the root cause.

Thank you in advance for your comments.

Regards,

Ralf

Hello all,

I think I have found an answer to the address change problem on the Linux clients:

I found a comment on embeddedlinux.org:

  • A Linux host replies to any ARP solicitation requests that specify a target IP address configured on any of its interfaces, even if the request was received on this host by a different interface. To make Linux behave as if addresses belong to interfaces, administrators can use the ARP_IGNORE feature described later in the section “/proc Options.”
  • Hosts can experience the ARP flux problem, in which the wrong interface becomes associated with an L3 address. This problem is described in the text that follows.

Other sources:

http://www.mellanox.com/related-docs/prod_gateway_systems/BXOFED_Release_Notes-1.5.1-1.3.6_for_Oracle.txt

- When multiple vNics are connected to the same network, hosts can experience the “ARP flux” problem, in which the wrong interface becomes associated with an L3 address (FM #87335).

Workaround:

Set the following kernel configuration parameters: include the following lines in /etc/sysctl.conf and reboot the machine:

net.ipv4.conf.all.arp_ignore=1

net.ipv4.conf.all.arp_announce=2

https://downloads.openfabrics.org/OFED/archive/ofed-1.4-daily/release/OFED-1.4-docs/ipoib_release_notes.txt

3. Known Issues
===============================================================================

1. If a host has multiple interfaces and

(a) each interface belongs to a different IP subnet,

(b) they all use the same InfiniBand Partition, and

(c) they are connected to the same IB Switch,

then the host violates the IP rule requiring different broadcast domains.

Consequently, the host may build an incorrect ARP table.

The correct setting of a multi-homed IPoIB host is achieved by using a different PKEY for each IP subnet.

If a host has multiple interfaces on the same IP subnet, then to prevent a peer from building an incorrect ARP entry (neighbor) set the net.ipv4.conf.X.arp_ignore value to 1 or 2, where X stands for the IPoIB (non-child) interfaces (e.g., ib0, ib1, etc). This causes the network stack to send ARP replies only on the interface with the IP address specified in the ARP request:

sysctl -w net.ipv4.conf.ib0.arp_ignore=1

sysctl -w net.ipv4.conf.ib1.arp_ignore=1

Or, globally,

sysctl -w net.ipv4.conf.all.arp_ignore=1

For the running kernel on each client I executed the following:

echo 1 > /proc/sys/net/ipv4/conf/all/arp_ignore

echo 2 > /proc/sys/net/ipv4/conf/all/arp_announce

echo 1 > /proc/sys/net/ipv4/conf/ib0/arp_ignore

echo 1 > /proc/sys/net/ipv4/conf/ib1/arp_ignore
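To confirm that the values took effect on the running kernel, they can be read back from the same files. The following is a minimal sketch: `show_arp_settings` and the `CONF_ROOT` override are illustrative names introduced here, and the interface names `ib0` and `ib1` are the ones used in the commands above.

```shell
#!/bin/sh
# Root of the per-interface IPv4 settings. The override exists only so the
# helper can be pointed at a different directory (an assumption of this sketch).
CONF_ROOT="${CONF_ROOT:-/proc/sys/net/ipv4/conf}"

# Print "<iface> <key> <value>" for arp_ignore and arp_announce on each
# given interface, or "missing" if the file cannot be read.
show_arp_settings() {
    for iface in "$@"; do
        for key in arp_ignore arp_announce; do
            file="$CONF_ROOT/$iface/$key"
            if [ -r "$file" ]; then
                echo "$iface $key $(cat "$file")"
            else
                echo "$iface $key missing"
            fi
        done
    done
}

show_arp_settings all ib0 ib1
```

With the settings above applied, one would expect `arp_ignore` to read 1 for `all`, `ib0` and `ib1`, and `arp_announce` to read 2 for `all`.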

I added the corresponding post-up lines to /etc/network/interfaces to make it permanent.
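The exact post-up lines are not shown in this thread. As a sketch only, assuming the usual ifupdown `post-up` syntax and the interface names `ib0` and `ib1` from the commands above, lines of the following shape could be generated and pasted into the matching `iface` stanzas. `post_up_lines` is a hypothetical helper name.

```shell
#!/bin/sh
# Print ifupdown "post-up" lines that reapply the ARP settings from the post
# whenever the given interface comes up. post_up_lines is an illustrative
# helper; the lines actually added by the author are not shown in the thread.
post_up_lines() {
    iface="$1"
    conf=/proc/sys/net/ipv4/conf
    printf '    post-up echo 1 > %s/all/arp_ignore\n' "$conf"
    printf '    post-up echo 2 > %s/all/arp_announce\n' "$conf"
    printf '    post-up echo 1 > %s/%s/arp_ignore\n' "$conf" "$iface"
}

# Emit the lines for both IPoIB interfaces named in the post.
for iface in ib0 ib1; do
    echo "# add to the 'iface $iface' stanza:"
    post_up_lines "$iface"
done
```

Setting the values in /etc/sysctl.conf, as in the quoted workaround, is the other common way to persist them. Per-interface keys such as `ib0` may not exist yet when sysctl.conf is applied early in boot, which is one reason to use post-up hooks instead.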

Hopefully the kernel trap 12 is gone now.

Previously I received an ARP address change message on the server side every 1 to 15 minutes. This is gone now.

There have been no such ARP messages for 2 hours.

Surprisingly, I could also switch to an MTU of 65520, which previously did not work without lots of connection errors on each client.

On one client there is still something I have to check. I had two events like this:

connection3:0: ping timeout of 5 secs expired, recv timeout 5, last rx last ping now

Sep 9 04:53:45 pvecn3 kernel: [453492.780173] connection3:0: detected conn error (1022)

Sep 9 04:53:45 pvecn3 kernel: [453492.780363] scsi_io_completion: 10 callbacks suppressed

Conclusion:

FreeBSD/FreeNAS is doing a good job, Linux was the problem child.

Regards,

Ralf