OmniOS + RSF-1 + InfiniBand

Hi,

I have a problem with an HA storage solution that has been sitting in 'development' for a very long time. We have set up a 2-node OmniOS system with 3 disk racks and 2 ZFS pools, one running from each node. RSF-1 has been set up on the system. We are using 2 IS5022 switches, with the subnet manager running on a dedicated server.

The storage is used to supply 2 Windows 2012 R2 servers (an SQL Server cluster) and an ESXi 5.5 cluster. We have tested NFS (IPoIB), iSCSI (IPoIB) and SRP for ESXi, and iSCSI (IPoIB) for Windows.

Connections come up and the system runs, but the problem comes when a failover of a pool from one node to the other occurs: the link doesn't always come back up. If the pool is failed back, data starts flowing again as soon as it is imported and the interfaces are reconfigured.

The failover takes approximately 30 seconds. If there are no VMs running on the datastore, the likelihood of the datastore coming back online appears to be greater. With Windows it is also random whether it comes back up. I have increased the timeouts on all the systems to ensure that they are not dropping the LUN (all paths down).

Is there a reason this could be happening? Could it be that the subnet manager needs prompting to update the links (I don't know much about what the subnet manager really does and when), could it be a driver issue where the ESXi server isn't updating its equivalent of the ARP table, or could it be a switch problem?

From what I remember (I haven't done much work on the system in a while), even after rebooting the ESXi servers the datastores aren't guaranteed to be available, so it doesn't appear to be ESXi or timeout problems on the clients. Any testing needed can be done; as I have said, this is in development at the moment.

I am going to test the system using just gigabit Ethernet to see how that goes. If that works, it is down to InfiniBand.

Any help on getting this resolved would be greatly appreciated.

Any info that you need I can provide.

Thanks

David

I assume you have dual port IB cards.

You should:

  • connect the IB card to both switches

  • define a pkey per subnet (we have the default plus 4 pkeys, from 60 to 63)

Example for partition.conf:

Default partition:

Default=0x7fff, ipoib, mtu=4 : ALL=full, SELF=full ;

Naming convention for the keys, e.g. key60=0x803C, where 60 (decimal) = 3C (hexadecimal):

key60=0x803C, ipoib, mtu=4 : ALL=full;

key61=0x803D, ipoib, mtu=4 : ALL=full;

key62=0x803E, ipoib, mtu=4 : ALL=full;

key63=0x803F, ipoib, mtu=4 : ALL=full;

  • define a virtual switch in ESX, standard or distributed does not matter

  • define in ESX the first port as active and the second port as standby (this is important)

  • define a portgroup per subnet and use the pkey as the VLAN ID (a rough esxcli sketch follows below)
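
If you script it with esxcli, it looks roughly like this; the vSwitch, portgroup and vmnic names are only placeholders, and the VLAN ID 60 corresponds to key60 (the pkey 0x803C without the membership bit):

# one portgroup per subnet on the IPoIB vSwitch, tagged with the pkey as VLAN ID
esxcli network vswitch standard portgroup add -p pg_storage60 -v vSwitch_IB
esxcli network vswitch standard portgroup set -p pg_storage60 --vlan-id 60
# first IB uplink active, second standby
esxcli network vswitch standard policy failover set -v vSwitch_IB -a vmnic_ib0 -s vmnic_ib1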

This works for us in our clustered storage setup with Solaris 11.2 and corosync/pacemaker.
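
On the Solaris side each pkey then gets a partition datalink over the physical IB port, roughly like this (the physical link and partition names are only examples, following the key60 entry above):

# one partition datalink per pkey and physical port
dladm create-part -P 0x803C -l net5 storage_60_0
dladm create-part -P 0x803C -l net6 storage_60_1
dladm show-part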

We use 3 subnets: storage, vmotion, backup.

Each of the 4 IB switches is connected to 2 other switches, forming a mesh of 4 switches.

If you use just one port per subnet you disable failover between ports.

And I guess you would like to use redundancy and automatic failover if you use 2 switches.

You may try to change the order of active/standby per portgroup=subnet=VLAN if you want to have traffic over all ports.

And we use NFS over IPoIB rather than iSCSI or SRP, with datastore sizes between 5 and 170 TB.

NFS is simpler and fast enough with IPoIB.

Andreas

Hi,

Thanks for replying. It's not, from what I understand, and if it was, I would think it would still function; the communication for RSF-1 is over Ethernet. The storage network is using InfiniBand: IPoIB for iSCSI on Windows (as SRP support was removed in 2012), and NFS for ESXi, because the VMs are exposed in the snapshots, which makes VM recovery from snapshots easier. SRP was also tested but didn't provide much greater performance overall.

The storage nodes have a management network over Ethernet, which is also used for heartbeats; they also have a serial link for heartbeat and a set of quorum disks. Failover in these tests was manually initiated, which unmounts the pool, removes the network configuration, remounts it on the other node and reconfigures the network.
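
Roughly, the manual equivalent of those failover steps would be something like the following (just a sketch, not the actual RSF-1 scripts; the pool name, address object and address are examples):

# on the node giving up the pool
ipadm delete-addr storage0/v4
zpool export pool1
# on the node taking over
zpool import pool1
ipadm create-addr -T static -a 10.200.46.10/25 storage0/v4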

The InfiniBand network config and subnet manager could be the problem.

Each system has 2 ports, with 1 port from each connected to an independent switch (no link between them), so port 1 on all systems goes to switch 1 and port 2 to switch 2.

Port 1 is on subnet 10.200.46.0/25 and port 2 is on 10.200.46.128/25 (IPoIB).

But both subnets have the same pkey. The IPoIB release notes say that different subnets need different pkeys if they are on the same switch, otherwise ARP updates may produce an incorrect route. These are not on the same switch, but I thought that could be the problem. However, checking the ARP updates on ESXi appears to show the IP addresses moving over to the correct MAC and using the correct interface.

There are 2 ESXi NFS datastores running over the different subnets, datastore 1 over subnet 1 and datastore 2 over subnet 2.

The subnet manager could also be a problem; reading different configs appears to show different notations, so I am not sure which is correct. The current partition config is: Default=0xffff,ipoib,rate=7,mtu=5,defmember=full:ALL; The subnet manager logs also produce some errors multiple times:

583281 [23991700] 0x01 → __osm_mcmr_rcv_join_mgrp: ERR 1B10: Provided Join State != FullMember - required for create, MGID: ff12:401b:ffff::2 from port 0x0002c903002af1cf (MT25408 ConnectX Mellanox Technologies)

558044 [23190700] 0x02 → osm_report_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:9 GID:fe80::2:c903:2a:f16f

I have set all the systems to use an MTU of 4K; Debian (where the subnet manager runs) didn't want to go above 2K until it was set into connected mode. I was going to set everything back to the 2K default to be sure that's not causing a problem.

I have 2 setups that I was going to try. The first is to keep the layout the same as now, but use 2 pkeys, 1 for each subnet, and change the MTU back to default, and see if that works. The next is to use only 1 subnet and pkey and link the 2 switches together, again with a 2K MTU. The latter is not the recommended config for an iSCSI network with multiple paths, so it isn't really wanted.
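
For the first option the partitions.conf would look something like this, following the notation from the example earlier in the thread (the pkey values are just picked for illustration; in OpenSM's encoding mtu=4 means 2048 bytes and mtu=5 means 4096):

Default=0x7fff, ipoib, mtu=4 : ALL=full, SELF=full ;
# one pkey per IPoIB subnet, back on the 2K default MTU
subnet1=0x8001, ipoib, mtu=4 : ALL=full;
subnet2=0x8002, ipoib, mtu=4 : ALL=full;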

Any additional help with this would be appreciated, and I will try to provide any additional info you may need.

Thanks

I assume the failover is triggered by ARP calls by RSF? This would explain why Ethernet works as expected.

If this is the case, I am not sure if a solution exists.

You may need to investigate a different storage solution such as Ceph, which can handle failures itself and also supports the InfiniBand fabric.

Additional info.

I have tested with Ethernet: same setup, just using Ethernet instead, and failover works perfectly. A few seconds with the datastore not contactable and then it comes back online.

Other testing done with InfiniBand: when 1 pool/datastore is failed over to the other node, if it does come up, it will most likely keep working most of the time. But if you fail over a second pool, that second pool will not work initially. If, after around 5 minutes, it does come up as visible from the other node, then the pool that was running OK will almost always fall off and no longer be visible, even though nothing has actually been done with it. Then, sometimes after again 5 minutes or so, it will come back.

I thought that it may have been ARP updates being the problem, but looking at the ARP tables on the ESXi servers shows that they are updating correctly.
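
For reference, this is roughly how I am checking it on the ESXi hosts (vmk1 is just an example interface name):

# neighbour/ARP cache as seen by the VMkernel
esxcli network ip neighbor list
# the vmkernel interface carrying the IPoIB storage traffic
esxcli network ip interface ipv4 get -i vmk1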

I remember coming across this exact problem a long time ago setting up a similar config.

I went back through some old correspondence and indeed found that it was ARP requests not being honoured by the InfiniBand HCAs that was the root cause of the problem.

Apparently whilst IB cards do have ARP functionality, they do not honour the command to update their ARP for any given IP address.

This contradicts your experiences as you suggest that you can see the ARP updates on the interface so I am a little confused.

I would not think that the subnet manager would become involved at all in this scenario as the LID assignment would only ever change on a reboot.

Have you tried using connected mode? Packet headers are slightly different compared to datagram mode, and it is worth a shot.
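
On a Linux host (your Debian subnet manager box, for example) the IPoIB mode can be checked and switched per interface as below; ESXi and illumos expose this differently, so treat it only as an illustration:

# show the current IPoIB mode (datagram or connected)
cat /sys/class/net/ib0/mode
# switch to connected mode, which also allows a much larger MTU
echo connected > /sys/class/net/ib0/mode
ip link set ib0 mtu 65520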

RFC 4391: Transmission of IP over InfiniBand (IPoIB)

Hi,

Thanks for the help, I will give it a go when I'm back at work next week; off for Easter now.

You are correct, I am using a dual-port card.

Thanks again for the information; it is good to know that you are running something very similar to mine which is working. Hopefully I will be able to get it working next week and report back. If not, hopefully you will be willing to provide a little more help.

I decided on NFS for ESXi for the same reason: simpler, with easier access to the VMs in the snapshots.

Thanks again.

Hi,

Thanks, I have finally set this up now. It appears to be working: failover occurs and the datastores come back up after a few seconds, so it is looking good.

I have a couple of additional questions for you. For the link failover on the storage side, are you using IPMP?

Have you also worked with Windows on this storage? I have 2 servers in a cluster (MSSQL) connected. The failover again works OK for Windows in terms of the storage being available, but randomly the storage is reported offline in the cluster manager after the pool is back online (more often if data is being written) and has to be manually enabled again. This occurs the moment the storage is available again; before that it reports OK. Just wondering if you had seen this; it could be some of the timeouts I had been messing around with earlier to try to fix the previous problem.

Thanks

Hi David

Glad to hear that you got it into a working condition, at least for non-Windows.

The failover in Solaris is realized with an IPMP group containing the datalinks:

storage_64_0 803D net5 up ----

storage_64_1 803D net6 up ----

ipadm add-ipmp -i storage_64_0 -i storage_64_1 storage61

ipadm create-addr -T static -a 192.168.61.64/24 storage61/v4addr

ipadm set-ifprop -p standby=on -m ip storage_64_1

You may set one of the datalinks as standby. Check out how it works for your installation.
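
You can watch the group and interface state during a link failover with ipmpstat, for example:

ipmpstat -g   # group state and failure detection time
ipmpstat -i   # per-interface state (active/standby)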

Some years ago we tried to use InfiniBand with Windows 2003. There was no multipathing available, so we stopped using it with Windows.

And we have not used it with the cluster functionality of Windows. You should check the SCSI requirements of Windows clustering, which must be supported by the storage. If my memory serves well, there were issues with SCSI-3 reservations.
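
One way to check that from the Windows side is the cluster validation report, which includes the persistent reservation tests; roughly, in PowerShell (the node names are placeholders):

# run only the storage validation tests against the cluster nodes
Test-Cluster -Node sql1,sql2 -Include "Storage"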

Andreas