How to connect two (small) fat tree networks

system · October 3, 2016, 9:22am

Hello,

we run two (small) fat tree IB networks in our datacenter. Now we would like to merge the two into a single IB fabric - but we don’t want to create a single fat tree. All we need is the bandwidth of 2-3 IB connections between the two fabrics. Is this possible?

Details:

Each existing fabric has two layers of switches: two upper layer switches and 3-4 lower layer switches. Most clients connect to the lower layer switches. Some central fileservers connect to the top layer switches. We use Up/Down routing.

Can we just connect the top layer switches of each fabric to one upper layer switch on the other fabric - would this work with Up/Down or some other routing?

The two fabrics are not close and we need fibre cabling between them, hence we would like to keep them as independent fat tree topologies with just a few connections between them to get access to storage local to each fabric.

Kind regards,

Heiner Billich

ophirm · October 7, 2016, 6:25pm

Combining IB fabrics is quite common, and sharing storage is often the reason. There are enough detailed considerations that you should consult with your local Mellanox sales engineer.

However, there are some general points that may be helpful.

Although not suggested in the original question, it is tempting to simply connect a few cables from the lower (L1) switches on Cluster A to the L1 switches on Cluster B, but this is not recommended. It may seem to work, but it can have unexpected side effects. For example, when using Up/Down routing, if the spine switches of Cluster A are declared as the root switches, the L1 switch of Cluster B that connects to Cluster A is treated as being ‘below’ the L1 switches in Cluster A. The spine switches of Cluster B become even ‘lower’, and the remaining L1 switches of Cluster B are lower still because they are farthest from the roots. Effectively, one L1 switch from Cluster B (the one connected to Cluster A) is closer to Cluster A than the rest. If the storage is in Cluster A, the storage latency from that portion of Cluster B will be two switch hops less than from the rest of Cluster B.
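To make the ranking effect concrete, here is a minimal Python sketch. It assumes a simplified model in which Up/Down ranks each switch by its link distance from the nearest root (a breadth-first search); a real subnet manager has more rules than this. The toy topology (2 spines and 3 L1 switches per cluster, one L1-to-L1 cable) and all switch names are hypothetical, chosen only for illustration.

```python
from collections import deque


def build_adjacency(links):
    """Turn a list of (switch, switch) cables into an adjacency map."""
    adj = {}
    for u, v in links:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def rank_switches(adj, roots):
    """Breadth-first search from the roots; rank = links to the nearest root."""
    rank = {r: 0 for r in roots}
    queue = deque(roots)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in rank:
                rank[v] = rank[u] + 1
                queue.append(v)
    return rank


# Hypothetical toy fabrics: every L1 switch is cabled to both spines of its cluster.
links = [(f"{c}-spine{s}", f"{c}-L1-{n}")
         for c in "AB" for s in (1, 2) for n in (1, 2, 3)]
# The tempting shortcut: one cable between an L1 switch on each side.
links.append(("A-L1-1", "B-L1-1"))

roots = ["A-spine1", "A-spine2"]  # Cluster A spines declared as roots
rank = rank_switches(build_adjacency(links), roots)

for name in sorted(rank, key=lambda n: (rank[n], n)):
    print(rank[name], name)
```

In this toy layout the cross-connected Cluster B switch ends up at rank 2, the Cluster B spines at rank 3, and the other Cluster B L1 switches at rank 4, which is the two-hop asymmetry described above.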

Now assume that for some reason, e.g. resiliency, two L1 switches on Cluster B are connected to two switches on Cluster A. Call the Cluster B L1 switches ‘X’ and ‘Y’. If Cluster A is the root for Up/Down routing, switches X and Y will be closer to the root switches than the other L1 switches in Cluster B. Now, if a node on switch X talks to a node on switch Y, the traffic will be routed through the spine of Cluster A, via the Cluster A L1 switches, and not through the spine of Cluster B. The latency between switches X and Y is now higher, by two switch hops, than the latency between other L1 switches in Cluster B. For compute nodes this is probably not a good outcome.
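The X/Y case can be checked with a small Python sketch. It assumes the usual Up/Down restriction (a path may not go up again once it has gone down) and the same simplified distance-from-root ranking; it is not the subnet manager's actual code. The topology and switch names are hypothetical, with `B-L1-1` playing X and `B-L1-2` playing Y.

```python
from collections import deque


def build_adjacency(links):
    adj = {}
    for u, v in links:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def rank_switches(adj, roots):
    """Rank = link distance to the nearest root (simplified model)."""
    rank = {r: 0 for r in roots}
    queue = deque(roots)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in rank:
                rank[v] = rank[u] + 1
                queue.append(v)
    return rank


def updn_path(adj, rank, src, dst):
    """Shortest path that never goes up (toward the roots) after going down."""
    start = (src, False)            # state = (switch, already_went_down)
    prev = {start: None}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        node, went_down = state
        if node == dst:
            path = []
            while state is not None:
                path.append(state[0])
                state = prev[state]
            return path[::-1]
        for nxt in sorted(adj[node]):
            going_up = rank[nxt] < rank[node]
            if going_up and went_down:
                continue            # down-then-up turn is forbidden
            nxt_state = (nxt, went_down or not going_up)
            if nxt_state not in prev:
                prev[nxt_state] = state
                queue.append(nxt_state)
    return None


links = [(f"{c}-spine{s}", f"{c}-L1-{n}")
         for c in "AB" for s in (1, 2) for n in (1, 2, 3)]
links += [("A-L1-1", "B-L1-1"), ("A-L1-2", "B-L1-2")]  # X and Y cross-cabled

adj = build_adjacency(links)
rank = rank_switches(adj, ["A-spine1", "A-spine2"])

x_to_y = updn_path(adj, rank, "B-L1-1", "B-L1-2")
other = updn_path(adj, rank, "B-L1-3", "B-L1-1")
print("X -> Y:", x_to_y)
print("B-L1-3 -> X:", other)
```

In this model, the X to Y path crosses three intermediate switches in Cluster A, while a path from the third Cluster B L1 switch to X crosses only one Cluster B spine. The Cluster B spines sit below X and Y in rank, so using them between X and Y would require a forbidden down-then-up turn.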

Another point: connecting nodes, e.g. fileservers, directly to spine switches is generally discouraged. Such nodes are often the reason that combining two IB fabrics requires a more detailed analysis. The analysis depends on the traffic patterns to and from those nodes. To address the original question more specifically: cross-connecting between the spines of two clusters is possible but not common, and it may not be possible with nodes connected directly to any spine. There are two approaches that are more common. One approach is to connect the spine(s) of one cluster to one or more L1 switches of the other cluster, i.e. connect Cluster B as a sub-tree of Cluster A, and form a 4-tier fabric. Another approach is to connect the spines of Cluster B as though they are L1 switches of Cluster A. In both approaches, the key is to do this effectively with only a handful of cables.
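As a rough illustration of why these two approaches behave better, the Python sketch below ranks the Cluster B L1 switches under each one, again assuming the simplified rule that rank equals link distance from the nearest root. The topology, cable choices, and switch names are hypothetical, and nodes attached directly to spines are not modeled.

```python
from collections import deque


def rank_switches(links, roots):
    """Rank = link distance to the nearest root (simplified model)."""
    adj = {}
    for u, v in links:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    rank = {r: 0 for r in roots}
    queue = deque(roots)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in rank:
                rank[v] = rank[u] + 1
                queue.append(v)
    return rank


def cluster(name):
    """2 spines and 3 L1 switches, every L1 switch cabled to both spines."""
    return [(f"{name}-spine{s}", f"{name}-L1-{n}")
            for s in (1, 2) for n in (1, 2, 3)]


base = cluster("A") + cluster("B")
roots = ["A-spine1", "A-spine2"]

# Approach 1: Cluster B hangs below Cluster A's L1 switches as a sub-tree.
subtree = base + [("B-spine1", "A-L1-1"), ("B-spine2", "A-L1-2")]

# Approach 2: Cluster B's spines are cabled to A's spines like L1 switches.
as_l1 = base + [("B-spine1", "A-spine1"), ("B-spine1", "A-spine2"),
                ("B-spine2", "A-spine1"), ("B-spine2", "A-spine2")]

b_l1 = [f"B-L1-{n}" for n in (1, 2, 3)]
subtree_ranks = {sw: rank_switches(subtree, roots)[sw] for sw in b_l1}
as_l1_ranks = {sw: rank_switches(as_l1, roots)[sw] for sw in b_l1}
print("sub-tree:", subtree_ranks)
print("as L1:   ", as_l1_ranks)
```

In both toy layouts every Cluster B L1 switch gets the same rank (3 in the sub-tree case, 2 when the Cluster B spines act as L1 switches), so no part of Cluster B is treated differently from the rest.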

Let us know if you prefer to take this offline with a local Mellanox resource.