Trunks, PVLANs, InfiniBand world

Hi, I have some HP c7000 blade chassis with some 40G InfiniBand (unmanaged) switches.

We want to connect an all-SSD array to the network.

I don't really have an understanding of connecting InfiniBand switches together.

Can I add more than one "trunk" between the switches, and do I have to worry about loops etc. with InfiniBand?

Also, how can I PVLAN off the chassis servers so they can't see each other, only the external storage array?

We'd likely use opensm as a start, or are we better off getting a switch with a built-in subnet manager?

Regards Daniel

Hi Daniel,

You have no loops to worry about in this setup: unlike Ethernet, InfiniBand's subnet manager computes the forwarding tables for the whole fabric, so there is no spanning tree to configure…

in any case, I suggest you have a look here:

Designing an HPC Cluster with Mellanox InfiniBand Solutions

And maybe start with this:

Understanding Up/Down InfiniBand Routing Algorithm

it will help you understand the routing algorithms and networking for InfiniBand.

This is a very small network. I would check whether all the ports are actually being used: if there are not many flows - let's say just one flow - only one port will be utilized (same as in Ethernet)…

Hi Daniel,

  1. You can add links between the switches, but what is the current topology of the network? Is it just standalone c7000 blade chassis (each with an IB blade switch), or are there 1U IB switches somewhere as well?

  2. Using opensm from MLNX_OFED can do the trick; this will be achieved using the partitions feature.
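For the "PVLAN-style" isolation asked about above, opensm partitions can make the blades *limited* members and the storage array a *full* member: limited members can only talk to full members, not to each other. A minimal partitions.conf sketch along those lines (the pkey value and port GUID below are placeholders, not values from this setup):

```
# /etc/opensm/partitions.conf  (sketch - GUID and pkey are placeholders)
# Every port must remain a member of the default partition 0x7fff:
Default=0x7fff, ipoib : ALL=limited, SELF=full;
# Storage partition: blades join as "limited" members, so they can reach
# only "full" members (the SSD array port), not each other - analogous
# to isolated ports on a private VLAN:
storage=0x0002, ipoib : ALL=limited, 0x0002c90300a1b2c3=full;
```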

Maybe the master will know ophirmaor :-)

Hi Daniel,

Assuming that adaptive routing is not enabled, the available paths will be statically balanced between each pair of Local IDs (LIDs). If only one LID exists on each end of the 2 cables, only one cable will be used. If a server has 2 ports (2 LIDs), there will be two sets of paths to that server from every other LID in the subnet, and the routing algorithms will try to statically allocate half of the paths to each cable. But the existence of multiple available paths doesn't mean they'll be utilized. Load balancing among two ports on the same server is a function of the Upper Layer Protocol (IPoIB, SRP, etc.). IPoIB doesn't do load balancing; MPI stacks do…
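A toy sketch of the static balancing described above (this is illustrative only, not OpenSM's actual code; the round-robin policy and function name are assumptions): each destination LID is pinned to one of the parallel inter-switch links, so with one LID per far end only one cable carries traffic.

```python
def assign_links(dest_lids, num_links):
    """Pin each destination LID to one of the parallel links, round-robin."""
    return {lid: i % num_links for i, lid in enumerate(sorted(dest_lids))}

# A single LID at the far end: all paths ride one cable, the other sits idle.
print(assign_links([5], 2))       # {5: 0}
# A dual-port server exposes two LIDs, so paths split across both cables.
print(assign_links([5, 6], 2))    # {5: 0, 6: 1}
```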

See some examples here as well regarding unused links.

VPI Gateway Considerations


Hi, sorry for the delays. Please see the physical layout below.

I'm wondering if anything special needs to be set up, like in the Ethernet world, to prevent loops etc.?

Regards, Daniel

Can you add another figure with all the servers?

Hi, there will be 3x chassis at first, with 16 blades in each.

Each blade will have a dual-port 40G mezzanine card connecting to the chassis 40G switches to reach the external SSD SAN storage.

I'll be using SRP targets in this first setup.

It might look like overkill for now, but future arrays will likely have 24x NVMe slots, so I'll need as much performance as possible.

Our initial testing in our lab, using only DDR 20G cards, shows the below from 2 SSD drives and some ZFS read cache.

Hi, I do have another question

If I have dual links connected to the same IB partition, will these load-balance to get to the host?

I'm running a Solaris SRP target and I'm trying to figure out how the clients will reach a dual-homed host.

Hmm, interesting. I did notice that the Node GUID was different from the Port GUIDs.

So as long as the target address is the Node GUID, IB will load-balance?

As an example:

# ibstat
CA 'mlx4_0'
        CA type: MT25418
        Number of ports: 2
        Firmware version: 2.3.0
        Hardware version: a0
        Node GUID: 0x0002c9030002fb04            !!!!!!!!!!!!!!!!!!!
        System image GUID: 0x0002c9030002fb07
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 20
                Base lid: 2
                LMC: 0
                SM lid: 1
                Capability mask: 0x02510868
                Port GUID: 0x0002c9030002fb05    !!!!!!!!!!!!!!!!!!!
        Port 2:
                State: Down
                Physical state: Polling
                Rate: 10
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x02510868
                Port GUID: 0x0002c9030002fb06    !!!!!!!!!!!!!!!!!!!
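To make the node-vs-port GUID layout concrete, here is a small helper (a sketch, not part of any IB toolset) that pulls the single Node GUID and the per-port GUIDs out of `ibstat`-style text like the output above:

```python
import re

def parse_ibstat(text):
    """Return (node_guid, [port_guids]) from ibstat output text."""
    node = re.search(r"Node GUID:\s*(0x[0-9a-f]+)", text)
    ports = re.findall(r"Port GUID:\s*(0x[0-9a-f]+)", text)
    return (node.group(1) if node else None), ports

sample = """
Node GUID: 0x0002c9030002fb04
Port 1:
    Port GUID: 0x0002c9030002fb05
Port 2:
    Port GUID: 0x0002c9030002fb06
"""
node_guid, port_guids = parse_ibstat(sample)
print(node_guid)   # 0x0002c9030002fb04  - one GUID for the whole HCA
print(port_guids)  # ['0x0002c9030002fb05', '0x0002c9030002fb06'] - one per port
```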