Network topology for multiple MPI clusters and one storage cluster?

We’re in the process of integrating a parallel storage cluster (3 storage servers) with 3 independent HPC MPI clusters and I have a question about the best network topology for our situation.


Each MPI cluster uses a single 36-port IB switch and has 30 nodes. The clusters are not currently joined over IB; they work independently. MPI communication will not traverse from one MPI cluster to the next; this is controlled in user space. The storage cluster consists of 3 file servers and operates as a parallel file system using RDMA over IB (each compute node must “see” all 3 file servers, and each file server must see the other two). We already configured the file system and ran successful tests while connected to one MPI cluster through a single switch (we just connected the 3 file servers to the same switch as the nodes).


We need to join the 3 storage servers (i.e. the storage cluster) to each of the 3 MPI clusters so that every node can “see” each file server over RDMA.

Each MPI cluster must continue to operate independently (we think we can control this in user space).

We’re not concerned about oversubscribing the links to the file servers. Data transfer throughput is not a huge concern.

Cost is a concern.

I attached an illustration with 2 design ideas. I think 3 ports at each MPI switch for storage traffic will be more than sufficient to handle the throughput. At a minimum, I think all we have to do is use a 4th switch to join the storage cluster (Design #1). But I’m not 100% certain this is the best approach. In Design #2, I assume RDMA traffic will traverse up/down through the L2 switch and MPI traffic will be limited to only the nodes within each MPI switch (again, controlled at the user level). Which is the ideal design, if any? What is the best routing algorithm(s) for the SM? What are the potential problems?

Many many thanks in advance.

RDMA-MPI_network-design-mellanox.pdf (27.5 KB)

Please let me know how it works; it should be an ideal config.


Scot Schultz

Director, HPC and Technical Computing

Mellanox Technologies

350 Oakmead Parkway, Suite 100, Sunnyvale CA, 94085

Office: 408-916-0018, Mobile: 408-444-1364, Fax: 408-585-0318

Hi Scot,

Many thanks. We’ll probably go with design #1 and use the default SM, as you suggest.


Hello John,

I looked at the PDF; design #1 should work. The independent clusters’ MPI traffic should remain local. Design #2 could be modified for storage high availability if you wanted to go that route, but it is not needed otherwise.

Regarding the routing algorithm, just use the out-of-the-box SM; I don’t see any other option here that would improve on this configuration.
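For reference, OpenSM's routing engine can be left at its default (Min Hop) or set explicitly. A minimal config sketch, assuming OpenSM on Linux; the config path `/etc/opensm/opensm.conf` may differ by distribution:

```shell
# /etc/opensm/opensm.conf -- relevant line only.
# Leave routing_engine unset (or "minhop") to keep the default Min Hop routing,
# which is fine for Design #1's single extra leaf switch.
#routing_engine minhop

# If you later adopt a two-level topology like Design #2, Up/Down routing
# avoids credit loops through the L2 switch:
#routing_engine updn
```

The same engine can also be selected at startup with `opensm -R updn`; with the flat Design #1 fabric, though, the default needs no tuning.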