We’re in the process of integrating a parallel storage cluster (3 storage servers) with 3 independent HPC MPI clusters, and I have a question about the best network topology for our situation.
Each MPI cluster uses a single 36-port IB switch and has 30 nodes. The clusters are not currently joined over IB; they operate independently, and MPI communication will not traverse from one MPI cluster to the next (this is controlled in user space). The storage cluster consists of 3 file servers and runs a parallel file system using RDMA over IB: each compute node must “see” all 3 file servers, and each file server must see the other two. We already configured the file system and ran successful tests while connected to one MPI cluster through a single switch (we simply connected the 3 file servers to the same switch as the nodes).
We need to join the 3 storage servers (i.e., the storage cluster) to each of the 3 MPI clusters so that every node can “see” each file server over RDMA.
Each MPI cluster must continue to operate independently (we believe we can control this in user space).
We’re not concerned about oversubscribing the links to the file servers; data-transfer throughput is not a major concern.
Cost is a concern.
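To illustrate the user-space isolation mentioned above, here is a minimal sketch of how we plan to pin each MPI job to a single cluster with an Open MPI hostfile (node names and slot counts below are placeholders, not our actual hosts):

```shell
# Hypothetical hostfile restricting a job to the nodes of MPI cluster A only
# (node names and slot counts are placeholders)
cat > hosts_clusterA <<'EOF'
nodeA01 slots=16
nodeA02 slots=16
nodeA03 slots=16
EOF

# With this hostfile, no ranks are ever placed on clusters B or C:
# mpirun --hostfile hosts_clusterA ./app
```

Since the scheduler only ever hands a job a hostfile from one cluster, MPI traffic never needs a path between the MPI switches.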
I attached an illustration with 2 design ideas. I think 3 ports on each MPI switch for storage traffic will be more than sufficient to handle the throughput. At a minimum, I think all we have to do is use a 4th switch to join the storage cluster (Design #1), but I’m not 100% certain this is the best approach. In Design #2, I assume RDMA traffic will traverse Up/Down (UPDN) routes through the L2 switch, while MPI traffic stays limited to the nodes within each MPI switch (again, controlled at the user level). Which is the ideal design, if any? What is the best routing algorithm(s) for the SM? What are the potential problems?
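For Design #2 specifically, my current assumption is that the SM would be pointed at UPDN with the L2 switch as the root. A sketch of the relevant opensm config fragment (the GUID and paths below are placeholders; `routing_engine` and `root_guid_file` are standard opensm options, but I haven’t tested this on our fabric):

```shell
# /etc/opensm/opensm.conf — relevant lines only (sketch, untested)
routing_engine updn                          # Up/Down routing for the 2-level tree
root_guid_file /etc/opensm/root_guids.conf   # lists the root switch GUID(s)

# /etc/opensm/root_guids.conf would contain one GUID per line,
# e.g. the L2 switch (placeholder GUID):
# 0x0002c90200000001
```

Is this the right way to seed the roots for UPDN here, or would ftree (or just minhop, given the traffic pattern) be a better fit?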
Many many thanks in advance.
RDMA-MPI_network-design-mellanox.pdf (27.5 KB)