Proper Configuration for IB-FDR and RoCE

Hello,

We have a few large clusters that came with Mellanox dual-port HCAs (QDR + 10GigE). The clusters were initially set up as RoCE clusters, but we have since acquired, and continue to acquire, IB FDR fabric infrastructure.

On the cluster with the dual-port QDR+10GigE HCAs, some MPI stacks (Open MPI 1.6.5 or 1.7.2 and Intel MPI 4.1.1) started getting confused, with communication at times stalling completely.

When I run ibstatus I get:

$ ibstatus
Infiniband device 'mlx4_0' port 1 status:
        default gid:     fe80:0000:0000:0000:78e7:d103:0023:91ad
        base lid:        0x1
        sm lid:          0x21
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            40 Gb/sec (4X QDR)
        link_layer:      InfiniBand

Infiniband device 'mlx4_0' port 2 status:
        default gid:     fe80:0000:0000:0000:7ae7:d1ff:fe23:91ad
        base lid:        0x0
        sm lid:          0x0
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            10 Gb/sec (1X QDR)
        link_layer:      Ethernet

When both ports are configured, is there any special setting needed so that the 10GigE/RoCE and IB parts work without interfering with each other? Do I need to set up opensm, which manages the IB part, so that it uses only the IB port for fabric management? Can you please suggest any guidelines for this situation with both ports configured?
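If binding opensm to the IB port is indeed the way to go, my understanding (untested on our side) is that opensm can be told which port to bind to via the port GUID; a rough sketch, with the GUID value taken from the port-1 GID shown above:

# find the port GUID of the IB port (port 1 on mlx4_0)
$ ibstat mlx4_0 1 | grep "Port GUID"
# run opensm bound to that port only
$ opensm -g 0x78e7d103002391ad

Is that the recommended way to do it?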

Is there any adverse effect from having BOTH RoCE and IB operating on a cluster at the same time?

The systems run RHEL 6.3 with the stock OFED and opensm that ship with that release.

uname -a :

Linux host 2.6.32-279.25.2.el6.x86_64 #1 SMP Tue May 14 16:19:07 EDT 2013 x86_64 x86_64 x86_64 GNU/Linux

thanks …

Michael

Hi Lui,

thanks for the reply!

Here is the ibstatus and ibv_devinfo output:

$ ibstatus
Infiniband device 'mlx4_0' port 1 status:
        default gid:     fe80:0000:0000:0000:24be:05ff:ff91:fee1
        base lid:        0x1c
        sm lid:          0x12
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            56 Gb/sec (4X FDR)
        link_layer:      InfiniBand

Infiniband device 'mlx4_0' port 2 status:
        default gid:     fe80:0000:0000:0000:26be:05ff:fe91:fee2
        base lid:        0x0
        sm lid:          0x0
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            10 Gb/sec (1X QDR)
        link_layer:      Ethernet

$ ibv_devinfo
hca_id: mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         2.11.1008
        node_guid:                      24be:05ff:ff91:fee0
        sys_image_guid:                 24be:05ff:ff91:fee3
        vendor_id:                      0x02c9
        vendor_part_id:                 4099
        hw_ver:                         0x0
        board_id:                       HP_0230240019
        phys_port_cnt:                  2
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        2048 (4)
                        active_mtu:     2048 (4)
                        sm_lid:         18
                        port_lid:       28
                        port_lmc:       0x00
                        link_layer:     InfiniBand

                port:   2
                        state:          PORT_ACTIVE (4)
                        max_mtu:        4096 (5)
                        active_mtu:     1024 (3)
                        sm_lid:         0
                        port_lid:       0
                        port_lmc:       0x00
                        link_layer:     Ethernet

Investigation led me to Open MPI recommendations to avoid using the same default GID prefix for both the IB and Ethernet ports.

Is this something you recommend? And how should I go about making a clean configuration where both 10GigE and IB are properly set up?

One other twist: some nodes have RoCE enabled. How do I configure IB so that it won't interfere with RoCE, and vice versa?
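From what I gather, applying that recommendation would mean giving the IB fabric a non-default subnet prefix through opensm's configuration; a rough sketch of what I have in mind (file location and value are assumptions, we have not tried this):

# in opensm's configuration file (e.g. /etc/opensm/opensm.conf; path may differ)
# move the IB subnet prefix away from the default fe80:0000:0000:0000 so that
# the IB and RoCE ports no longer appear to share a GID prefix
subnet_prefix 0xfe80000000000001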

Thanks!

Michael

Hello Michael,

From what I can see, the configuration you are describing should be workable. You mentioned that the connections become confused and stall at times, and you seem to be implying that the issue comes from using both InfiniBand and Ethernet on the same card.

Could you elaborate on this error?

Is there a specific output you are seeing?

What is the traffic like on these ports during this error condition?

Does it occur when both ports are carrying egress traffic, ingress traffic, or a mix?

I noticed you are using the RHEL 6.3 community OFED; have you tried whether you have better success with our driver (Mellanox OFED)?

Could you provide the ibv_devinfo output for your machines? I would like to see the PSID of your cards.
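As a side note, the PSID shows up as the board_id line in ibv_devinfo; if it is easier, I believe it can also be read with mstflint, for example (the PCI address below is only a placeholder, use lspci to find your HCA):

# query the firmware; the PSID is printed as one of the fields
$ mstflint -d 04:00.0 query | grep PSID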

Hi Michael,

Open MPI by default tries to run over every RDMA-capable port in the system. Since our HCA and driver support RoCE, it tries to run over the 10GbE port as well.

To include or exclude specific HCAs/ports for Open MPI, use the mca parameter, for example:

%mpirun -mca btl_openib_if_include "mlx4_0:1,mlx4_1:1" <…other mpirun parameters…>

In your case it should be:

%mpirun -mca btl_openib_if_include "mlx4_0:1" <…other mpirun parameters…>
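If you would rather not pass this flag on every mpirun, the same MCA parameter can also be set once in an MCA parameter file or in the environment; a small sketch (file locations are the standard Open MPI ones, adjust to your installation):

# per-user: $HOME/.openmpi/mca-params.conf
# (or system-wide: <openmpi-prefix>/etc/openmpi-mca-params.conf)
btl_openib_if_include = mlx4_0:1

# or via the environment before launching:
% export OMPI_MCA_btl_openib_if_include=mlx4_0:1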

Thanks, that's a good point… I guess the Open MPI stack thinks it has two physical transports and, in trying to load-share across them, runs into connectivity problems.

Do you think we can have both RoCE and IB active on the same set of hosts? Some groups here would like to use RoCE, but of course MPI over IB is the communication method of choice.

Actually, here is a question about the selection of routes among end-points: with the fat-tree topology we have multiple alternative paths connecting any pair of end-points (X, Y). Who determines the specific route that communication between two specific end-points (A, B) will take? Is it the MPI stack itself at IB connection-establishment time, or does it consult the SM? And can re-routing (or selecting an alternative to the initial path) happen at the request of, say, the MPI stack, or does the SM have to be consulted or adjust its own routing tables?
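As a related aside, I assume the path currently programmed by the SM between two end-points can be inspected with ibtracert from infiniband-diags; for example, using the LIDs from the ibv_devinfo output above (28 and 18):

# trace the switch hops between source LID 28 and destination LID 18
$ ibtracert 28 18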

Finally, we are using the OFED that came with RHEL 6.3 (1.5.4, I think?) for various, mostly non-technical, reasons. Do you have any concrete arguments in favor of deploying Mellanox's own latest OFED on that Linux distribution?

Thanks!

Michael

Hi Eddie,

thanks for the reply!

Yes, we also noticed that when we explicitly include the right IB interface, MPI communication proceeds smoothly.

Is there something I can do with (say) opensm, or elsewhere, to ensure that the MPI stacks use a specific interface? It so happens that some groups may want to use RoCE independently of MPI (for example via GASNet from UPC). Can we have RoCE and IB coexist without interfering with each other, whether MPI is in use or not?
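For the Intel MPI side, I am guessing the equivalent is to pin the fabric and adapter through environment variables; the variable names below are from memory and would need to be checked against the Intel MPI 4.1 reference manual before relying on them:

# assumed Intel MPI settings, not verified on our systems
export I_MPI_FABRICS=shm:ofa           # use the OFA (verbs) fabric
export I_MPI_OFA_ADAPTER_NAME=mlx4_0   # pin to this HCA
export I_MPI_OFA_NUM_PORTS=1           # use only the first port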

I saw some recommendations on the Open MPI site to give the 10GigE interface a default GID prefix different from that of the IB interface.

Thanks…

Michael

Hi Michael,

If you are referring to:

FAQ: Tuning the run-time characteristics of MPI InfiniBand, RoCE, and iWARP communications

then it actually talks about the case where a node has two IB ports connected to different IB fabrics.

Unfortunately, you do have to use the -mca btl_openib_if_include parameter; otherwise the traffic will be automatically load-balanced across the InfiniBand and Ethernet ports.
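Equivalently, if it is easier to maintain, you can exclude just the Ethernet port instead of listing the IB ports to include (note that the include and exclude parameters are mutually exclusive, so set only one of them):

%mpirun -mca btl_openib_if_exclude "mlx4_0:2" <…other mpirun parameters…>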
