OpenMPI not finding the device

Hello,

I’m not sure why OpenMPI cannot find my HCAs, since the system reports the following state:

PCI devices:


DEVICE_TYPE        MST   PCI       RDMA     NET             NUMA
ConnectX5(rev:0)   NA    5e:00.0   mlx5_0   net-enp94s0f0   0
ConnectX5(rev:0)   NA    5e:00.1   mlx5_1   net-enp94s0f1   0

but calling the app with any of the following commands always produces a warning:

mpirun -np 36 --mca btl_openib_if_include mlx5_0:1 -x UCX_NET_DEVICES=mlx5_0:1 -x HCOLL_MAIN_IB=mlx5_0:1 app.exe

mpirun -np 36 --mca btl_openib_if_include mlx5_0 -x UCX_NET_DEVICES=mlx5_0 -x HCOLL_MAIN_IB=mlx5_0 app.exe

mpirun -np 36 --mca btl_openib_if_include mlx5_1 -x UCX_NET_DEVICES=mlx5_1 -x HCOLL_MAIN_IB=mlx5_1 app.exe

The warnings, in the same order, are:

[1588295158.027413] [baseHPCbench:26725:0] ucp_context.c:690 UCX WARN network device ‘mlx5_0:1’ is not available, please use one or more of: ‘eno2’(tcp)

[1588295187.676307] [baseHPCbench:27101:0] ucp_context.c:690 UCX WARN network device ‘mlx5_0’ is not available, please use one or more of: ‘eno2’(tcp)

[1588295315.261353] [baseHPCbench:28270:0] ucp_context.c:690 UCX WARN network device ‘mlx5_1’ is not available, please use one or more of: ‘eno2’(tcp)

The app runs, but over ‘tcp’ rather than the IB port.

Thank you,

Arturo

Hello again,

I’m still seeing the same issue. I upgraded to CentOS 7.8 with MOFED 5.0-2.1.8.0-rhel7.8-x86_64 and everything seems to start OK. The status command returns:

sudo mst status -v

MST modules:

------------

MST PCI module is not loaded

MST PCI configuration module loaded

PCI devices:

------------

DEVICE_TYPE        MST                          PCI       RDMA     NET             NUMA
ConnectX5(rev:0)   /dev/mst/mt4121_pciconf0.1   5e:00.1   mlx5_1   net-enp94s0f1   0
ConnectX5(rev:0)   /dev/mst/mt4121_pciconf0     5e:00.0   mlx5_0   net-enp94s0f0   0

and the adapters are there:

lspci | grep Mell

5e:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]

5e:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]

However, OpenMPI and UCX are still unable to use them; every rank returns a message similar to:

[1589072572.935421] [instanceHPC1:8577 :0] ucp_context.c:690 UCX WARN network device ‘mlx5_0’ is not available, please use one or more of: ‘eno2’(tcp)

Thanks.

Note: This instance doesn’t use a hypervisor.

Hi Arturo,

Can you run mpirun with the options below?

-x UCX_NET_DEVICES=mlx5_0:1 -x HCOLL_MAIN_IB=mlx5_0:1

There is no need for btl_openib_if_include, i.e.:

mpirun -np 36 -x UCX_NET_DEVICES=mlx5_0:1 -x HCOLL_MAIN_IB=mlx5_0:1 app.exe
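
Before rerunning, it may also help to check which devices UCX itself detects on the node. The following is only a minimal sketch: it assumes the ucx_info tool that ships with UCX is on the PATH and that its -d output contains lines of the form "#  Device: <name>" (the exact layout may differ between UCX versions). The function name ucx_devices is made up for illustration.

```shell
# Hypothetical helper: print the unique device names found in
# `ucx_info -d` output, which is read from stdin.
# Assumption: device lines look like "#      Device: mlx5_0:1".
ucx_devices() {
    awk '$1 == "#" && $2 == "Device:" { print $3 }' | sort -u
}

# Usage (assumes ucx_info is installed on the node):
#   ucx_info -d | ucx_devices
```

If mlx5_0:1 is missing from that list, the UCX_NET_DEVICES setting cannot take effect, which would be consistent with the warning you are seeing.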

Thanks,

Samer

Hi Samer,

I saw your message, but I have to finish a couple of other projects first. I will try to test your proposed solution tomorrow or Saturday.

Thanks,

Arturo

Hello Samer,

I apologize for the long delay; several other matters required my immediate attention. Excluding ‘btl_openib_if_include’ didn’t make any difference: OpenMPI with UCX is still unable to find the device (same error messages). The device (at least mlx5_0) is there:

sudo ibv_devinfo

hca_id: mlx5_0
        transport:          InfiniBand (0)
        fw_ver:             16.23.1020
        node_guid:          506b:4b03:00cb:c1be
        sys_image_guid:     506b:4b03:00cb:c1be
        vendor_id:          0x02c9
        vendor_part_id:     4121
        hw_ver:             0x0
        board_id:           ORC0000000003
        phys_port_cnt:      1
        port: 1
                state:      PORT_ACTIVE (4)
                max_mtu:    4096 (5)
                active_mtu: 1024 (3)
                sm_lid:     0
                port_lid:   0
                port_lmc:   0x00
                link_layer: Ethernet
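
The fields that matter most in the output above (device name, port number, port state, and the link layer, which reads Ethernet rather than InfiniBand here) can be pulled out with a short filter. This is only a sketch: the function name summarize_ports is made up for illustration, and it assumes the ibv_devinfo field layout shown above, with link_layer as the last field of each port.

```shell
# Hypothetical helper: condense ibv_devinfo output (read from stdin)
# into one line per port: "<hca_id> port <n> <state> <link_layer>".
summarize_ports() {
    awk '
        $1 == "hca_id:"     { hca = $2 }      # remember the current device
        $1 == "port:"       { port = $2 }     # remember the current port number
        $1 == "state:"      { state = $2 }    # e.g. PORT_ACTIVE
        $1 == "link_layer:" { print hca, "port", port, state, $2 }
    '
}

# Usage (assumes ibv_devinfo is installed, as in the output above):
#   sudo ibv_devinfo | summarize_ports
# For the output above this prints:
#   mlx5_0 port 1 PORT_ACTIVE Ethernet
```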

Thanks,

Arturo