A newbie problem with InfiniBand

Hi all =)

I am a bit new to the forum, but I have been reading it for quite some time and the posts are very helpful. Thanks!

So I decided it was worth hopping on the InfiniBand wagon (it is clear why: the speed is awesome, and the performance for the price has no match). BUT …

I have run into some problems setting up the InfiniBand fabric.

Some information about my setup: an HP c7000 with 4 x ProLiant BL685c Gen1 blades, each with an HP 4x DDR dual-port mezzanine HCA. I also have 2 x HP 4x DDR IB Switch Modules (each with 16 downlink ports and 8 physical interfaces with CX4 connectors).

I am running VMware ESXi 5.1.0:

~ # esxcli system version get

Product: VMware ESXi

Version: 5.1.0

Build: Releasebuild-799733

Update: 0

So far so good. I have installed the required drivers:

  • Mellanox ESXi 5.0 driver ( esxcli software vib install -d /tmp/drivers/mlx4_en-mlnx-1.6.1.2-offline_bundle-471530.zip --no-sig-check )

  • Mellanox OFED driver ( esxcli software vib install -d /tmp/drivers/MLNX-OFED-ESX-1.8.1.0.zip --no-sig-check )

# esxcli software vib list | grep Mellanox

net-ib-cm 1.8.1.0-1OEM.500.0.0.472560 Mellanox PartnerSupported 2014-03-18

net-ib-core 1.8.1.0-1OEM.500.0.0.472560 Mellanox PartnerSupported 2014-03-18

net-ib-ipoib 1.8.1.0-1OEM.500.0.0.472560 Mellanox PartnerSupported 2014-03-18

net-ib-mad 1.8.1.0-1OEM.500.0.0.472560 Mellanox PartnerSupported 2014-03-18

net-ib-sa 1.8.1.0-1OEM.500.0.0.472560 Mellanox PartnerSupported 2014-03-18

net-ib-umad 1.8.1.0-1OEM.500.0.0.472560 Mellanox PartnerSupported 2014-03-18

net-memtrack 2013.0131.1850-1OEM.500.0.0.472560 Mellanox PartnerSupported 2014-03-18

net-mlx4-core 1.8.1.0-1OEM.500.0.0.472560 Mellanox PartnerSupported 2014-03-18

net-mlx4-en 1.6.1.2-1OEM.500.0.0.406165 Mellanox VMwareCertified 2014-03-18

net-mlx4-ib 1.8.1.0-1OEM.500.0.0.472560 Mellanox PartnerSupported 2014-03-18

scsi-ib-srp 1.8.1.0-1OEM.500.0.0.472560 Mellanox PartnerSupported 2014-03-18

After that I installed OpenSM ( esxcli software vib install -v /tmp/drivers/ib-opensm-3.3.15.x86_64.vib --no-sig-check ):

~ # esxcli software vib list | grep open

ib-opensm 3.3.15 Intel VMwareAccepted 2014-03-18

I also configured OpenSM per adapter with a partitions.conf file (Default=0x7fff,ipoib,mtu=5:ALL=full;), placing this file in the /scratch/opensm/adapter_1_hca/ and /scratch/opensm/adapter_2_hca/ directories:

/vmfs/volumes/530dc445-b2c469b5-adf0-0019bb3b460e/.locker/opensm # ls -la

drwxr-xr-x 1 root root 560 Feb 28 09:59 .

drwxr-xr-x 1 root root 980 Feb 28 09:59 ..

drwxr-xr-x 1 root root 420 Mar 18 12:31 0x00237dffff94d87d

drwxr-xr-x 1 root root 420 Mar 18 12:31 0x00237dffff94d87e

/vmfs/volumes/530dc445-b2c469b5-adf0-0019bb3b460e/.locker/opensm/0x00237dffff94d87d # cat partitions.conf

Default=0x7fff,ipoib,mtu=5:ALL=full;
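For readers unfamiliar with the mtu= field: it is not a byte count but an InfiniBand MTU code, where 5 corresponds to 4096 bytes. The sketch below is only an illustration (the function name ib_mtu_bytes is made up here, it is not part of OpenSM); it pulls the code out of the partition line above and decodes it.

```shell
#!/bin/sh
# Translate an InfiniBand MTU code (as written in partitions.conf) to bytes.
# ib_mtu_bytes is a hypothetical helper name, not an OpenSM tool.
ib_mtu_bytes() {
    case "$1" in
        1) echo 256 ;;
        2) echo 512 ;;
        3) echo 1024 ;;
        4) echo 2048 ;;
        5) echo 4096 ;;
        *) echo "unknown MTU code: $1" >&2; return 1 ;;
    esac
}

# The partition line from the partitions.conf shown above.
line='Default=0x7fff,ipoib,mtu=5:ALL=full;'

# Extract the digits that follow "mtu=".
code=$(printf '%s\n' "$line" | sed -n 's/.*mtu=\([0-9]*\).*/\1/p')

echo "mtu code $code = $(ib_mtu_bytes "$code") bytes"
```

Note that this is the partition (IPoIB multicast group) MTU, which is a separate setting from the MTU configured on the vmnic or vSwitch.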

I have been following these two tutorials:

http://www.vladan.fr/homelab-storage-network-speedup/

http://www.bussink.ch/?p=1183

Now I can see the adapters:

~ # esxcli network nic list | grep Mellanox

vmnic_ib0 0000:047:00.0 ib_ipoib Up 20000 Full 00:23:7d:94:d8:7d 1500 Mellanox Technologies MT25418 [ConnectX VPI - 10GigE / IB DDR, PCIe 2.0 2.5GT/s]

vmnic_ib1 0000:047:00.0 ib_ipoib Up 20000 Full 00:23:7d:94:d8:7e 1500 Mellanox Technologies MT25418 [ConnectX VPI - 10GigE / IB DDR, PCIe 2.0 2.5GT/s]

Also, when I run ./ibstat I get this:

/opt/opensm/bin # ./ibstat

CA 'mlx4_0'

CA type: MT25418

Number of ports: 2

Firmware version: 2.7.0

Hardware version: a0

Node GUID: 0x00237dffff94d87c

System image GUID: 0x00237dffff94d87f

Port 1:

State: Active

Physical state: LinkUp

Rate: 20

Base lid: 1

LMC: 0

SM lid: 6

Capability mask: 0x0251086a

Port GUID: 0x00237dffff94d87d

Link layer: InfiniBand

Port 2:

State: Active

Physical state: LinkUp

Rate: 20

Base lid: 5

LMC: 0

SM lid: 6

Capability mask: 0x0251086a

Port GUID: 0x00237dffff94d87e

Link layer: InfiniBand
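To make the ibstat output easier to compare between hosts, here is a minimal sketch that condenses it to one line per port. The function name ibstat_summary is hypothetical, the sample text is trimmed from the output above, and it assumes the field labels look exactly as printed there.

```shell
#!/bin/sh
# Condense ibstat output to one line per port: state, base LID and SM LID.
# ibstat_summary is a hypothetical helper; it reads ibstat text on stdin.
ibstat_summary() {
    awk '
        /^[ \t]*Port [0-9]+:/ { sub(/:/, "", $2); port = $2 }
        /^[ \t]*State:/       { state[port] = $2 }
        /^[ \t]*Base lid:/    { lid[port] = $3 }
        /^[ \t]*SM lid:/      { sm[port] = $3 }
        /^[ \t]*Link layer:/  {
            # "Link layer" is the last field of each port block.
            if (port != "")
                printf "port %s: state=%s lid=%s sm_lid=%s\n",
                       port, state[port], lid[port], sm[port]
        }
    '
}

# Sample input trimmed from the ibstat output above.
sample='Port 1:
State: Active
Physical state: LinkUp
Base lid: 1
SM lid: 6
Port GUID: 0x00237dffff94d87d
Link layer: InfiniBand
Port 2:
State: Active
Physical state: LinkUp
Base lid: 5
SM lid: 6
Port GUID: 0x00237dffff94d87e
Link layer: InfiniBand'

printf '%s\n' "$sample" | ibstat_summary
```

On a live host the same function could be fed with ./ibstat | ibstat_summary. In this output both ports report SM lid 6 while their own LIDs are 1 and 5, which suggests the master subnet manager is running on a different node in the fabric.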

So everything seems to be working, except it is not.

When I try to ping from one host to the other, I get this:

/opt/opensm/bin # ./ibping -S -dd

ibwarn: [15174] umad_init: umad_init

ibwarn: [15174] umad_open_port: ca (null) port 0

ibwarn: [15174] umad_get_cas_names: max 32

ibwarn: [15174] umad_get_cas_names: return 1 cas

ibwarn: [15174] resolve_ca_name: checking ca 'mlx4_0'

ibwarn: [15174] resolve_ca_port: checking ca 'mlx4_0'

ibwarn: [15174] umad_get_ca: ca_name mlx4_0

ibwarn: [15174] umad_get_ca: opened mlx4_0

ibwarn: [15174] resolve_ca_port: checking port 0

ibwarn: [15174] resolve_ca_port: checking port 1

ibwarn: [15174] resolve_ca_port: found active port 1

ibwarn: [15174] resolve_ca_name: found ca mlx4_0 with port 1 type 1

ibwarn: [15174] resolve_ca_name: found ca mlx4_0 with active port 1

ibwarn: [15174] umad_open_port: opening mlx4_0 port 1

ibwarn: [15174] dev_to_umad_id: mapped mlx4_0 1 to 0

ibwarn: [15174] umad_open_port: opened /dev/umad0 fd 3 portid 0

ibwarn: [15174] umad_register: fd 3 mgmt_class 3 mgmt_version 2 rmpp_version 1 method_mask (nil)

ibwarn: [15174] umad_register: fd 3 registered to use agent 0 qp 1

ibwarn: [15174] umad_register_oui: fd 3 mgmt_class 50 rmpp_version 0 oui 0x0145 method_mask 0xffd0cca0

ibwarn: [15174] umad_register_oui: fd 3 registered to use agent 1 qp 1 class 0x32 oui 0xffd0cc90

ibdebug: [15174] ibping_serv: starting to serve...

ibwarn: [15174] umad_recv: fd 3 umad 0x80579c0 timeout 4294967295

ibwarn: [15174] umad_recv: read returned 4294967232 > sizeof umad 64 + length 256 (Resource temporarily unavailable)

ibwarn: [15174] mad_receive_via: recv failed: Resource temporarily unavailable

ibdebug: [15174] ibping_serv: server out

For some reason I always get the "Resource temporarily unavailable" message. When I run ./ibping -L with the right LID or ./ibping -G with the right GUID, I always get this:

/opt/opensm/bin # ./ibping -G 0x001b78ffff34b9c6

ibwarn: [15237] _do_madrpc: recv failed: Resource temporarily unavailable

ibwarn: [15237] mad_rpc_rmpp: _do_madrpc failed; dport (Lid 6)

ibwarn: [15237] ib_path_query_via: sa call path_query failed

./ibping: iberror: failed: can't resolve destination port 0x001b78ffff34b9c6

So I would really appreciate any help with getting one node to ping the other.

I am thinking that my problem might be the HP 4x IB switch, but it shouldn't be, because with it I should at least be able to get a point-to-point connection. The switch doesn't have an onboard subnet manager, but I am using OpenSM, so that shouldn't be the problem either.

I want to use the InfiniBand connection for virtual storage between the ProLiants, but first I need to verify that there is a connection. Any help or suggestions would be welcome.

Thanks in advance.

Alex