Pinging IPoIB address fails

Hi Folks,

I’ve got a strange issue: I can’t ping other IP addresses over IB. This is a live environment, so I know the problem is with this machine or its configuration. Where might the issue lie?

[root@transfer ~]# ping -c 3 10.12.0.1
PING 10.12.0.1 (10.12.0.1) 56(84) bytes of data.
From 10.12.200.17 icmp_seq=1 Destination Host Unreachable
From 10.12.200.17 icmp_seq=2 Destination Host Unreachable
From 10.12.200.17 icmp_seq=3 Destination Host Unreachable

--- 10.12.0.1 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 3000ms

I recently installed the MLNX_OFED_LINUX-4.3-1.0.1.0-rhel6.9-x86_64 driver on a CentOS 6.9 system. Here is my environment in a nutshell:

“ip addr” shows this:

10: ib0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN qlen 256
    link/infiniband a0:00:02:20:fe:80:00:00:00:00:00:00:e4:1d:2d:03:00:1e:05:b1 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet 10.12.200.17/16 brd 10.12.255.255 scope global ib0
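The NO-CARRIER flag together with "state DOWN" is the telling detail in that line: ib0 is administratively UP but has no link underneath it. A minimal Python sketch of checking for that, assuming the "<FLAG,FLAG,...>" layout shown above (the helper names are hypothetical, not part of any tool):

```python
import re


def parse_link_flags(ip_addr_line: str) -> set[str]:
    """Return the flags inside the <...> group of an `ip addr` interface line."""
    match = re.search(r"<([^>]*)>", ip_addr_line)
    if match is None:
        return set()
    return set(match.group(1).split(","))


def has_carrier(ip_addr_line: str) -> bool:
    """UP only means administratively enabled; NO-CARRIER means no link below it."""
    return "NO-CARRIER" not in parse_link_flags(ip_addr_line)


line = "10: ib0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN qlen 256"
print(sorted(parse_link_flags(line)))  # -> ['BROADCAST', 'MULTICAST', 'NO-CARRIER', 'UP']
print(has_carrier(line))               # -> False
```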

And the interface file looks like this:

=========
DEVICE=ib0
ONBOOT=yes
NM_CONTROLLED=no
BOOTPROTO=none
IPADDR=10.12.200.17
PREFIX=16
MTU=1500
==========

However, the port is up according to IB:

[root@transfer ~]# ibv_devinfo
hca_id: mlx4_0
    transport: InfiniBand (0)
    fw_ver: 2.36.5000
    node_guid: e41d:2d03:001e:05b0
    sys_image_guid: e41d:2d03:001e:05b3
    vendor_id: 0x02c9
    vendor_part_id: 4099
    hw_ver: 0x1
    board_id: DEL1090001019
    phys_port_cnt: 2
    Device ports:
        port: 1
            state: PORT_DOWN (1)
            max_mtu: 4096 (5)
            active_mtu: 4096 (5)
            sm_lid: 0
            port_lid: 0
            port_lmc: 0x00
            link_layer: InfiniBand
        port: 2
            state: PORT_ACTIVE (4)
            max_mtu: 4096 (5)
            active_mtu: 4096 (5)
            sm_lid: 1
            port_lid: 29
            port_lmc: 0x00
            link_layer: InfiniBand
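Only one of the two ports in that output is actually up. A minimal Python sketch that pulls the per-port state out of ibv_devinfo text makes this easy to spot; it assumes the "port: N" / "state: ..." layout shown above and is a hypothetical helper, not an official parser:

```python
import re


def port_states(devinfo_text: str) -> dict[int, str]:
    """Map each port number in ibv_devinfo output to its state, e.g. {1: 'PORT_DOWN'}."""
    states: dict[int, str] = {}
    current_port = None
    for raw in devinfo_text.splitlines():
        line = raw.strip()
        # "port: N" starts a new port section; phys_port_cnt/port_lid do not match.
        port_match = re.match(r"port:\s*(\d+)$", line)
        if port_match:
            current_port = int(port_match.group(1))
            continue
        state_match = re.match(r"state:\s*(\w+)", line)
        if state_match and current_port is not None:
            states[current_port] = state_match.group(1)
    return states


sample = """
hca_id: mlx4_0
    phys_port_cnt: 2
    Device ports:
        port: 1
            state: PORT_DOWN (1)
            port_lid: 0
        port: 2
            state: PORT_ACTIVE (4)
            port_lid: 29
"""
states = port_states(sample)
print(states)  # -> {1: 'PORT_DOWN', 2: 'PORT_ACTIVE'}
print([p for p, s in states.items() if s == "PORT_ACTIVE"])  # -> [2]
```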

I notice the MTU size reported by “ibv_devinfo” seems to conflict with the “ip addr” output.
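The two numbers measure different things: ibv_devinfo reports the InfiniBand link MTU, while ip addr reports the IPoIB interface MTU, which here comes from MTU=1500 in the interface file. In IPoIB datagram mode the interface MTU can be at most the link MTU minus the 4-byte IPoIB encapsulation header, so 1500 fits under 4096 and is not a conflict by itself. A small arithmetic sketch, assuming datagram mode (connected mode allows larger MTUs and is not modelled); the function names are hypothetical:

```python
# Link MTU sizes for the ibv_mtu enum values that ibv_devinfo prints, e.g. "4096 (5)".
IBV_MTU_BYTES = {1: 256, 2: 512, 3: 1024, 4: 2048, 5: 4096}

IPOIB_ENCAP_LEN = 4  # bytes of IPoIB encapsulation header


def max_datagram_ipoib_mtu(ibv_mtu_enum: int) -> int:
    """Largest IPoIB interface MTU usable in datagram mode for a given link MTU."""
    return IBV_MTU_BYTES[ibv_mtu_enum] - IPOIB_ENCAP_LEN


def mtu_conflicts(interface_mtu: int, ibv_mtu_enum: int) -> bool:
    """True only if the interface MTU cannot fit in a single IB datagram."""
    return interface_mtu > max_datagram_ipoib_mtu(ibv_mtu_enum)


print(max_datagram_ipoib_mtu(5))  # -> 4092
print(mtu_conflicts(1500, 5))     # -> False
```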

Some excerpts from lspci -vvv | grep -i mell:

04:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
    Subsystem: Mellanox Technologies Device 0065
    Product Name: CX354A - ConnectX-3 QSFP
    Read-only fields:
        [PN] Part number: 01T7NW
        [EC] Engineering changes: A00
        [V0] Vendor specific: PCIe Gen3 x8
    Capabilities: [18c v1] #19
    Kernel driver in use: mlx4_core
    Kernel modules: mlx4_core

[root@transfer ~]# ibhosts
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,2,1,4,26 Attr 0x11:0) bad status 110; Connection timed out
Ca : 0x001e67030068e310 ports 1 "prod-0034 HCA-1"
Ca : 0x001e67030068e3f8 ports 1 "prod-0026 HCA-1"
Ca : 0x001e67030068e3a0 ports 1 "prod-0044 HCA-1"
Ca : 0x001e67030068caf0 ports 1 "prod-0043 HCA-1"
Ca : 0x001e67030068c9a8 ports 1 "prod-0023 HCA-1"
Ca : 0x001e67030066de04 ports 1 "prod-0022 HCA-1"
Ca : 0x001e670300670844 ports 1 "prod-0021 HCA-1"
Ca : 0x001e67030068ccc0 ports 1 "prod-0020 HCA-1"
Ca : 0x001e67030068d230 ports 1 "prod-0019 HCA-1"
Ca : 0x001e67030067257c ports 1 "prod-0047 HCA-1"
. . .

Hi Siji,

The first issue I see above is that port 2, not port 1, is the one that’s connected, so either move the cable from port 2 to port 1 or use ib1 rather than ib0.

– Hal
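Hal’s diagnosis can also be checked from sysfs. A minimal sketch, assuming the standard /sys/class/infiniband/<hca>/ports/<n>/state files (which hold strings such as "4: ACTIVE") and assuming a single HCA with default IPoIB naming, where port 1 maps to ib0 and port 2 to ib1. Both helper names are hypothetical:

```python
from pathlib import Path


def active_ports(hca: str, sysfs_root: str = "/sys/class/infiniband") -> list[int]:
    """Return the port numbers whose sysfs state file reads 'N: ACTIVE'."""
    ports_dir = Path(sysfs_root) / hca / "ports"
    active = []
    for port_dir in sorted(ports_dir.iterdir(), key=lambda p: int(p.name)):
        state = (port_dir / "state").read_text().strip()
        # Keep only the textual part after "N:", e.g. "4: ACTIVE" -> "ACTIVE".
        if state.split(":", 1)[-1].strip() == "ACTIVE":
            active.append(int(port_dir.name))
    return active


def guess_ipoib_name(port: int) -> str:
    """ASSUMPTION: single HCA, default naming, so port 1 -> ib0, port 2 -> ib1."""
    return f"ib{port - 1}"
```

On the machine in this thread, active_ports("mlx4_0") should return [2], matching the ibv_devinfo output and pointing at ib1. The port-to-name mapping is only a heuristic and may not hold on systems with more than one HCA.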

That was it, Hal. Thanks!