Hi Folks,
I’ve got a strange issue: I can’t ping other IP addresses over IB. This is a live environment, so I know the issue is with this machine or its configuration. Where might my issue lie?
[root@transfer ~]# ping -c 3 10.12.0.1
PING 10.12.0.1 (10.12.0.1) 56(84) bytes of data.
From 10.12.200.17 icmp_seq=1 Destination Host Unreachable
From 10.12.200.17 icmp_seq=2 Destination Host Unreachable
From 10.12.200.17 icmp_seq=3 Destination Host Unreachable
--- 10.12.0.1 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 3000ms
Now, I recently installed the MLNX_OFED_LINUX-4.3-1.0.1.0-rhel6.9-x86_64 driver on a CentOS 6.9 system, and here is my environment in a nutshell:
“ip addr” shows this:
10: ib0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN qlen 256
link/infiniband a0:00:02:20:fe:80:00:00:00:00:00:00:e4:1d:2d:03:00:1e:05:b1 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
inet 10.12.200.17/16 brd 10.12.255.255 scope global ib0
And the interface file looks like this:
=========
DEVICE=ib0
ONBOOT=yes
NM_CONTROLLED=no
BOOTPROTO=none
IPADDR=10.12.200.17
PREFIX=16
MTU=1500
==========
However, according to “ibv_devinfo”, port 2 is active (port 1 is down):
[root@transfer ~]# ibv_devinfo
hca_id: mlx4_0
transport: InfiniBand (0)
fw_ver: 2.36.5000
node_guid: e41d:2d03:001e:05b0
sys_image_guid: e41d:2d03:001e:05b3
vendor_id: 0x02c9
vendor_part_id: 4099
hw_ver: 0x1
board_id: DEL1090001019
phys_port_cnt: 2
Device ports:
port: 1
state: PORT_DOWN (1)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: InfiniBand
port: 2
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid: 29
port_lmc: 0x00
link_layer: InfiniBand
I notice the MTU size from “ibv_devinfo” (4096) seems to conflict with the “ip addr” output (1500).
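To make the per-port state and MTU in the “ibv_devinfo” output above easier to compare against “ip addr”, here is a minimal sketch of a helper that condenses it to one line per port. The function name port_summary is hypothetical (not part of MLNX_OFED), and it assumes the field order shown above, with "state:" appearing before "active_mtu:" for each port.

```shell
# Hypothetical helper: reads ibv_devinfo-style text on stdin and prints
# one line per port in the form "<port> <state> <active_mtu>".
# Assumes each port block lists "state:" before "active_mtu:".
port_summary() {
    awk '
        $1 == "port:"       { port = $2 }               # remember current port number
        $1 == "state:"      { state = $2 }              # e.g. PORT_DOWN, PORT_ACTIVE
        $1 == "active_mtu:" { print port, state, $2 }   # emit once per port
    '
}
```

It would be used as "ibv_devinfo | port_summary", and the result could then be set next to the interface MTU reported by "ip addr show ib0".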
Some excerpts from lspci -vvv | grep -i mell :
04:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
Subsystem: Mellanox Technologies Device 0065
…
Product Name: CX354A - ConnectX-3 QSFP
Read-only fields:
[PN] Part number: 01T7NW
[EC] Engineering changes: A00
[V0] Vendor specific: PCIe Gen3 x8
…
Capabilities: [18c v1] #19
Kernel driver in use: mlx4_core
Kernel modules: mlx4_core
[root@transfer ~]# ibhosts
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,2,1,4,26 Attr 0x11:0) bad status 110; Connection timed out
Ca : 0x001e67030068e310 ports 1 "prod-0034 HCA-1"
Ca : 0x001e67030068e3f8 ports 1 "prod-0026 HCA-1"
Ca : 0x001e67030068e3a0 ports 1 "prod-0044 HCA-1"
Ca : 0x001e67030068caf0 ports 1 "prod-0043 HCA-1"
Ca : 0x001e67030068c9a8 ports 1 "prod-0023 HCA-1"
Ca : 0x001e67030066de04 ports 1 "prod-0022 HCA-1"
Ca : 0x001e670300670844 ports 1 "prod-0021 HCA-1"
Ca : 0x001e67030068ccc0 ports 1 "prod-0020 HCA-1"
Ca : 0x001e67030068d230 ports 1 "prod-0019 HCA-1"
Ca : 0x001e67030067257c ports 1 "prod-0047 HCA-1"
. . .