Failed loading HCA driver and Access Layer

I'm sorry to ask again; I'm new to InfiniBand, so I don't know all the tricks or how to make it work just yet :)

I managed to install the software as described in the previous article, and rebooted the blade node.

When I start it again now, I get this error:

Loading HCA driver and Access Layer = Failed

Please open an issue in the http://bugs.openfabrics.org and attach /tmp/ib_debug_info_log.

The debug file is a copy of dmesg, and it has the following lines:

mlx4_ib 80171 0

ib_mad 40497 5 ib_cm,ib_sa,ib_umad,mlx4_ib,ib_mthca

ib_core 69979 9 ib_cm,ib_sa,ib_uverbs,ib_umad,iw_nes,iw_cxgb3,mlx4_ib,ib_mthca,ib_mad

mlx4_en 97664 0

mlx4_core 185193 2 mlx4_ib,mlx4_en

mlx4_core: Mellanox ConnectX core driver v1.0-mlnx_ofed1.5.3 (November 3, 2011)

mlx4_core: Initializing 0000:03:00.0

mlx4_core 0000:03:00.0: PCI INT A -> GSI 48 (level, low) -> IRQ 48

mlx4_core 0000:03:00.0: setting latency timer to 64

mlx4_core 0000:03:00.0: vpd r/w failed. This is likely a firmware bug on this device. Contact the card vendor for a firmware update.

mlx4_core 0000:03:00.0: vpd r/w failed. This is likely a firmware bug on this device. Contact the card vendor for a firmware update.

mlx4_en: Mellanox ConnectX HCA Ethernet driver v1.5.8.3 (June 2012)

mlx4_ib: Mellanox ConnectX InfiniBand driver v1.0-mlnx_ofed1.5.3 (November 3, 2011)

Apr 28 12:32:48 dpn01 modprobe: FATAL: Error inserting ib_ipoib (/lib/modules/2.6.32-279.el6.x86_64/extra/mlnx-ofa_kernel/drivers/infiniband/ulp/ipoib/ib_ipoib.ko): Unknown symbol in module, or unknown parameter (see dmesg)

Apr 28 12:44:44 dpn01 modprobe: FATAL: Error inserting ib_ipoib (/lib/modules/2.6.32-279.el6.x86_64/extra/mlnx-ofa_kernel/drivers/infiniband/ulp/ipoib/ib_ipoib.ko): Unknown symbol in module, or unknown parameter (see dmesg)

Apr 28 12:48:06 dpn01 root[4494]: Set node_desc for mlx4_0: dpn01 HCA-1

root 15022 0 12:29 ? 00:00:00 [mlx4]

root 15042 0 12:29 ? 00:00:00 [mlx4_opreq]

root 15682 0 12:29 ? 00:00:00 [mlx4_sense]

root 15772 0 12:29 ? 00:00:00 [mlx4_en]

root 26512 0 12:29 ? 00:00:00 [mlx4_ib]

It looks like it's loading the HCA Ethernet driver, but why does it fail on the others? Is it because of the firmware lines above?

Any help appreciated.
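The modprobe lines above say to "see dmesg" for the symbol that could not be resolved. As a rough sketch (not an official OFED tool), the helper below pulls the names of the modules that failed to insert out of syslog text. The function name `failed_modules` and the `/var/log/messages` path in the usage comment are assumptions for illustration.

```shell
# Print the name of each module that modprobe failed to insert.
# Reads syslog lines on stdin and matches the "FATAL: Error inserting" text
# shown in the log excerpt above; duplicates are collapsed by sort -u.
failed_modules() {
    sed -n 's/.*FATAL: Error inserting \([^ ]*\) .*/\1/p' | sort -u
}

# Usage on a live system (the log path is an assumption, adjust as needed):
#   failed_modules < /var/log/messages
#   dmesg | grep -i 'unknown symbol'
```

The second usage line is the follow-up the error message itself suggests: the kernel log should name the symbol that was missing when the module was inserted.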

It seems it has somewhat resolved itself :)

As for the other info, here it is:

[root@dpn08]# lspci |grep Mel

03:00.0 Network controller [0207]: Mellanox Technologies MT27500 Family [ConnectX-3]

[root@dpn08]# mstflint -d 03:00.0 query

Image type: ConnectX

FW Version: 2.11.550

Rom Info: type=PXE version=3.4.0 devid=4099 proto=VPI

Device ID: 4099

Description: Node Port1 Port2 Sys image

GUIDs: 0002c90300389f60 0002c90300389f61 0002c90300389f62 0002c90300389f63

MACs: 000000000000 000000000000

VSD:

PSID: DEL0A20210018

The only other issue I have now is that, when testing with v2 OFED and CentOS 6.4, yum update fails because ibutils-libs is missing, and strange things happen when I try to install it.

It's part of the packages available for CentOS 6.4, but yum search says nothing is available.

Will a yum update with ibutils break the OFED package as well?

coiter - As a thought, since you’re running CentOS 6.x, you might find it easier to start with the CentOS provided drivers (instead of Mellanox OFED).

From a fresh CentOS install (without Mellanox OFED), you then do:

$ sudo yum groupinstall "Infiniband Support"

In theory that should install working drivers and things should “just work”.

It’s apparently not as optimised as the Mellanox OFED stuff. But it’s a pretty useful way to get up and running at first with minimal hassles.
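One quick way to check whether the drivers actually loaded after the install is to look at the module list. The snippet below is only a sketch: `ib_modules` is a made-up helper name, and the `ib_` / `mlx4_` prefixes are taken from the module names that appear earlier in this thread, so other hardware may need different prefixes.

```shell
# List loaded kernel modules whose names start with ib_ or mlx4_.
# Reads `lsmod` output on stdin so it also works on saved output.
ib_modules() {
    awk '$1 ~ /^(ib_|mlx4_)/ { print $1 }'
}

# Usage on a live system:
#   lsmod | ib_modules
```

If that prints nothing after a reboot, the drivers did not load and dmesg is the next place to look.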

It’s also pretty easy to remove those packages afterwards though if you want to try a different approach (ie Mellanox OFED):

$ sudo yum groupremove "Infiniband Support"

Hope that’s helpful.

(note - edited for typo fixes)

  1. What OS are you using?
  2. Is it possible to get the serial number, firmware and/or PSID?
  3. Where did you download this OFED version?

As an extra thought, when diagnosing new setups on RHEL and CentOS, the output from “lspci” is often helpful.

For example, on a test box here:

$ sudo lspci

00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor DRAM Controller (rev 09)

00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor PCI Express Root Port (rev 09)

00:02.0 VGA compatible controller: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor Graphics Controller (rev 09)

00:14.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB xHCI Host Controller (rev 04)

00:16.0 Communication controller: Intel Corporation 7 Series/C210 Series Chipset Family MEI Controller #1 (rev 04)

00:1a.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB Enhanced Host Controller #2 (rev 04)

00:1b.0 Audio device: Intel Corporation 7 Series/C210 Series Chipset Family High Definition Audio Controller (rev 04)

00:1c.0 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 1 (rev c4)

00:1c.4 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 5 (rev c4)

00:1c.5 PCI bridge: Intel Corporation 82801 PCI Bridge (rev c4)

00:1d.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB Enhanced Host Controller #1 (rev 04)

00:1f.0 ISA bridge: Intel Corporation Z77 Express Chipset LPC Controller (rev 04)

00:1f.2 SATA controller: Intel Corporation 7 Series/C210 Series Chipset Family 6-port SATA Controller [AHCI mode] (rev 04)

00:1f.3 SMBus: Intel Corporation 7 Series/C210 Series Chipset Family SMBus Controller (rev 04)

01:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX VPI PCIe 2.0 2.5GT/s - IB DDR / 10GigE] (rev a0)

03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 09)

04:00.0 PCI bridge: ASMedia Technology Inc. ASM1083/1085 PCIe to PCI Bridge (rev 03)

This helps us figure out what card the box is seeing, and where it’s located on the PCI bus. (01:00.0 in the example above)
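On a box with a long device list, it can be handy to filter the output down to just the Mellanox card(s). A small sketch of that, assuming the vendor string "Mellanox" appears in the lspci line as it does in the example above (`mellanox_pci_addrs` is just an illustrative name):

```shell
# Print the PCI address (first field) of every Mellanox device.
# Reads `lspci` output on stdin so it can also be run on saved output.
mellanox_pci_addrs() {
    awk '/Mellanox/ { print $1 }'
}

# Usage on a live system:
#   lspci | mellanox_pci_addrs
```

For the listing above, this would print 01:00.0, which is the address to pass to mstflint.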

With the CentOS provided “mstflint” package installed (or the Mellanox OFED “flint” equivalent), you can use that PCI address to check the firmware revision of the card(s):

$ sudo yum install mstflint

$ sudo mstflint -d 01:00.0 query

Image type: ConnectX

FW Version: 2.9.1000

Device ID: 25418

Description: Node Port1 Port2 Sys image

GUIDs: 0003ba000100edb8 0003ba000100edb9 0003ba000100edba 0003ba000100edbb

MACs: 0003ba00edb9 0003ba00edba

Board ID: (MT_04A0120002)

VSD:

PSID: MT_04A0120002

The firmware of the card above is version “2.9.1000”, which is actually useful to know.

(note, the mstflint “query” parameter can be abbreviated to just “q”. I used “query” above because it’s easier to mentally follow along with for new users. )
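If you need to collect the firmware version or PSID from several nodes, the query output is easy to pick apart. The following is a minimal sketch, assuming the output keeps the "Key: value" layout shown above; `mstflint_field` is a hypothetical helper name, not part of mstflint.

```shell
# Print the value of one "Key: value" field from `mstflint ... query` output.
# $1 is the field name exactly as printed, e.g. "FW Version" or "PSID".
# Reads the query output on stdin.
mstflint_field() {
    awk -F': *' -v key="$1" '$1 == key { print $2 }'
}

# Usage on a live system (01:00.0 is the PCI address from lspci):
#   sudo mstflint -d 01:00.0 query | mstflint_field "FW Version"
#   sudo mstflint -d 01:00.0 query | mstflint_field "PSID"
```

Reading from stdin keeps the helper usable on output that was saved to a file or pasted from another machine.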

(Note - edited to add mstflint yum command)