Installation and firmware update problems

Dear All,

I am new to IB. While installing InfiniBand drivers on a blade server, I ran into a severe problem with an MHGH28-XTC card (PSID MT_04A0110002).

  1. Install Scientific Linux X86_64 with default packages.

  2. $ lspci | grep InfiniBand

81:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR] (rev a0)

  3. Follow the OFED installation instructions.

[root@localhost MLNX_OFED_LINUX-1.5.3-3.1.0-rhel6.3-x86_64]# ./mlnxofedinstall --all

Device (81:00.0):

81:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX VPI PCIe 2.0 2.5GT/s - IB DDR / 10GigE] (rev a0)

Link Width: 8x

PCI Link Speed: 2.5Gb/s

Installation finished successfully.

-E- Can not open /dev/mst/mt25418_pci_cr0: MFE_CR_ERROR

-E- Can not open /dev/mst/mt25418_pci_cr0: MFE_CR_ERROR

-E- Can not open /dev/mst/mt25418_pci_cr0: MFE_CR_ERROR

There is no firmware found for /dev/mst/mt25418_pci_cr0.

Configuring /etc/security/limits.conf.

Please reboot your system for the changes to take effect.

  4. Reboot the blade.

#ibv_devinfo

No IB devices found

There were also other errors: “corrupted device ID 0xffffffff” and “HW/PCI access problem”.

  5. Download and update the firmware manually.

mstflint -d 81:00.0 -i fw-25408-2_9_1000-MHGH28-XTC_A2-A3.bin burn

Success. (mlxburn -dev /dev/mst/mt25418_pci_cr0 fails with the error: device ID 0xffffffff.)

  6. Reboot the blade.

lspci | grep InfiniBand

Nothing is returned! I cannot find the hardware any more!
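For anyone hitting the same symptoms, here is a minimal sketch of how one might check whether the HCA still answers on the PCI bus, independent of the device name string. It assumes `lspci -n` prints the slot, class, and vendor:device as the first three fields, and relies on 15b3 being the Mellanox PCI vendor ID. The helper names `count_mellanox` and `is_all_ones` are made up for illustration and are not part of OFED or MFT. An ID of 0xffffffff usually means that reads from the device come back as all ones, that is, the device did not respond, rather than being a real device ID.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of OFED or MFT.

# Count PCI functions whose vendor ID is 15b3 (Mellanox).
# Reads `lspci -n` style lines on stdin: "<slot> <class>: <vendor>:<device> ...".
count_mellanox() {
    awk '$3 ~ /^15b3:/ { n++ } END { print n + 0 }'
}

# Succeed (exit status 0) if the given ID is all ones, which is what a
# read from a non-responding PCI device normally returns.
is_all_ones() {
    case "$(printf '%s' "$1" | tr 'A-F' 'a-f')" in
        0xffffffff|ffffffff|0xffff|ffff) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use on a real machine (not run here):
#   lspci -n | count_mellanox
#   is_all_ones 0xffffffff && echo "device is not responding"
```

A count of 0 after a reboot would match the empty `lspci | grep InfiniBand` result described in the steps above.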

Please give me some advice on this problem. Many thanks for your time and help.

Best Regards,

Sang

Hi Sang,

I am not sure what happened to your card, but it sounds like the process misidentified it and bricked the card as a result.

But don’t lose hope. I know that the folks from Mellanox support can bring some of these cards back to life.

Please open a support ticket with Mellanox support (email support@mellanox.com or web http://support.mellanox.com/) and somebody will give you a hand.

Good luck!

Dear yairi,

I successfully upgraded the firmware (at least the program told me so), and the identification did exist. I do not understand why it reported "corrupted device ID 0xffffffff". I thought it was because the firmware was too old for the drivers.

However, it is bricked now.

The website asks for the serial number of the product, and I am afraid I cannot provide it now since I am not the buyer of the blade server. I have sent the email and hope there will be a response.

Thank you very much for your help.

Best,

Sang

2013/3/10 yairi <johns@mellanox.com>


Hi Sang,

Did you get the cards working?

  • Justin

Dear Justinclift,

I am afraid not yet. Since one IB HCA on the blade server is bricked now, I think I should somehow reflash the default firmware to the HCA. However, I cannot find a simple way to do this, such as a jumper on the HCA (an OEM device by DAWNING) for loading the default firmware stored in the HCA’s flash. There are no IB devices in the PCI list now: lspci | grep InfiniBand returns nothing.

I tried installing other operating systems, Red Hat 6 and 5, on another node. However, they failed to load the IB driver.

The firmware version is 2.5.8. I do not know whether I can use these IB HCAs with OFED 1.5 or Red Hat 6/5.
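As a rough illustration of the version question, the sketch below compares two dotted firmware versions, for example the 2.5.8 reported here against 2.9.1000, which is what the image file name in the first post (fw-25408-2_9_1000-...) appears to encode. `version_lt` is a hypothetical helper and assumes versions with exactly three numeric parts. Which minimum firmware a given OFED release actually needs is not stated in this thread and would have to come from that release's notes.

```shell
#!/bin/sh
# Sketch only: version_lt is a hypothetical helper, not an OFED/MFT tool.

# Succeed (exit status 0) if dotted version $1 is strictly older than $2.
# Assumes three numeric fields, e.g. 2.5.8 or 2.9.1000.
version_lt() {
    [ "$1" = "$2" ] && return 1
    # Sort the two versions numerically field by field; the older one comes first.
    oldest=$(printf '%s\n%s\n' "$1" "$2" | sort -t. -k1,1n -k2,2n -k3,3n | head -n 1)
    [ "$oldest" = "$1" ]
}

# Example use (not run here):
#   version_lt 2.5.8 2.9.1000 && echo "installed firmware is older than the image"
```

On a card that still responds, the installed version can be read from the "FW Version" line of `mstflint -d 81:00.0 query`.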

I despair of the upgrading program. One HCA lost is enough.

Best Regards

Sang

Hi,

All the cards are still physically in the server, but the operating system does not agree. I have no idea how to fix the bricked node or install OFED on the other surviving nodes.

: (

Sang

2013/3/22 justinclift <johns@mellanox.com>


As an idea, do you have a normal PC or non-blade server around, with a free PCIe slot (PCIe x16 or PCIe x8)?

If you do, then it might be a better idea to update the firmware in your other cards with that instead of in the blade server.

I can give you the exact instructions for updating the firmware in a non-blade server (using either RHEL or CentOS versions 6.3 or 6.4). I have very similar cards here. Also MHGH28-XTC, but a different hardware revision (mine are MT_04A0120002, so different firmware needed).

After the firmware is updated, you could then see if the cards work properly in the blade server with Scientific Linux.

Btw, with the PSID for your cards, are you reading it from the sticker on the back of the card or did you get it from somewhere else? Just wondering if you might have gotten it from the wrong place, and therefore downloaded the wrong firmware. (unsure)
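One way to cross-check the PSID before burning, sketched under some assumptions: that the `mstflint ... query` output contains a line starting with "PSID:", and that the installed mstflint can query an image file given with -i as well as a device. The helpers `psid_of` and `psid_matches` are hypothetical names used only for illustration.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of mstflint.

# Print the PSID found in `mstflint ... query` style text on stdin.
# Assumes a line of the form "PSID:   MT_04A0110002".
psid_of() {
    awk '$1 == "PSID:" { print $2; exit }'
}

# Succeed only if both PSIDs are non-empty and identical.
psid_matches() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# Example use on a machine where the card still responds (not run here):
#   dev_psid=$(mstflint -d 81:00.0 query | psid_of)
#   img_psid=$(mstflint -i fw-25408-2_9_1000-MHGH28-XTC_A2-A3.bin query | psid_of)
#   if psid_matches "$dev_psid" "$img_psid"; then
#       echo "PSIDs match"
#   else
#       echo "PSID mismatch or unreadable, do not burn"
#   fi
```

Treating an empty PSID as a mismatch is deliberate: if the device cannot be queried at all, burning is probably not a safe next step.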

Hopefully this helps.