ConnectX-3 not coming up in CentOS 6.4 (and SL6.4)

Hi all,

I’m building a stateless cluster using ConnectX-3 adapters and Warewulf for management, but I’m having a hard time bringing the adapter up. I have built a chroot image with the InfiniBand packages, but it looks like the adapter is not being initialised (this happens on all compute nodes, so it is probably not a hardware issue).

[root@geomechanics fcanesin]# ssh n00
[root@n00 ~]# clear
[root@n00 ~]# hca_self_test.ofed
---- Performing Adapter Device Self Test ----
Number of CAs Detected ................. 1
PCI Device Check ....................... PASS
Kernel Arch ............................ x86_64
Host Driver Version .................... MLNX_OFED_LINUX-2.0-2.0.5 (OFED-2.0-2.0.5): 2.6.32-358.el6.x86_64
Host Driver RPM Check .................. PASS
Firmware on CA #0 VPI .................. v2.10.4700
Firmware Check on CA #0 (VPI) .......... NA
REASON: NO required fw version
Host Driver Initialization ............. PASS
Number of CA Ports Active .............. 0
Kernel Syslog Check .................... PASS
Node GUID on CA #0 (VPI) ............... NA
------------------ DONE ---------------------
[root@n00 ~]# ibstat
[root@n00 ~]# ifup ib0
Device ib0 does not seem to be present, delaying initialization.
[root@n00 ~]# cat /var/log/dmesg | grep ml
Command line: ro initrd=bootstrap/51/initfs.gz wwhostname=n00.cluster wwkmods=ipv6,ib_addr,ib_core,ib_mad,ib_sa,ib_,ib_umad,iw_cm,rdma_cm,rdma_ucm,mlx4_core,mlx4_ib,ib_mthca,ib_ipoib wwmaster=10.0.0.254 wwipaddr=10.0.0.100 wwnetmanetdev=eth0 BOOT_IMAGE=bootstrap/51/kernel
Kernel command line: ro initrd=bootstrap/51/initfs.gz wwhostname=n00.cluster wwkmods=ipv6,ib_addr,ib_core,ib_mad,ib,ib_ucm,ib_umad,iw_cm,rdma_cm,rdma_ucm,mlx4_core,mlx4_ib,ib_mthca,ib_ipoib wwmaster=10.0.0.254 wwipaddr=10.0.0.100 55.0 wwnetdev=eth0 BOOT_IMAGE=bootstrap/51/kernel
Compat-mlnx-ofed backport release: gcecc987
mlx4_core: Mellanox ConnectX core driver v1.1 (Jun 12 2013)
mlx4_core: Initializing 0000:04:00.0
mlx4_core 0000:04:00.0: PCI INT A -> GSI 32 (level, low) -> IRQ 32
mlx4_core 0000:04:00.0: setting latency timer to 64
mlx4_core 0000:04:00.0: command INIT_HCA (0x7) failed: fw status = 0x3
mlx4_core 0000:04:00.0: INIT_HCA returns -22
mlx4_core 0000:04:00.0: INIT_HCA command failed, aborting.
mlx4_core 0000:04:00.0: PCI INT A disabled
mlx4_core: probe of 0000:04:00.0 failed with error -22
[root@n00 ~]# lsmod | grep ib
mlx4_ib 154125 0
mlx4_core 233054 2 mlx4_en,mlx4_ib
libsas 74168 1 isci
scsi_transport_sas 35620 2 isci,libsas
ib_umad 12538 0
ib_ucm 12120 0
ib_uverbs 40038 2 rdma_ucm,ib_ucm
ib_ipoib 109448 0
ib_cm 41480 3 rdma_cm,ib_ucm,ib_ipoib
ib_sa 24010 5 rdma_ucm,rdma_cm,mlx4_ib,ib_ipoib,ib_cm
ib_mad 43081 4 mlx4_ib,ib_umad,ib_cm,ib_sa
ib_core 80859 12 rdma_ucm,rdma_cm,mlx4_ib,iw_cm,ib_umad,ib_ucm,ib_uverbs,ib_ipoib,ib_cm,ib_
ib_addr 5900 1 rdma_cm
compat 18042 17 mlx4_en,rdma_ucm,rdma_cm,mlx4_ib,mlx4_core,iw_cm,ib_umad,ib_ucm,sa,ib_mad,ib_core,ib_addr
ipv6 321422 88 ib_ipoib,ib_addr
[root@n00 ~]# lspci | grep Mell
04:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
[root@n00 ~]#

This is with MLNX_OFED, as you can see. I have also tested using yum groupinstall “Infiniband Support” … exactly the same problem… I’m out of ideas. Help!?
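For anyone triaging the same symptom on many nodes: the telling part of the dmesg output in this post is the group of mlx4_core lines reporting that INIT_HCA failed and that the probe aborted with error -22. Below is a minimal Python sketch that picks such lines out of saved dmesg text. The function name and the regular expression are my own illustration, not part of any Mellanox tool, and the pattern is only tuned to the messages quoted in this thread.

```python
import re

# Matches the kinds of failure lines quoted in this thread, e.g.
# "mlx4_core 0000:04:00.0: INIT_HCA returns -22".
# Assumption: other driver versions may word their errors differently.
FAILURE_RE = re.compile(r"mlx4_core[ :].*(?:failed|returns -\d+|aborting)")


def mlx4_failures(dmesg_text):
    """Return the mlx4_core lines that report a failed command or probe."""
    return [
        line.strip()
        for line in dmesg_text.splitlines()
        if FAILURE_RE.search(line)
    ]


# Sample taken from the dmesg lines in the post above.
sample = """\
mlx4_core: Initializing 0000:04:00.0
mlx4_core 0000:04:00.0: setting latency timer to 64
mlx4_core 0000:04:00.0: command INIT_HCA (0x7) failed: fw status = 0x3
mlx4_core 0000:04:00.0: INIT_HCA returns -22
mlx4_core 0000:04:00.0: INIT_HCA command failed, aborting.
mlx4_core: probe of 0000:04:00.0 failed with error -22
"""

for line in mlx4_failures(sample):
    print(line)
```

Feeding it the contents of /var/log/dmesg from each node would show quickly whether every node fails at the same INIT_HCA step.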

Interesting, the product sheet for that on the SuperMicro site definitely lists RHEL as being supported, so CentOS should work.

I’m guessing the firmware on the card is the latest from the SuperMicro site? It’s hard to tell, as their downloads don’t use numbers that correspond to firmware versions.

As a thought, the “Drivers” folder on the SuperMicro site has only one Linux download… Mellanox OFED 1.5.3 for an ancient version of Oracle Enterprise Linux.

Are you able to try Mellanox OFED 1.5.3 series for your CentOS install, instead of Mellanox OFED 2.x? Not sure it’ll work, but it’s worth a shot.

If that doesn’t work, then this is out of my depth and the proper Mellanox guys will probably need to look at this.

Great !! All adapters are in FDR mode and with latest drivers now!! \o/ …

What worked was dumping the card’s configuration with “flint -d /dev/…/…_cr0 dc” and after that burning the firmware with mlxburn.

Oh, another useful step is to run “lspci” to query the PCI slot and find out exactly what it’s seeing. Plenty of examples on this site if you haven’t seen them already.

Managed to install, but … same problem:

mlnxofed-docs ##################################################

Device (04:00.0):

04:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]

Link Width: 8x

PCI Link Speed: Unknown

Installation finished successfully.

Error: Firmware configuration file for /dev/mst/mt4099_pci_cr0 is not found

Skip firmware update for /dev/mst/mt4099_pci_cr0.

Configuring /etc/security/limits.conf.

Please reboot your system for the changes to take effect.

[root@n02 mnt]# hca_self_test.ofed

---- Performing Adapter Device Self Test ----

Number of CAs Detected … 1

PCI Device Check … PASS

Kernel Arch … x86_64

Host Driver Version … MLNX_OFED_LINUX-1.5.3-3.1.0 (OFED-1.5.3-3.1.0): 2.6.32-358.el6.x86_64

Host Driver RPM Check … PASS

Firmware on CA #0 VPI … v2.10.4700

Firmware Check on CA #0 (VPI) … NA

REASON: NO required fw version

Host Driver Initialization … PASS

Number of CA Ports Active … 0

Kernel Syslog Check … PASS

Node GUID on CA #0 (VPI) … NA

------------------ DONE ---------------------

[root@n02 mnt]# ibstat

[root@n02 mnt]#

But I noticed that the installer did not recognize the PCI-E link speed during the installation (“PCI Link Speed: Unknown” above). What could this mean? Maybe some BIOS configuration?
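Since the self-test report is the thing being compared between installs here, it may help to pull its fields into a dictionary so that "Number of CA Ports Active" and the firmware version can be checked on every node. This is only a sketch: the parser and its name are hypothetical, and it assumes the report keeps the "label, run of dots (or an ellipsis), value" layout seen in this thread.

```python
import re

# A field line looks like "Label ....... value" or "Label … value".
# \u2026 is the single-character ellipsis that appears in pasted reports.
FIELD_RE = re.compile(r"^(.+?)\s*(?:\.{3,}|\u2026)+\s*(.+)$")


def parse_self_test(report):
    """Map each field label of an hca_self_test.ofed report to its value."""
    fields = {}
    for line in report.splitlines():
        match = FIELD_RE.match(line.strip())
        if match:  # banners and the REASON line have no separator; skip them
            fields[match.group(1)] = match.group(2)
    return fields


# Sample lines copied from the reports in this thread.
sample = """\
---- Performing Adapter Device Self Test ----
Number of CAs Detected ................. 1
Firmware on CA #0 VPI .................. v2.10.4700
Firmware Check on CA #0 (VPI) .......... NA
REASON: NO required fw version
Number of CA Ports Active \u2026 0
------------------ DONE ---------------------
"""

report = parse_self_test(sample)
print(report["Number of CA Ports Active"])
print(report["Firmware on CA #0 VPI"])
```

A node whose report shows zero active ports while "Host Driver Initialization" says PASS is in the same state as n00 and n02 above.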

Supermicro AOC-CIBF-M1 ConnectX-3 FDR card.

Hi canesin,

With Supermicro you should try to update using flint. They should normally provide you with an .ini file anyway?

Anyway, the flint command is something like:

flint -d <devicename_pci_cr0> -i -guid -nofs b

Try without the -nofs parameter first; if it complains, then you can try with -nofs.

Here we go! Now we are getting somewhere!

Can you please post the output of the ibstat command (or ibv_devinfo)?

Who did you buy those cards from? (Once you tell me that, I will tell you if you are able to get a newer firmware.)

FDR vs FDR10 - things that could happen:

  • old FW - you always want to be on the latest for both your HCAs and the switches

  • the wrong cable - make sure your cable is certified for FDR.

[root@n02 MLNX_OFED]# ./mlnx_add_kernel_support.sh -i /root/MLNX_OFED_LINUX-1.5.3-3.1.0-rhel6.3-x86_64.iso

ERROR: Linux Distribution (centos-release-6-4.el6.centos.10.x86_64) is not supported

Yeah, it could be that.

This is definitely past my level of knowledge. You’ve done everything possible software-wise, so if you can, it’d be a good idea to start playing around with BIOS settings to see if that changes things.

Now might also be a good time to contact SuperMicro support and ask them if there’s anything special that needs to be done, for the IB adapter to work. (guessing you’ve scanned their manuals for that already?)

Wow… getting there… now I have the cards seeing each other… but some are in FDR10 mode… how do I set these ones to FDR?

Cool.

Which firmware is it now running, that’s working?

Asking so the next person that has issues with this model of SuperMicro card has more data to work with.

As a very first thought, what’s the exact model of card you’re using?

So it’s all working now?

Good point, I remember hitting that too. It was easy to solve though, I just had to edit the mlnx_add_kernel_support.sh script to accept CentOS 6.4, at which point it then runs through fine and does everything it needs to.

Editing the script is simple. Use vi or something, and look for strings about “6.3”. Add one for 6.4, and have it think it’s running on rhel6.3.

That should work fine. (in theory)
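To find the spots to edit without paging through the whole script, something like the following Python sketch lists every line that mentions “6.3”. The helper name is made up for illustration, and the script excerpt below is invented purely to show the output format: the real mlnx_add_kernel_support.sh will look different, so check what it actually contains before changing anything.

```python
def lines_mentioning(script_text, needle="6.3"):
    """Return (line_number, line) pairs containing `needle`, 1-indexed."""
    return [
        (number, line)
        for number, line in enumerate(script_text.splitlines(), 1)
        if needle in line
    ]


# Hypothetical excerpt, NOT copied from the real installer script.
example = """\
case "$distro_rpm" in
    redhat-release-server-6Server-6.3*)
        distro=rhel6.3
        ;;
esac
"""

for number, line in lines_mentioning(example):
    print(number, line.strip())
```

In practice you would pass it the text of the real script (for instance via open(path).read()) and add a 6.4 entry next to each 6.3 match it reports.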

Excellent. Now you’re ready to rock and roll.

DID it!!

The firmware was older than what OFED/MLNX_OFED supports (2.11) … if someone needs it I can pass along the .iso I created (or if someone has a server available to host it).
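For comparing a card's firmware against a minimum like that, note that version strings should be compared numerically, field by field: as plain text, "2.9" would sort after "2.10". A minimal Python sketch, assuming the "v2.10.4700" format printed by the self test and taking the 2.11 minimum from this post (the helper names and the second test version are illustrative only):

```python
def parse_fw(version):
    """Turn a string such as 'v2.10.4700' into the tuple (2, 10, 4700)."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def fw_at_least(version, minimum):
    """True if `version` is numerically greater than or equal to `minimum`."""
    return parse_fw(version) >= minimum


# Assumption: minimum taken from the "(2.11)" mentioned in this post.
MIN_FW = (2, 11)

print(fw_at_least("v2.10.4700", MIN_FW))  # firmware reported in this thread
print(fw_at_least("v2.11.500", MIN_FW))   # illustrative newer version
```

Tuple comparison handles the differing lengths: (2, 10, 4700) is below (2, 11), while anything starting with (2, 11, ...) is at or above it.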

---- Performing Adapter Device Self Test ----

Number of CAs Detected … 1

PCI Device Check … PASS

Kernel Arch … x86_64

Host Driver Version … MLNX_OFED_LINUX-1.5.3-3.1.0 (OFED-1.5.3-3.1.0): 2.6.32-358.el6.x86_64

Host Driver RPM Check … PASS

Firmware on CA #0 VPI … v2.10.4700

Firmware Check on CA #0 (VPI) … NA

REASON: NO required fw version

Host Driver Initialization … PASS

Number of CA Ports Active … 0

Port State of Port #1 on CA #0 (VPI)… DOWN (InfiniBand)

Error Counter Check on CA #0 (VPI)… PASS

Kernel Syslog Check … PASS

Node GUID on CA #0 (VPI) … 00:25:90:ff:ff:07:f4:50

------------------ DONE ---------------------

yeap

Running 2.10.4700. I was not able to upgrade the firmware - when I run mlxburn it says it doesn’t have the configuration for it (brc/ini).

I think I also have a problem with the switch now… =/ … The LEDs for FAN and PS1 are not green …

See if this post helps: How to change ConnectX-2/3 VPI port type https://community.mellanox.com/s/article/howto-change-port-type-in-mellanox-connectx-3-adapter

It might be that you have a VPI card that is set to work in Ethernet mode. Try to flip it over to IB.