IPoIB not working on Windows Server 2008 R2 - need help

I’m trying for the first time to get IPoIB working on one of our Windows servers. Details:

InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx HCA]

Windows Server 2008 R2

MLNX_WinOF_VPI_2_1_2_win7_x64.msi (as recommended by the Mellanox download page for InfiniHost III adapters)

I don’t see any errors: the adapter shows up fine and I can configure it with a static IP address. After configuring it (or after boot), I can ping it from another machine for about 10 seconds before it stops responding. When I ping out from the machine, the ICMP packets are sent out of the main Ethernet interface (which is on a different IP network), and I can see them reach our router. ibdiagnet does not report any errors. The output of ipconfig and netstat -r looks fine.
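The symptom of pings leaving through the main Ethernet interface is what longest-prefix route selection would produce if the on-link route for the IPoIB subnet were missing (for example, because the interface is treated as disconnected). A minimal Python sketch of that selection logic, purely for illustration: the networks and interface names below are made up, and `pick_interface` is a hypothetical helper, not a Windows API.

```python
import ipaddress

def pick_interface(dest, routes):
    """Return the interface of the most specific route containing dest.

    routes is a list of (network, interface) pairs; all values used
    below are hypothetical examples, not taken from the real setup.
    """
    dest_ip = ipaddress.ip_address(dest)
    matches = []
    for network, interface in routes:
        net = ipaddress.ip_network(network)
        if dest_ip in net:
            matches.append((net.prefixlen, interface))
    if not matches:
        return None
    # Longest prefix wins, as in an ordinary IP routing table.
    return max(matches)[1]

# Hypothetical routing table with the IPoIB on-link route present.
routes_ok = [
    ("0.0.0.0/0", "ethernet"),       # default route via the main NIC
    ("192.168.1.0/24", "ethernet"),  # assumed Ethernet subnet
    ("10.10.0.0/24", "ipoib"),       # assumed IPoIB subnet
]
# Same table with the IPoIB on-link route gone.
routes_missing = routes_ok[:2]

print(pick_interface("10.10.0.5", routes_ok))       # most specific route
print(pick_interface("10.10.0.5", routes_missing))  # falls back to default
```

If the second case matches what you see, it is worth re-checking the routing table at the moment the pings stop, not only right after configuring the address.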

I see the following in my opensm log:

Jun 06 11:51:50 600282 [29FC1700] 0x02 → log_notice: Reporting Generic Notice type:1 num:128 (Link state change) from LID:2 GID:fe80::1:6:6a00:d800:242

Jun 06 11:51:51 011771 [1F5B0700] 0x02 → osm_ucast_mgr_process: minhop tables configured on all switches

Jun 06 11:51:51 016889 [1F5B0700] 0x02 → log_notice: Reporting Generic Notice type:3 num:64 (GID in service) from LID:1 GID:fe80::1:5:ad00:c:5ced

Jun 06 11:51:51 016899 [1F5B0700] 0x02 → state_mgr_report_new_ports: Discovered new port with GUID:0x0005ad00000c5ced LID range [16,16] of node: Topspin DDR-HCAe LX x8

Jun 06 11:51:51 027491 [1F5B0700] 0x02 → SUBNET UP

Jun 06 11:51:56 333829 [213B3700] 0x02 → log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:1405:ffff::3333:0:1

Jun 06 11:51:56 333875 [209B2700] 0x02 → log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:1405:ffff::3333:ff76:9ac6

Jun 06 11:51:56 603980 [295C0700] 0x02 → log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:1405:ffff::3333:0:2

Jun 06 11:51:56 604270 [245B8700] 0x02 → log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:1405:ffff::3333:0:16

Jun 06 11:52:15 854497 [263BB700] 0x02 → log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:1405:ffff::3333:1:2

Jun 06 11:52:15 857261 [213B3700] 0x02 → log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:1405:ffff::3333:1:3

Jun 06 11:52:15 857968 [209B2700] 0x02 → log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:401b:ffff::16

Jun 06 11:52:15 963577 [21DB4700] 0x02 → log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:1 GID:ff12:401b:ffff::fc

Jun 06 12:04:56 535293 [26DBC700] 0x01 → mcmr_rcv_leave_mgrp: ERR 1B25: Received an invalid delete request for MGID: ff12:1405:ffff::3333:0:1 for PortGID: fe80::5:ad00:c:5ced

Jun 06 12:04:56 535870 [277BD700] 0x01 → mcmr_rcv_leave_mgrp: ERR 1B25: Received an invalid delete request for MGID: ff12:1405:ffff::3333:1:3 for PortGID: fe80::5:ad00:c:5ced

Jun 06 12:04:56 535908 [259BA700] 0x01 → mcmr_rcv_leave_mgrp: ERR 1B25: Received an invalid delete request for MGID: ff12:1405:ffff::3333:1:2 for PortGID: fe80::5:ad00:c:5ced

Jun 06 12:04:56 535942 [23BB7700] 0x01 → mcmr_rcv_leave_mgrp: ERR 1B25: Received an invalid delete request for MGID: ff12:401b:ffff::fc for PortGID: fe80::5:ad00:c:5ced

Jun 06 12:04:56 535970 [281BE700] 0x01 → mcmr_rcv_leave_mgrp: ERR 1B25: Received an invalid delete request for MGID: ff12:401b:ffff::16 for PortGID: fe80::5:ad00:c:5ced

Jun 06 12:04:56 536014 [277BD700] 0x01 → mcmr_rcv_leave_mgrp: ERR 1B25: Received an invalid delete request for MGID: ff12:1405:ffff::3333:ff76:9ac6 for PortGID: fe80::5:ad00:c:5ced

Jun 06 12:04:56 536042 [209B2700] 0x01 → mcmr_rcv_leave_mgrp: ERR 1B25: Received an invalid delete request for MGID: ff12:1405:ffff::3333:0:16 for PortGID: fe80::5:ad00:c:5ced

Jun 06 12:04:56 536634 [227B5700] 0x01 → mcmr_rcv_leave_mgrp: ERR 1B25: Received an invalid delete request for MGID: ff12:1405:ffff::3333:0:2 for PortGID: fe80::5:ad00:c:5ced

Jun 06 12:06:29 959894 [295C0700] 0x01 → mcmr_rcv_join_mgrp: ERR 1B13: validate_modify failed from port 0x0005ad00000c5ced (Topspin DDR-HCAe LX x8), sending IB_SA_MAD_STATUS_REQ_INVALID

Jun 06 12:06:29 960518 [231B6700] 0x01 → mcmr_rcv_leave_mgrp: ERR 1B25: Received an invalid delete request for MGID: ff12:1405:ffff::3333:1:2 for PortGID: fe80::5:ad00:c:5ced

Jun 06 12:06:36 629355 [26DBC700] 0x01 → mcmr_rcv_leave_mgrp: ERR 1B25: Received an invalid delete request for MGID: ff12:401b:ffff::1 for PortGID: fe80::5:ad00:c:5ced

Jun 06 12:06:36 629416 [259BA700] 0x01 → mcmr_rcv_leave_mgrp: ERR 1B25: Received an invalid delete request for MGID: ff12:401b:ffff::ffff:ffff for PortGID: fe80::5:ad00:c:5ced

Jun 06 12:06:36 638659 [21DB4700] 0x01 → mcmr_rcv_join_mgrp: ERR 1B13: validate_modify failed from port 0x0005ad00000c5ced (Topspin DDR-HCAe LX x8), sending IB_SA_MAD_STATUS_REQ_INVALID

Jun 06 12:06:36 638853 [245B8700] 0x01 → mcmr_rcv_leave_mgrp: ERR 1B25: Received an invalid delete request for MGID: ff12:401b:ffff::ffff:ffff for PortGID: fe80::5:ad00:c:5ced

This last message repeats quite a bit within that second, and then stops.
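Since the same errors repeat many times, it can help to summarize the log rather than read it line by line. A minimal Python sketch, assuming the log format shown above; `summarize` is a hypothetical helper and the sample contains only three lines copied from the log (in practice you would read the whole opensm log file):

```python
import re
from collections import Counter

# Three lines copied from the opensm log above.
SAMPLE_LOG = """\
Jun 06 12:04:56 535293 [26DBC700] 0x01 → mcmr_rcv_leave_mgrp: ERR 1B25: Received an invalid delete request for MGID: ff12:1405:ffff::3333:0:1 for PortGID: fe80::5:ad00:c:5ced
Jun 06 12:06:29 959894 [295C0700] 0x01 → mcmr_rcv_join_mgrp: ERR 1B13: validate_modify failed from port 0x0005ad00000c5ced (Topspin DDR-HCAe LX x8), sending IB_SA_MAD_STATUS_REQ_INVALID
Jun 06 12:06:36 629416 [259BA700] 0x01 → mcmr_rcv_leave_mgrp: ERR 1B25: Received an invalid delete request for MGID: ff12:401b:ffff::ffff:ffff for PortGID: fe80::5:ad00:c:5ced
"""

ERR_RE = re.compile(r"(?P<func>\w+): ERR (?P<code>[0-9A-F]{4}):")
MGID_RE = re.compile(r"MGID: (?P<mgid>\S+)")

def summarize(log_text):
    """Count (function, error code) pairs and collect the MGIDs involved."""
    codes = Counter()
    mgids = set()
    for line in log_text.splitlines():
        err = ERR_RE.search(line)
        if err is None:
            continue  # not an error line
        codes[(err.group("func"), err.group("code"))] += 1
        mgid = MGID_RE.search(line)
        if mgid is not None:
            mgids.add(mgid.group("mgid"))
    return codes, mgids

codes, mgids = summarize(SAMPLE_LOG)
for (func, code), count in sorted(codes.items()):
    print(f"{func} ERR {code}: {count}")
for mgid in sorted(mgids):
    print("MGID:", mgid)
```

This only groups what is already in the log; it does not interpret the error codes.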

The issue turned out to be that the Windows driver cannot handle non-standard GID prefixes. Reverting to using the standard GID prefix allowed it to work.
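The prefix mismatch is visible in the opensm log above if you split each GID into its upper 64 bits (subnet prefix) and lower 64 bits (port GUID). A minimal Python sketch, assuming 0xfe80000000000000 is the standard default prefix; `split_gid` is a made-up helper name, and the two GIDs are copied from the log (the "GID in service" notice and the PortGID in the ERR 1B25 lines):

```python
import ipaddress

# Standard InfiniBand default subnet prefix (assumed here as the reference).
DEFAULT_PREFIX = 0xFE80000000000000

def split_gid(gid):
    """Split a GID written in IPv6 notation into (subnet prefix, port GUID)."""
    value = int(ipaddress.IPv6Address(gid))
    return value >> 64, value & 0xFFFFFFFFFFFFFFFF

gids = {
    "GID reported by the SM": "fe80::1:5:ad00:c:5ced",
    "PortGID in the leave requests": "fe80::5:ad00:c:5ced",
}

for label, gid in gids.items():
    prefix, guid = split_gid(gid)
    kind = "default" if prefix == DEFAULT_PREFIX else "non-default"
    print(f"{label}: prefix=0x{prefix:016x} ({kind}), guid=0x{guid:016x}")
```

With these two values, the GID reported by the SM carries prefix 0xfe80000000000001, while the PortGID in the failing leave requests carries the default prefix; both share the same port GUID, 0x0005ad00000c5ced. That is consistent with the driver assuming the standard prefix.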

Thanks for that pointer. Unfortunately it appears that the card is actually a Cisco one:

ID: Cheetah DDR

PN: SFS-HCA-320-A1

which apparently complicates determining the appropriate firmware for the device since flint does not return a PSID for this device…

Looks like I’m going to need an old version of mstflint from somewhere to determine this for sure. A similar card in a Linux box reports version 1.1.000, which is apparently old (the driver says that 1.2.000 is current). Maybe I’ll try updating if I can find a working version of mstflint and the firmware file somewhere, although updating the firmware on an MT25208 InfiniHost III Ex card made it stop working on our Linux machines.

As an additional idea, if you run “lspci -Qvvs” on the card, some models return the actual engineering part number and revision they use (regardless of OEM branding).

It’s worth trying out, even just for info purposes. Learned it from another member here not long ago (very useful).

(note - edited for typo fixes)

In your initial post you mentioned something like “ping works for a few seconds and then stops”, and you also mentioned that you CAN see the packets on your Ethernet network and at your router (on the other IP range). From all of that, I take it that your setup looks like:

some IB nodes → IB to Eth Gateway → L3 Router → other subnet

Is this correct?

If yes, what kind of gateway are you using (BridgeX, 4036E, or something else)?

Also, it would be good to take a look at your L3 router logs and see what the flow of events was. You might see something like: IGMP kick-started a few MC groups, then they all dropped (along with the reason we all want to hear about ;-) )

What do you see, my friend?

I’m not a Windows guy, so I can’t answer your question directly. I just wanted to ask what the firmware version is on that card, as it might be useful info for the other people around (who can answer).

Doing some Googling turned up this:

Software Error

(translate.google.com does ok with it)

That seems to indicate the card is an MHGS18-XTC. Looking at the Mellanox firmware table and comparing the specs there with the specs on the Cisco site, it seems like it could also be an MHES18-XTC.

Looking at the card itself, do any of the stickers on it have useful info? The “Revision” field can sometimes be useful too. If I had to guess, I’d go with the model mentioned on that Japanese page as a first try.

How many cards do you have, and how much of a problem would it be if one of them dies (e.g. from badly wrong firmware)? It could come down to this.

This is probably the link you want then:

http://old.mellanox.com/content/pages.php?pg=management_tools&menu_section=34

That’s the older Mellanox website, prior to their recent refresh.

If the 2.7.1a version of mstflint doesn’t do what you need, let me know. I downloaded most of the older flint versions several hours ago, as they’re all actually on that same site… you just have to muck around with the path name to use the older version number and voila, they download. (previous version numbers can be found in the release notes)

Hope that helps. I used to use the InfiniHost III Ex cards a few years ago with RHEL/CentOS 6.0. They seemed fine for what I was doing.

Haven’t used them recently though and I don’t have any here to try stuff out on. Good luck though.

(note - edited for typo fix)

I burned it with the MHGS18-XSC Rev A1 (MT_03D0110002) firmware. It didn’t destroy anything, but it didn’t help either. It seems like something is messed up with the IPoIB driver on Windows.