MHGH28-XTC not working

Thank you both for your positive comments. We are here to help and we are very happy that you are pleased with our efforts.

Like I said, I'm a newbie when it comes to routing / IP forwarding. How would I get that to work?

Cheers Todd.

I have finally got SRP working between ESXi 5.1 and Solaris, and have put Windows SBS 2011 Essentials on another box. The drivers for Win 2008 R2 do include an SRP miniport, but it is throwing the (unable to start) error, even though the Subnet Manager etc. are all starting fine.

One post I came across suggested that if you have more than 2 targets then the Windows miniport will fail to start. I have 4. I will probably have to remove a couple and then try again, but after 3 painful days to get to this stage, I will leave it for another day this coming week.

I also agree, the Mellanox guys have been very proactive and helpful which is pretty rare and to be applauded.

Hmm,

I seem to be hitting a wall again.

The above mentioned flashes worked fine for my two Rev A1 cards but my two rev A2 cards just sit there taking 90% cpu and appear to be doing nothing. There is no text output after I type the command in and hit return.

mft status reports as you would expect (see posts above) but mlxburn and flint both just sit there with a blank line and 90% cpu.

flint -d device_name -q reports that it cannot get semaphores (63)

I have run flint -clear-semaphore -d device_id, which seems to complete fine, but mlxburn and flint still just sit there doing nothing when run for flashing afterwards.

I will try the exact command Todd used for his A2-A3 card but I suspect it is the same as I have been running already.

Update: No luck with the command Todd used.

I also see a lot of errors (IB_Timeouts) in /var/log/opensm.log on the Linux box which is running OpenSM. This one jumps out at me (see attachment for them all).

Apr 17 21:00:06 008048 [7A680700] 0x01 → sm_mad_ctrl_send_err_cb: ERR 3120 Timeout while getting attribute 0x15 (PortInfo); Possible mis-set mkey?

Suggestions?
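For anyone wanting to see which errors dominate a log like this, here is a minimal sketch that tallies the `ERR` codes and attribute names. It assumes the line layout matches the single sample quoted above; the regular expression and the `tally_errors` helper are illustrative only and are not part of OpenSM.

```python
import re
from collections import Counter

# Matches "ERR <code>" and, when present, the attribute id and name that follow.
# The pattern is an assumption based on the one sample line in this thread.
ERR_RE = re.compile(
    r"ERR (?P<code>[0-9A-Fa-f]+)\b"
    r"(?:.*?attribute (?P<attr>0x[0-9A-Fa-f]+) \((?P<name>\w+)\))?"
)

def tally_errors(lines):
    """Count (error code, attribute name) pairs across log lines."""
    counts = Counter()
    for line in lines:
        m = ERR_RE.search(line)
        if m:
            # "-" stands in for lines that carry an ERR code but no attribute.
            counts[(m.group("code"), m.group("name") or "-")] += 1
    return counts

sample = [
    "Apr 17 21:00:06 008048 [7A680700] 0x01 -> sm_mad_ctrl_send_err_cb: "
    "ERR 3120 Timeout while getting attribute 0x15 (PortInfo); "
    "Possible mis-set mkey?",
]

for (code, name), n in tally_errors(sample).most_common():
    print(f"ERR {code} {name}: {n}")
```

To run it against a real log, pass an open file instead of `sample`, for example `tally_errors(open("/var/log/opensm.log"))`.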

@rimblock - As a curiosity question, are you definitely using the A2/A3 firmware? It’s a different firmware download than the A1:

http://www.mellanox.com/downloads/firmware/fw-25408-2_9_1000-MHGH28-XTC_A2-A3.bin.zip

Just in case you forgot.

Yeah, spotted that.

What I now have to do is track down which cards are reporting the possible mis-set key and I suspect it will be the A1s that were updated by flint.

A mixture of cards and revisions is really not helping.

Port 1 - Solaris 11.1 - ConnectX-2 (SAN)

Port 2 - ESXi 5.1: ConnectX A1 (flashed)

Port 3 - ESXi 5.1: ConnectX A2 (unflashed)

Port 4 - CentOS 6.4 (OpenSM): ConnectX A2 (unflashed)

Port 5 - ESXi 5.1: ConnectX A1 (flashed)

I have just changed the port 2 connectx A1 card for an A2 card to see if that makes a difference.

The current setup seems to be working and is this…

Port 1 - Solaris 11.1 - ConnectX-2 (SAN)

Port 2 - ESXi 5.1: ConnectX A2 (unflashed)

Port 3 - ESXi 5.1: none

Port 4 - CentOS 6.4 (OpenSM): ConnectX A1 (flashed)

Port 5 - ESXi 5.1: ConnectX A2 (unflashed)

So, the two core ESXi 5.1 hosts have ver A2, the OpenSM box has a ver A1 and the SAN has a ConnectX-2.

I had to remove the Mellanox ESXi vib and then reinstall it as it did not ‘see’ the A2 card. After the reinstall the card popped up and the targets were available.

It also seems that ver A1 will not work with VT-d (passthrough). After assigning the card for passthrough on the ESXi host, rebooting and then adding the card to the Windows 2012 Ess VM as a PCI device, the VM will not start. It produces a caught error (at least it is not a PSOD). I have not yet tried with the A2 version.

I’m all about simple. Any suggestions for addressing?

DC1 10.0.0.1 DC2 10.0.0.2

RAID1 10.0.0.3 RAID2 10.0.0.4

Win7-1 10.0.0.5 Win7-2 10.0.0.6

Addressing isn’t my forte. Or do they have to be on different subnets?

Ok, it appears that the 2.9 firmware is not playing nice with the ConnectX cards in ESXi servers.

It is resulting in the “IB_Timeout” and “mkey incorrectly set” errors. That explains the results by swapping the cards above. The 2.7 firmware does seem to work though.

Thanks to Chuckleb (at Serve The Home forums) for the results of his investigation.

RB

No worries about the time gap. We all get super busy at times and prioritise, etc.

With the problem you’re experiencing, a fundamental bit of info is that OpenSM only attaches to one port when it runs. By default, that is the first one it finds in a server (this can be overridden in the config file).

The way to think about it is that OpenSM starts up and locates the first Infiniband port, then explores/maps the network topology by finding whatever it can through that one port.

The reason I’m emphasising the “through that one port” bit, is to try and highlight that OpenSM won’t see or recognise any of the other ports in that same server (unless there’s an Infiniband switch in place to let the first port see the other ports).

One way to get around this is to have all 3 of your nodes cabled from port 1 (on one box) to port 2 (on the next box), then run OpenSM on them all. That way all ports will come up and be active.

I do this with a 2 node setup (port 1 of each box connecting to port 2 of the other, OpenSM running on both), then I run IPoIB over the top and set up individual IP subnetting for each group of ports so IP connectivity “just works”.
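The "port 1 on one box to port 2 on the next" pattern can be written out as a small sketch. The `ring_cabling` helper and the node names passed to it are purely illustrative, and this only prints a cabling plan; it does not touch any hardware.

```python
def ring_cabling(nodes):
    """Cable port 1 of each node to port 2 of the next node, wrapping around."""
    links = []
    for i, node in enumerate(nodes):
        nxt = nodes[(i + 1) % len(nodes)]  # the last node wraps back to the first
        links.append(((node, 1), (nxt, 2)))
    return links

for (a, port_a), (b, port_b) in ring_cabling(["DC", "RAID", "Win7"]):
    print(f"{a} port {port_a} -> {b} port {port_b}")
```

With two nodes the same function gives port 1 of each box going to port 2 of the other, which is the 2 node layout described above. Each port 1 is the first port on its box, so an OpenSM instance on every node gets a different cable to manage.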

I haven’t yet tried it with a 3 node setup, but probably will do in a few weeks after I’m back in the UK.

Does this help?

(note - edited to improve clarity a bit)

So I have it wired like this

DC port 1 to RAID port 1

DC port 2 to Win7 port 1

RAID port 2 to Win7 port 2

I should rewire it

DC port 1 to RAID port 2

RAID port 1 to Win7 port 2

Win7 port 1 to DC port 2

and run opensm on all the machines?

Pretty much.

Well, that’s the easiest way to make all ports active.

There are other options, such as running more than one OpenSM on a box + manually telling each one which port to use, but they’re more of a pita to set up.

Hopefully the simple approach works well enough for you.

Put them on different subnets. Try this first:

DC port 1 (10.0.0.1) → RAID port 2 (10.0.0.2)

RAID port 1 (10.0.1.1) → Win 7 port 2 (10.0.1.2)

Win 7 port 1 (10.0.2.1) → DC port 2 (10.0.2.2)

Netmask of 255.255.255.0 for everything.

I think that’ll work. May need to enable IP forwarding too though (unsure).
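As a quick sanity check of that addressing plan, here is a minimal sketch using Python's standard `ipaddress` module. It confirms that both ends of each cable land in the same /24 and that no two cables share a subnet. The `links` table simply restates the plan above; `subnet_of` and `check_plan` are made-up helper names.

```python
import ipaddress

NETMASK = "255.255.255.0"

# (host, port, address) at each end of a cable, copied from the plan above.
links = [
    (("DC", 1, "10.0.0.1"), ("RAID", 2, "10.0.0.2")),
    (("RAID", 1, "10.0.1.1"), ("Win7", 2, "10.0.1.2")),
    (("Win7", 1, "10.0.2.1"), ("DC", 2, "10.0.2.2")),
]

def subnet_of(addr):
    """Return the network an address belongs to under NETMASK."""
    return ipaddress.ip_interface(f"{addr}/{NETMASK}").network

def check_plan(links):
    """Return the per-cable subnets, raising if the plan is inconsistent."""
    subnets = []
    for (h1, p1, a1), (h2, p2, a2) in links:
        n1, n2 = subnet_of(a1), subnet_of(a2)
        if n1 != n2:
            raise ValueError(f"{h1} port {p1} and {h2} port {p2} are on different subnets")
        if n1 in subnets:
            raise ValueError(f"subnet {n1} is used on more than one cable")
        subnets.append(n1)
    return subnets

for link, net in zip(links, check_plan(links)):
    (h1, p1, _), (h2, p2, _) = link
    print(f"{h1} port {p1} <-> {h2} port {p2}: {net}")
```

In this three-box ring every pair of machines shares a cable, so each box should be able to reach the other two directly by using the address on the shared link. IP forwarding would probably only matter if a box needs to reach the address on a neighbour's far-side port.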