Problem configuring and using IPoIB

I have IB connecting two nodes, and if I run ibhosts I can see both nodes' GUIDs, but when I run ibqueryerrors it says “2 bad nodes found” and “2 ports have errors beyond thresholds”. I have given the nodes the IP addresses 10.0.0.0 and 10.0.0.1 in my ifcfg-ib0, and when I try to ping one from the other it times out. Looking in the log, the other node has been assigned a LID and there is no indication of an error, but the messages log does not say SUBNET UP. I’m new to IB and don’t really know what the error might be. Has anyone encountered this error before, or does anyone know where I can find more info?
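One thing that may be worth ruling out with the addresses above: under any ordinary netmask, 10.0.0.0 has all host bits set to zero, which makes it the subnet's network address rather than a normal host address. The sketch below checks that with plain shell arithmetic. The /24 prefix is an assumption (the post does not say which netmask is in ifcfg-ib0), and the helper names `ip_to_int` and `is_network_address` are made up for illustration.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1            # split the address into its four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Return 0 (true) if every host bit is zero for the given prefix length,
# i.e. the address is the subnet's network address.
# Meant for prefixes shorter than /31.
is_network_address() {
    addr=$(ip_to_int "$1")
    prefix=$2
    host_mask=$(( (1 << (32 - prefix)) - 1 ))
    [ $(( addr & host_mask )) -eq 0 ]
}

# Assumed /24 prefix; substitute the one from your own ifcfg-ib0.
for ip in 10.0.0.0 10.0.0.1; do
    if is_network_address "$ip" 24; then
        echo "$ip/24 is the network address"
    else
        echo "$ip/24 is a host address"
    fi
done
```

If the check flags one of the nodes, moving to a pair such as 10.0.0.1 and 10.0.0.2 would take that variable out of the picture.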

ibnetdiscover

ibdiagnet -r

sminfo

saquery

saquery -s

After messing with it again last night, I seem to have IPoIB configured: the MTU dropped to 2044, and the IPs that I assigned each node in their ifcfg-ib0 show up when I run “ifconfig ib0”. Now I see the SUBNET UP message in my log, but I still don’t have a connection. When I run ping 10.0.0.1 it times out, and when I try ibping -G it times out as well.
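For context on the 2044 figure: in IPoIB datagram mode the IP MTU is the InfiniBand MTU minus the 4-byte IPoIB encapsulation header, so 2044 is what an interface reports when the IB MTU is 2048. A minimal worked version of that arithmetic follows; the function name `ipoib_datagram_mtu` is made up for illustration.

```shell
# IPoIB datagram mode: IP MTU = IB MTU minus the 4-byte IPoIB header.
IPOIB_HEADER_BYTES=4

ipoib_datagram_mtu() {
    echo $(( $1 - IPOIB_HEADER_BYTES ))
}

# An IB MTU of 2048 gives the value seen on ib0.
ipoib_datagram_mtu 2048
```

So the MTU shows that the interface came up in datagram mode, but it does not by itself prove that traffic flows between the nodes. Also note that ibping only gets replies if the other node is running the responder (`ibping -S`), so a timeout from `ibping -G` is expected when nothing is listening on the far side.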

I ran osmtest and got all of these errors,

and even a few more above that I couldn’t fit into the screen shot.

My opensm.log on the head node running opensm shows that the subnet was up, but a number of errors showed up once I ran osmtest.

I can’t really find anything in my setup that looks like an error, or anything that indicates where the error is. I am using older HCAs (26428s), and when I ran mlnxofedinstall it said it couldn’t update the firmware. Is this potentially a firmware issue?

I installed FlexBoot on both nodes’ HCAs as well, if it makes any difference.

Thanks,

Trevor

Hi Trevor,

Can you send the following output:

  1. ibnetdiscover

  2. ibdiagnet -r (all files under /var/tmp/ibdiagnet2/)

  3. sminfo

  4. saquery

  5. saquery -s

Hi Trevor,

Are you still experiencing issues with IPoIB?

If yes, can you please provide the opensm configuration file you are using (if one exists) and the opensm command line you used to start opensm?

Please also provide the output of:

saquery -g

saquery -m
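A note on why the multicast output matters: IPoIB carries ARP and IP broadcast over an InfiniBand multicast group, so if that group is missing from the subnet manager's records, pings time out even though the interface has an address. The sketch below builds the MGID of the IPv4 broadcast group defined in RFC 4391 for a given partition key, so it can be compared by eye with what `saquery -g` lists. The function name is made up, the default P_Key 0xffff is an assumption about this fabric, and the compressed IPv6-style formatting is an assumption about how saquery prints MGIDs.

```shell
# Build the IPoIB IPv4 broadcast MGID (RFC 4391, link-local scope)
# for a partition key. The P_Key is embedded with its full-membership
# bit (0x8000) set.
ipoib_broadcast_mgid() {
    pkey=${1:-0xffff}    # assumed default partition
    printf 'ff12:401b:%x::ffff:ffff\n' $(( $pkey | 0x8000 ))
}

ipoib_broadcast_mgid
```

If no group with that MGID shows up, or `saquery -m` shows only one of the two nodes as a member, that would point at multicast join problems rather than at the IP configuration.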

I’ve just checked, and I have the most recent FW on the Mellanox site for an MT26428 (2.9.1000).

Does FlexBoot cause issues with trying to use IB normally? ibdiagnet seems to indicate errors in the FW, but the most recent FW I found on the mlnx site is 2.9.1000, and that is what my cards say they have when I run ibv_devinfo. There also seem to be issues with routing, but I’m assuming those are caused by the issues with the firmware…

Thanks,

Trevor

Yes, I am still having this problem, although at this point I believe my IPoIB config is correct and it seems to be a firmware issue.

My opensm folder in /etc is empty, so it looks like I do not have an opensm config file. If you could tell me whether my IPoIB config is correct, that would be very helpful.

Thanks,

Trevor