Hello Martijn
Thanks a lot for the quick answer.
Our scenario is this:
We produce systems with three computers in each.
COMPUTER1 - - COMPUTER2 = = COMPUTER3
- - is the first IB connection
= = is the second IB connection
We use IPoIB.
My problem is that I don’t understand the following:
How do the two subnet managers on COMPUTER2 know that they are in different subnets?
As in an IP network, I would expect some configuration or a routing table.
In our case COMPUTER1 and COMPUTER3 do not communicate. But could they? Could COMPUTER2 act as a router?
I am investigating this because sometimes the connection between COMPUTER2 and COMPUTER3 doesn’t come up.
In the log we see that ib0 comes up:
Aug 31 12:17:13 localhost kernel: IPv6: ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready
whereas ib1 stays not ready:
Aug 31 12:17:13 localhost kernel: IPv6: ADDRCONF(NETDEV_UP): ib1: link is not ready
While examining the IB fabric I used IB tools such as ibnetdiscover and ibhosts.
On COMPUTER2 they dumped only information about the first connection and then ended with a timeout.
For example:
[root@COMPUTER2]$ ibhosts
src/query_smp.c:195; umad (DR path slid 0; dlid 0; 0,1 Attr 0xff90:1) bad status 110; Connection timed out
Ca : 0xe41d2d030047c3a0 ports 2 " COMPUTER1"
Ca : 0x506b4b03004e4c00 ports 2 " COMPUTER2"
I also tried to query by the GUID of each port.
For example:
With the GUID of the first port everything is fine.
[root@xct-eds hitrax]$ ibaddr -G 0x506b4b03004e4c01
GID fe80::506b:4b03:4e:4c01 LID start 0x1 end 0x1
With the GUID of the second port the query fails.
[root@xct-eds hitrax]$ ibaddr -G 0x506b4b03004e4c02
ibwarn: [107057] ib_path_query_via: sa call path_query failed
ibaddr: iberror: failed: can’t resolve destination port 0x506b4b03004e4c02
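To double-check that the GID reported for the first port is just a subnet prefix followed by the port GUID, here is a minimal Python sketch. It assumes the default InfiniBand subnet prefix 0xfe80000000000000 is in use; the function name port_gid is my own and not part of any IB tool.

```python
import ipaddress

# Assumption: the subnet uses the default InfiniBand subnet prefix.
DEFAULT_SUBNET_PREFIX = 0xFE80000000000000


def port_gid(port_guid: int, subnet_prefix: int = DEFAULT_SUBNET_PREFIX) -> str:
    """Build a 128-bit GID: upper 64 bits are the prefix, lower 64 bits the port GUID."""
    if not 0 <= port_guid < 2**64:
        raise ValueError("port GUID must fit in 64 bits")
    if not 0 <= subnet_prefix < 2**64:
        raise ValueError("subnet prefix must fit in 64 bits")
    # Format the combined value in the usual compressed IPv6 notation.
    return str(ipaddress.IPv6Address((subnet_prefix << 64) | port_guid))


if __name__ == "__main__":
    # The two port GUIDs of COMPUTER2 used in the ibaddr calls above.
    for guid in (0x506B4B03004E4C01, 0x506B4B03004E4C02):
        print(hex(guid), "->", port_gid(guid))
```

For the first port this reproduces the GID printed by ibaddr. Under the same assumption the second port would get the same fe80:: prefix, so the prefix by itself would not distinguish the two connections.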
In opensm.log I see similar timeouts:
Sep 08 05:23:50 069680 [9D0EA700] 0x01 → mcmr_rcv_join_mgrp: ERR 1B11: Port 0xe41d2d030047c3a1 (rcc HCA-1) failed to join non-existing multicast group with MGID ff12:601b:ffff::2, insufficient components specified for implicit create (comp_mask 0x10083)
Sep 08 05:23:52 579650 [9D0EA700] 0x01 → mcmr_rcv_join_mgrp: ERR 1B11: Port 0xe41d2d030047c3a1 (rcc HCA-1) failed to join non-existing multicast group with MGID ff12:601b:ffff::16, insufficient components specified for implicit create (comp_mask 0x10083)
Sep 08 05:23:54 069575 [9D0EA700] 0x01 → mcmr_rcv_join_mgrp: ERR 1B11: Port 0xe41d2d030047c3a1 (rcc HCA-1) failed to join non-existing multicast group with MGID ff12:601b:ffff::2, insufficient components specified for implicit create (comp_mask 0x10083)
Sep 08 05:23:56 805946 [9B0E6700] 0x01 → log_send_error: ERR 5411: DR SMP Send completed with error (IB_TIMEOUT) – dropping
Method 0x1, Attr 0xFF90, TID 0x1256
Sep 08 05:23:56 805990 [9B0E6700] 0x01 → Received SMP on a 1 hop path: Initial path = 0,1, Return path = 0,0
Sep 08 05:23:56 806006 [9B0E6700] 0x01 → sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT): SubnGet(MLNXExtendedPortInfo), attr_mod 0x1, TID 0x1256
Sep 08 05:23:56 806018 [9B0E6700] 0x01 → sm_mad_ctrl_send_err_cb: ERR 3120: Timeout while getting attribute 0xFF90 (MLNXExtendedPortInfo); Possible mis-set mkey?
Sep 08 05:23:56 806732 [9C0E8700] 0x02 → SUBNET UP
Sep 08 05:23:58 069294 [9D0EA700] 0x01 → mcmr_rcv_join_mgrp: ERR 1B11: Port 0xe41d2d030047c3a1 (rcc HCA-1) failed to join non-existing multicast group with MGID ff12:601b:ffff::2, insufficient components specified for implicit create (comp_mask 0x10083)
Sep 08 05:24:06 805934 [9B0E6700] 0x01 → log_send_error: ERR 5411: DR SMP Send completed with error (IB_TIMEOUT) – dropping
Method 0x1, Attr 0xFF90, TID 0x125c
Sep 08 05:24:06 805983 [9B0E6700] 0x01 → Received SMP on a 1 hop path: Initial path = 0,1, Return path = 0,0
Sep 08 05:24:06 806001 [9B0E6700] 0x01 → sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT): SubnGet(MLNXExtendedPortInfo), attr_mod 0x1, TID 0x125c
Sep 08 05:24:06 806012 [9B0E6700] 0x01 → sm_mad_ctrl_send_err_cb: ERR 3120: Timeout while getting attribute 0xFF90 (MLNXExtendedPortInfo); Possible mis-set mkey?
Sep 08 05:24:06 806968 [9C0E8700] 0x02 → SUBNET UP
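To see how often each error shows up over a longer run than this excerpt, a small Python sketch like the following could tally the error codes. The function name count_error_codes and the regular expression are my own assumptions about the log layout, based only on the lines shown above.

```python
import re
from collections import Counter

# Matches error tags such as "ERR 1B11:" or "ERR 3120:" anywhere in a line.
ERR_RE = re.compile(r"\bERR ([0-9A-Fa-f]{4}):")


def count_error_codes(lines):
    """Return a Counter mapping each error code to its number of occurrences."""
    counts = Counter()
    for line in lines:
        match = ERR_RE.search(line)
        if match:
            counts[match.group(1).upper()] += 1
    return counts


# A few lines copied from the excerpt above.
SAMPLE = [
    "Sep 08 05:23:50 069680 [9D0EA700] 0x01 → mcmr_rcv_join_mgrp: ERR 1B11: Port 0xe41d2d030047c3a1 (rcc HCA-1) failed to join non-existing multicast group with MGID ff12:601b:ffff::2, insufficient components specified for implicit create (comp_mask 0x10083)",
    "Sep 08 05:23:56 806006 [9B0E6700] 0x01 → sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT): SubnGet(MLNXExtendedPortInfo), attr_mod 0x1, TID 0x1256",
    "Sep 08 05:23:56 806018 [9B0E6700] 0x01 → sm_mad_ctrl_send_err_cb: ERR 3120: Timeout while getting attribute 0xFF90 (MLNXExtendedPortInfo); Possible mis-set mkey?",
    "Sep 08 05:23:56 806732 [9C0E8700] 0x02 → SUBNET UP",
    "Sep 08 05:23:58 069294 [9D0EA700] 0x01 → mcmr_rcv_join_mgrp: ERR 1B11: Port 0xe41d2d030047c3a1 (rcc HCA-1) failed to join non-existing multicast group with MGID ff12:601b:ffff::2, insufficient components specified for implicit create (comp_mask 0x10083)",
]

if __name__ == "__main__":
    for code, n in sorted(count_error_codes(SAMPLE).items()):
        print(code, n)
```

Passing an open file object for the real opensm.log instead of SAMPLE would work the same way, since the function only iterates over lines.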
In most cases these timeouts do not seem to cause problems, but it looks to me as if our computers are badly configured.
Every fiftieth machine we produce has the problem that the second port on COMPUTER2 (ib1) does not come up. Normally we simply exchange the computers.
So my first idea was that we could have some kind of routing problem that affects MADs (Management Datagrams) and therefore confuses the subnet manager.
The really important question for me is: how do the two subnet managers on COMPUTER2 know that they are in different subnets?
Many Thanks
Maik