We are testing a torus topology in our environment and have run into an issue. Two fabrics are currently under test: a 4x2x1, which works, and a 7x2x1, which doesn’t. The 7x2x1 shows credit loop problems when validated with ibdmchk (OFED 1.5.3.4.0.12).
Using ibsim, I replicated the credit loop problem on a virtual subnet configured with the same number of CAs and switches. I then reconfigured it as a 4x3x1 with the 7th cabinet disconnected, and I think we can work with that. My first guess was that the problem stemmed from the Y axis being only 2 switches tall, so that “ym_link” and “yp_link” had the same source and destination.
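To make that suspicion concrete, here is a minimal sketch, assuming the switches along one axis are numbered 0 to radix-1 and wrap around as in a torus. The function name `ring_neighbors` is made up for illustration and is not part of OpenSM or ibsim.

```python
def ring_neighbors(pos, radix):
    """Return the (plus, minus) neighbour positions on a ring of `radix` switches."""
    return ((pos + 1) % radix, (pos - 1) % radix)

# Compare the 2-tall Y axis with taller rings.
for radix in (2, 3, 4, 7):
    plus, minus = ring_neighbors(0, radix)
    verdict = "same switch" if plus == minus else "distinct switches"
    print(f"radix {radix}: plus={plus} minus={minus} ({verdict})")
```

Only the radix-2 ring gives the same neighbour in both directions, which is the case where “ym_link” and “yp_link” would end up with identical endpoints.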
Out of curiosity I created a 36x3x1 fabric with 36-port switches and 16 CAs per switch. This still causes credit loops and I have no idea why. Any thoughts? I’m using ibsim 0.5 for the tests, and the topology is generated by a Perl script I wrote. The switches are chained in three rows:
Switch1 => Switch2 => Switch3 => …
Switch37 => Switch38 => Switch39 => …
Switch73 => Switch74 => Switch75 => …
Here is the torus-2QoS.conf file. I used Switch73 as the seed at coordinates 0,0,0:
torus 36t 3t 1
xp_link 0x200048 0x200049
xm_link 0x200048 0x20006b
yp_link 0x200048 0x200000
ym_link 0x200048 0x200024
portgroup_max_ports 20
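For anyone checking the link GUIDs above, this is a small sketch of how they follow from the switch numbering. It assumes SwitchN has GUID 0x200000 + (N - 1) and that the rows of 36 wrap in both X and Y; both assumptions are inferred from the values in the config, and the helper names are hypothetical (this is not my Perl generator).

```python
BASE_GUID = 0x200000        # assumption: GUID of Switch1
X_RADIX, Y_RADIX = 36, 3    # switches per row, number of rows

def switch_guid(n):
    """GUID of SwitchN under the assumed numbering scheme."""
    return BASE_GUID + (n - 1)

def neighbors(n):
    """Switch numbers of the +x, -x, +y, -y neighbours of SwitchN, with wraparound."""
    row, col = divmod(n - 1, X_RADIX)

    def at(r, c):
        return (r % Y_RADIX) * X_RADIX + (c % X_RADIX) + 1

    return {
        "xp_link": at(row, col + 1),
        "xm_link": at(row, col - 1),
        "yp_link": at(row + 1, col),
        "ym_link": at(row - 1, col),
    }

def seed_config(seed):
    """Config lines describing the seed switch and its four neighbours."""
    lines = [f"torus {X_RADIX}t {Y_RADIX}t 1"]
    for key, nb in neighbors(seed).items():
        lines.append(f"{key} {switch_guid(seed):#x} {switch_guid(nb):#x}")
    return lines

print("\n".join(seed_config(73)))
```

With Switch73 as the seed this reproduces the torus, xp_link, xm_link, yp_link and ym_link lines shown above, so the seed links at least look consistent with the intended layout.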
Here is the output from opensm.log upon starting it up:
[root@cinhpcdev4 torus-test2]# cat opensm.log
Nov 07 12:10:32 997544 [9E1C3780] 0x43 -> OpenSM 3.3.13.MLNX_20130110_cd124d3
Nov 07 12:10:32 999452 [9E1C3780] 0x80 -> OpenSM 3.3.13.MLNX_20130110_cd124d3
Nov 07 12:10:33 275154 [9E1C3780] 0x02 -> osm_vendor_init: 100 pending umads specified
Nov 07 12:10:33 294725 [9E1C3780] 0x80 -> Entering DISCOVERING state
Nov 07 12:10:33 331269 [9E1C3780] 0x02 -> osm_vendor_bind: Binding to port 0x200000
Nov 07 12:10:33 432034 [9E1C3780] 0x02 -> osm_vendor_bind: Binding to port 0x200000
Nov 07 12:10:33 449985 [9E1C3780] 0x02 -> osm_vendor_bind: Binding to port 0x200000
Nov 07 12:10:33 468021 [9E1C3780] 0x02 -> osm_opensm_bind: Setting IS_SM on port 0x0000000000200000
Nov 07 12:10:34 637631 [CF49E940] 0x80 -> Entering MASTER state
Nov 07 12:10:34 637692 [CF49E940] 0x01 -> osm_prtn_make_partitions: Partition configuration ./partitions.conf is not accessible (No such file or directory)
Nov 07 12:10:37 256120 [CF49E940] 0x02 -> torus_build_lfts: Found fabric w/ 2592 links, 108 switches, 1728 CA ports, minimum 8 data VLs
Nov 07 12:10:37 256148 [CF49E940] 0x02 -> torus_build_lfts: Looking for 36 x 3 x 1 torus
Nov 07 12:10:37 256165 [CF49E940] 0x02 -> build_torus: Using torus seed configured as default (seed sw 0,0,0 GUID 0x200048).
Nov 07 12:10:37 257829 [CF49E940] 0x02 -> torus_build_lfts: Built 36 x 3 x 1 torus w/ 2592 links, 108 switches, 1728 CA ports
Nov 07 12:10:37 309800 [CF49E940] 0x02 -> osm_ucast_mgr_process: torus-2QoS tables configured on all switches
Nov 07 12:10:37 309874 [CF49E940] 0x01 -> osm_qos_parse_policy_file: ERR AC01: Failed opening QoS policy file ./qos-policy.conf - No such file or directory
Nov 07 12:10:45 156308 [CF49E940] 0x02 -> SUBNET UP
Nov 07 12:10:45 163642 [9E1C3780] 0x80 -> Exiting SM
And here is the relevant output from ibdmchk -s ./opensm-subnet.lst -f ./opensm.fdbs -m ./opensm.mcfdbs -d ./opensm-sl2vl.dump
-I- Scanning all multicast groups for loops and connectivity…
-I- Using full credit loop check.
-I- Analyzing Fabric for Credit Loops 1 SLs, 8 VLs used.
-I- Traced 2984256 unicast paths
-E- Credit loop found on the following path:
S0000000000100000/N0000000000100000/P1 VL: 0 on path from lid: 0x0002 to lid: 0x03a8
S0000000000200000/N0000000000200000/P21 VL: 0 on path from lid: 0x0002 to lid: 0x03a8
S0000000000200023/N0000000000200023/P21 VL: 0 on path from lid: 0x0734 to lid: 0x0196
S0000000000200022/N0000000000200022/P21 VL: 0 on path from lid: 0x061a to lid: 0x0156
…
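As a sanity check on the numbers in the two outputs above, here is a rough sketch of the arithmetic. It assumes every CA port accounts for exactly one of the 2592 links, that each switch has four distinct torus neighbours (true when both radices are at least 3), and that ibdmchk traces one path per ordered pair of CA ports. These are my assumptions, not something the tools state.

```python
X_RADIX, Y_RADIX, Z_RADIX = 36, 3, 1
CAS_PER_SWITCH = 16
TOTAL_LINKS = 2592                             # "Found fabric w/ 2592 links" in opensm.log

switches = X_RADIX * Y_RADIX * Z_RADIX         # 108, matches the log
ca_ports = switches * CAS_PER_SWITCH           # 1728, matches the log

# Links left over once each CA port has taken one link (assumption).
switch_links = TOTAL_LINKS - ca_ports

# Four neighbours per switch (x+, x-, y+, y-), each pair counted once.
neighbor_pairs = switches * 4 // 2
links_per_pair, leftover = divmod(switch_links, neighbor_pairs)

# One unicast path per ordered (source, destination) pair of CA ports.
unicast_paths = ca_ports * (ca_ports - 1)      # 2984256, matches "Traced 2984256 unicast paths"

print(f"switches={switches} ca_ports={ca_ports}")
print(f"switch_links={switch_links} links_per_pair={links_per_pair} leftover={leftover}")
print(f"unicast_paths={unicast_paths}")
```

Under those assumptions the counts are self-consistent: 864 inter-switch links spread evenly as 4 links per neighbouring switch pair, and the traced path count equals 1728 x 1727. So the fabric appears to have been discovered as intended, and the credit loop does not look like a simple miscount of switches or links.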