Testing sub-optimal shapes in torus-2QoS

We are testing a torus topology in our environment and have run into an issue. Two configurations are currently in testing: a 4x2x1, which works, and a 7x2x1, which does not. The 7x2x1 shows credit loop problems when validated with ibdmchk.

(ofed 1.5.3.4.0.12)

Using ibsim, I replicated the credit loop problem with a virtual subnet configured with the same number of CAs and switches. I then reconfigured it as a 4x3x1 with the 7th cabinet disconnected, and I think we can work with that. I thought the problem might have stemmed from the Y axis being only 2 switches tall, so that the “ym_link” and “yp_link” had the same source and destination.

Out of curiosity, I created a 36x3x1 fabric with 36-port switches and 16 CAs per switch. This still causes credit loops and I have NO IDEA why. Any thoughts? I’m using ibsim 0.5 for the tests. The topology is created via a Perl script I wrote. Below are the switch layout and the torus-2QoS.conf file; I used Switch73 as coordinates 0x0x0.

Switch1 => Switch2 => Switch3 => …

Switch37 => Switch38 => Switch39 => …

Switch73 => Switch74 => Switch75 => …

torus 36t 3t 1

xp_link 0x200048 0x200049

xm_link 0x200048 0x20006b

yp_link 0x200048 0x200000

ym_link 0x200048 0x200024

portgroup_max_ports 20
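
The four seed links can be cross-checked against the switch numbering. The following is a minimal Python sketch, assuming (as the GUIDs in the file suggest, but the post does not state) that SwitchN has GUID 0x200000 + (N - 1) and that the switches are laid out in rows of 36. The names `switch_guid` and `seed_links` are hypothetical helpers, not part of OpenSM or of the Perl script.

```python
BASE_GUID = 0x200000       # assumption: GUID of Switch1
X_RADIX, Y_RADIX = 36, 3   # radices from the "torus 36t 3t 1" line


def switch_guid(n):
    """GUID of SwitchN under the assumed numbering (Switch1 -> 0x200000)."""
    return BASE_GUID + (n - 1)


def seed_links(seed):
    """Return the four seed link lines for the given seed switch number."""
    row, col = divmod(seed - 1, X_RADIX)

    def at(r, c):
        # Switch number at (row, col), wrapping around in both dimensions.
        return (r % Y_RADIX) * X_RADIX + (c % X_RADIX) + 1

    neighbours = {
        "xp_link": at(row, col + 1),
        "xm_link": at(row, col - 1),
        "yp_link": at(row + 1, col),
        "ym_link": at(row - 1, col),
    }
    return [
        f"{key} {switch_guid(seed):#x} {switch_guid(n):#x}"
        for key, n in neighbours.items()
    ]


for line in seed_links(73):
    print(line)
```

With Switch73 as the seed, this reproduces the four link lines of the file above, which suggests the seed description is at least self-consistent under that numbering.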

Here is the output from opensm.log upon starting it up:

[root@cinhpcdev4 torus-test2]# cat opensm.log

Nov 07 12:10:32 997544 [9E1C3780] 0x43 -> OpenSM 3.3.13.MLNX_20130110_cd124d3

Nov 07 12:10:32 999452 [9E1C3780] 0x80 -> OpenSM 3.3.13.MLNX_20130110_cd124d3

Nov 07 12:10:33 275154 [9E1C3780] 0x02 -> osm_vendor_init: 100 pending umads specified

Nov 07 12:10:33 294725 [9E1C3780] 0x80 -> Entering DISCOVERING state

Nov 07 12:10:33 331269 [9E1C3780] 0x02 -> osm_vendor_bind: Binding to port 0x200000

Nov 07 12:10:33 432034 [9E1C3780] 0x02 -> osm_vendor_bind: Binding to port 0x200000

Nov 07 12:10:33 449985 [9E1C3780] 0x02 -> osm_vendor_bind: Binding to port 0x200000

Nov 07 12:10:33 468021 [9E1C3780] 0x02 -> osm_opensm_bind: Setting IS_SM on port 0x0000000000200000

Nov 07 12:10:34 637631 [CF49E940] 0x80 -> Entering MASTER state

Nov 07 12:10:34 637692 [CF49E940] 0x01 -> osm_prtn_make_partitions: Partition configuration ./partitions.conf is not accessible (No such file or directory)

Nov 07 12:10:37 256120 [CF49E940] 0x02 -> torus_build_lfts: Found fabric w/ 2592 links, 108 switches, 1728 CA ports, minimum 8 data VLs

Nov 07 12:10:37 256148 [CF49E940] 0x02 -> torus_build_lfts: Looking for 36 x 3 x 1 torus

Nov 07 12:10:37 256165 [CF49E940] 0x02 -> build_torus: Using torus seed configured as default (seed sw 0,0,0 GUID 0x200048).

Nov 07 12:10:37 257829 [CF49E940] 0x02 -> torus_build_lfts: Built 36 x 3 x 1 torus w/ 2592 links, 108 switches, 1728 CA ports

Nov 07 12:10:37 309800 [CF49E940] 0x02 -> osm_ucast_mgr_process: torus-2QoS tables configured on all switches

Nov 07 12:10:37 309874 [CF49E940] 0x01 -> osm_qos_parse_policy_file: ERR AC01: Failed opening QoS policy file ./qos-policy.conf - No such file or directory

Nov 07 12:10:45 156308 [CF49E940] 0x02 -> SUBNET UP

Nov 07 12:10:45 163642 [9E1C3780] 0x80 -> Exiting SM
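
The counts reported by torus_build_lfts can be sanity-checked against the intended fabric. This is a small Python sketch of that arithmetic. It assumes that the logged link total includes the switch-to-CA links, which is an interpretation of the log line rather than something the log states.

```python
x_radix, y_radix, z_radix = 36, 3, 1
cas_per_switch = 16
logged_links = 2592  # from the torus_build_lfts log line

switches = x_radix * y_radix * z_radix
ca_ports = switches * cas_per_switch

# Assumption: the logged total counts CA links as well as inter-switch links.
inter_switch_links = logged_links - ca_ports

# In a wrapped 2D torus with both radices above 2, every switch has one x+
# and one y+ neighbour, giving two neighbour pairs per switch.
neighbour_pairs = 2 * switches
links_per_pair, leftover = divmod(inter_switch_links, neighbour_pairs)

print(switches, ca_ports, inter_switch_links, links_per_pair, leftover)
```

The switch and CA port counts match the log (108 and 1728). Under the stated assumption, the remaining links divide evenly across the neighbour pairs, so the discovered fabric at least has the shape the script was meant to build.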

And here is the relevant output from ibdmchk -s ./opensm-subnet.lst -f ./opensm.fdbs -m ./opensm.mcfdbs -d ./opensm-sl2vl.dump

-I- Scanning all multicast groups for loops and connectivity…


-I- Using full credit loop check.

-I- Analyzing Fabric for Credit Loops 1 SLs, 8 VLs used.

-I- Traced 2984256 unicast paths

-E- Credit loop found on the following path:

S0000000000100000/N0000000000100000/P1 VL: 0 on path from lid: 0x0002 to lid: 0x03a8

S0000000000200000/N0000000000200000/P21 VL: 0 on path from lid: 0x0002 to lid: 0x03a8

S0000000000200023/N0000000000200023/P21 VL: 0 on path from lid: 0x0734 to lid: 0x0196

S0000000000200022/N0000000000200022/P21 VL: 0 on path from lid: 0x061a to lid: 0x0156
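
To see where the reported hops sit in the torus, the ibdmchk lines can be mapped back to coordinates. The Python sketch below does this under the same assumptions as the config file: switch GUIDs are 0x200000 + (N - 1), laid out in rows of 36, with Switch73 as the seed at the origin. The regular expression is written against the four lines shown above only; `parse_hop` and `torus_coords` are hypothetical names, and the coordinates OpenSM actually assigned may differ.

```python
import re

BASE_GUID = 0x200000       # assumption: GUID of Switch1
X_RADIX, Y_RADIX = 36, 3
SEED = 73                  # Switch73 is the configured seed

HOP_RE = re.compile(
    r"S(?P<guid>[0-9a-fA-F]{16})/N[0-9a-fA-F]{16}/P(?P<port>\d+)"
    r"\s+VL:\s*(?P<vl>\d+)\s+on path from lid:\s*(?P<src>0x[0-9a-fA-F]+)"
    r"\s+to lid:\s*(?P<dst>0x[0-9a-fA-F]+)"
)


def torus_coords(guid):
    """Return (x, y) relative to the seed, or None for a non-switch GUID."""
    index = guid - BASE_GUID
    if not 0 <= index < X_RADIX * Y_RADIX:
        return None
    row, col = divmod(index, X_RADIX)
    seed_row, seed_col = divmod(SEED - 1, X_RADIX)
    return ((col - seed_col) % X_RADIX, (row - seed_row) % Y_RADIX)


def parse_hop(line):
    """Parse one credit loop line into a dict, or None if it does not match."""
    m = HOP_RE.search(line)
    if m is None:
        return None
    guid = int(m.group("guid"), 16)
    return {
        "guid": guid,
        "port": int(m.group("port")),
        "vl": int(m.group("vl")),
        "src_lid": int(m.group("src"), 16),
        "dst_lid": int(m.group("dst"), 16),
        "coords": torus_coords(guid),
    }


hop = parse_hop(
    "S0000000000200023/N0000000000200023/P21 VL: 0 "
    "on path from lid: 0x0734 to lid: 0x0196"
)
print(hop["coords"], hop["port"], hop["vl"])
```

Under these assumptions, the three switch hops (0x200000, 0x200023, 0x200022) fall on the same row at x = 0, 35 and 34, that is, on either side of the x wraparound, all on VL 0. The first line's GUID (0x100000) lies outside the assumed switch range and is presumably a CA.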

I don’t know whether this is still an issue, but here are some comments:

I have not tested any of the topologies you mention:

4x2x1

7x2x1

36x3x1

The largest torus I’ve verified is 10x10x10.

These are all 2D rather than 3D tori. Note that a 2D torus must be configured with either the x or y radix as 1 (i.e. configured as either a 1 x m x n or an m x 1 x n torus).
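
As an illustration of that advice, here is a minimal Python sketch that rewrites an m x n x 1 torus-2QoS.conf into the 1 x m x n form by shifting each axis up by one (x becomes y, y becomes z). The function name `shift_axes` is hypothetical, and the sketch only handles the keywords that appear in the file posted above plus the per-axis dateline keywords. It is a text transformation only and was not run against OpenSM in this thread.

```python
AXIS_MAP = {"x": "y", "y": "z"}
AXIS_SUFFIXES = ("p_link", "m_link", "_dateline")


def shift_axes(conf_text):
    """Rewrite an 'm n 1' torus-2QoS.conf as '1 m n' (x -> y, y -> z)."""
    out = []
    for line in conf_text.splitlines():
        fields = line.split()
        if not fields:
            out.append(line)
            continue
        key = fields[0]
        if key in ("torus", "mesh") and len(fields) == 4:
            x, y, z = fields[1:]
            # Only a radix-1 z dimension can be moved to x without losing links.
            if z.rstrip("mMtT") != "1":
                raise ValueError("z radix must be 1 to shift axes")
            fields = [key, z, x, y]
        elif key[:1] in AXIS_MAP and key[1:] in AXIS_SUFFIXES:
            fields[0] = AXIS_MAP[key[0]] + key[1:]
        out.append(" ".join(fields))
    return "\n".join(out)


original = """torus 36t 3t 1
xp_link 0x200048 0x200049
xm_link 0x200048 0x20006b
yp_link 0x200048 0x200000
ym_link 0x200048 0x200024
portgroup_max_ports 20"""

print(shift_axes(original))
```

For the 36x3x1 file, this yields a `torus 1 36t 3t` line with the seed links expressed as yp_link/ym_link and zp_link/zm_link. The same approach with a different mapping would give the m x 1 x n form.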

Also, the ones with a 2x1 are limited in fault tolerance (link or switch failure) in the dimension with 2 switches, but this has nothing to do with credit loops in the non-faulted case.

It looks like you are using the MLNX OFED OpenSM. There have been a number of fixes and improvements to torus-2QoS since the version you are using. If this is still of interest and still a problem, I would recommend updating to the most recent version (either MLNX OFED or upstream, where 3.3.18 is the latest release) and retrying. If it’s still a problem, would you post your ibnetdiscover output and the OpenSM configuration?