SM LID is not configured warning

Hi All,

I have a pair of 4036 switches connected to a few server nodes. Each server node has a dual-port ConnectX-2 card with one link to each switch. The switches then have two connections between each other. I am able to send data across the network and all seems OK, but running ibchecknet on either switch throws some warnings:

#warn: Lid is not configured lid 6 port 2

#warn: SM Lid is not configured

Port check lid 6 port 2: FAILED

#warn: Lid is not configured lid 6 port 34

#warn: SM Lid is not configured

Port check lid 6 port 34: FAILED

#warn: Lid is not configured lid 6 port 35

#warn: SM Lid is not configured

Port check lid 6 port 35: FAILED

#warn: Lid is not configured lid 1 port 1

#warn: SM Lid is not configured

Port check lid 1 port 1: FAILED

ibnetdiscover does show all devices as expected, so I am unsure what these warnings are about. Any advice?

112 MB/s * 8 bits/byte = 896 Mb/s, which is close to 1 Gbps, so that makes sense to me.

1500-2000 MB/s = 12-16 Gb/s. 4x QDR IB is 40 Gbps signaling with a 32 Gbps max data rate, so that is 0.375-0.5 of the 4x QDR IB "line" rate.
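For reference, here is the same arithmetic as a small Python sketch (just unit conversion; the 32 Gbps figure is the 4x QDR data rate after 8b/10b encoding):

# Convert MB/s figures to Gb/s and compare against the 4x QDR max data rate.
MB_S_TO_GB_S = 8 / 1000          # 1 MB/s = 8 Mb/s = 0.008 Gb/s
QDR_4X_DATA_RATE_GBPS = 32       # 40 Gbps signaling minus 8b/10b encoding overhead

for mb_s in (112, 1500, 2000):
    gb_s = mb_s * MB_S_TO_GB_S
    print(f"{mb_s:>4} MB/s = {gb_s:5.2f} Gb/s ({gb_s / QDR_4X_DATA_RATE_GBPS:.0%} of 4x QDR data rate)")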

Those scripts were deprecated back in April 2011.

Try ibqueryerrors

I don’t know if this exists in a Windows environment though.

The switches have two links between each other (ports 1-1 and 2-2). Each server has one link to each switch (ports 34-36).

Here is the output of ibnetdiscover:

4036-SW1(utilities)# ibnetdiscover

Topology file: generated on Thu Jun 30 10:32:28 2016

Initiated from node 0008f10500203b28 port 0008f10500203b28

vendid=0x8f1

devid=0x5a5a

sysimgguid=0x8f10500109553

switchguid=0x8f10500109552(8f10500109552)

Switch 36 "S-0008f10500109552" # "Mellanox 4036 # 4036-SW2" enhanced port 0 lid 6 lmc 0

[1] "S-0008f10500203b28"[1] # "Mellanox 4036 # 4036-SW1" lid 1 4xQDR

[2] "S-0008f10500203b28"[2] # "Mellanox 4036 # 4036-SW1" lid 1 4xQDR

[34] "H-0002c903004e445a"[1] # "IGA-S2D1" lid 2 4xQDR

[35] "H-0008f104039a3c1c"[2] # "IGA-S2D2" lid 5 4xQDR

[36] "H-0008f104039a4e3c"[2] # "IGA-S2D3" lid 8 4xQDR

vendid=0x8f1

devid=0x5a5a

sysimgguid=0x8f10500203b29

switchguid=0x8f10500203b28(8f10500203b28)

Switch 36 "S-0008f10500203b28" # "Mellanox 4036 # 4036-SW1" enhanced port 0 lid 1 lmc 0

[1] "S-0008f10500109552"[1] # "Mellanox 4036 # 4036-SW2" lid 6 4xQDR

[2] "S-0008f10500109552"[2] # "Mellanox 4036 # 4036-SW2" lid 6 4xQDR

[34] "H-0002c903004e445a"[2] # "IGA-S2D1" lid 3 4xQDR

[35] "H-0008f104039a3c1c"[1] # "IGA-S2D2" lid 4 4xQDR

[36] "H-0008f104039a4e3c"[1] # "IGA-S2D3" lid 7 4xQDR

vendid=0x2c9

devid=0x673c

sysimgguid=0x8f104039a4e3f

caguid=0x8f104039a4e3c

Ca 2 "H-0008f104039a4e3c" # "IGA-S2D3"

[1] "S-0008f10500203b28"[36] # lid 7 lmc 0 "Mellanox 4036 # 4036-SW1" lid 1 4xQDR

[2] "S-0008f10500109552"[36] # lid 8 lmc 0 "Mellanox 4036 # 4036-SW2" lid 6 4xQDR

vendid=0x2c9

devid=0x673c

sysimgguid=0x8f104039a3c1f

caguid=0x8f104039a3c1c

Ca 2 "H-0008f104039a3c1c" # "IGA-S2D2"

[1] "S-0008f10500203b28"[35] # lid 4 lmc 0 "Mellanox 4036 # 4036-SW1" lid 1 4xQDR

[2] "S-0008f10500109552"[35] # lid 5 lmc 0 "Mellanox 4036 # 4036-SW2" lid 6 4xQDR

vendid=0x2c9

devid=0x673c

sysimgguid=0x2c903004e445d

caguid=0x2c903004e445a

Ca 2 "H-0002c903004e445a" # "IGA-S2D1"

[1] "S-0008f10500109552"[34] # lid 2 lmc 0 "Mellanox 4036 # 4036-SW2" lid 6 4xQDR

[2] "S-0008f10500203b28"[34] # lid 3 lmc 0 "Mellanox 4036 # 4036-SW1" lid 1 4xQDR

4036-SW1(utilities)#

Ping and data transfer work, but the transfer speed is only about 2gb/sec, which seems very low. I am still trying to figure out whether I have a switch/fabric issue or whether the slow speed is just configuration. The servers are Windows, using the latest supported Mellanox driver for the ConnectX-2 cards (4.80) and firmware (2.10.720). RDMA appears to be detected fine on the hosts, so at least some of the config seems to be correct =/

Hi Hal,

Thanks for the help on this.

I am running between two servers with the following:

ntttcp.exe -r -a 16 -t 15 -m 16,*,10.1.4.11

ntttcp.exe -s -a 16 -t 15 -m 16,*,10.1.4.11

It does not seem to matter which two servers I use (all three are identical anyway). PortXmitWait increases quite a bit on the send side (both node and switch ports) when running these tests.

PortXmitWait on the sending side means some link is slow (I suspect the sending side is faster than the receiving side).

I’m not familiar with Windows performance tuning.

Is ntttcp throughput in bytes/sec or bits/sec? I was assuming bits/sec, but it looks like it might be bytes/sec from the ntttcp posts I just looked at.

Sorry, you mention those scripts as deprecated. Is there a newer set I should be using to test this?

Running ntttcp.exe per http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf

Throughput(MB/s) = 1531.362

Results are similar with file copy or other network test tools.

ibqueryerrors does exist, but there is no -f switch on Windows, it seems. This is the base output:

PS C:\> ibqueryerrors

Errors for "IGA-S2D3"

GUID 0x8f104039a4e3d port 1: [PortXmitWait == 1]

GUID 0x8f104039a4e3e port 2: [PortXmitWait == 541647202]

Errors for 0x8f10500203b28 "Mellanox 4036 # 4036-SW1"

GUID 0x8f10500203b28 port ALL: [PortXmitWait == 86266712]

GUID 0x8f10500203b28 port 1: [PortXmitWait == 33430055]

GUID 0x8f10500203b28 port 34: [PortXmitWait == 52836657]

Errors for 0x8f10500109552 "Mellanox 4036 # 4036-SW2"

GUID 0x8f10500109552 port ALL: [PortXmitWait == 2344169726]

GUID 0x8f10500109552 port 0: [PortXmitWait == 261]

GUID 0x8f10500109552 port 34: [PortXmitWait == 2344169465]

Errors for "IGA-S2D1"

GUID 0x2c903004e445b port 1: [PortXmitWait == 59]

Summary: 5 nodes checked, 4 bad nodes found

80 ports checked, 9 ports have errors beyond threshold

Hi Adam,

The scripts are old and deprecated.

There are no LIDs or SM LIDs for the SM to configure on switch external ports (LID 6, ports 2, 34, and 35), so those warnings are meaningless.

Is LID 1 port 1 an HCA in some node? That might be of concern. What is it connected to? Is there an SM running on the subnet to which it is attached?

– Hal
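As an aside, if you want to confirm what the SM actually assigned on a host port (base LID and SM LID), something like the minimal sketch below can pull it out of ibstat output. This assumes an ibstat-style utility that prints "Base lid"/"SM lid" lines is available on the host (standard in Linux infiniband-diags; I am not certain what the Windows WinOF package ships):

import re
import subprocess

# Run ibstat and print the Port / State / Base lid / SM lid lines it reports.
out = subprocess.run(["ibstat"], capture_output=True, text=True, check=True).stdout
for line in out.splitlines():
    if re.match(r"\s*(Port \d+:|State:|Base lid:|SM lid:)", line):
        print(line.strip())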

So LID 1 is the other switch, so this is also a false warning from that script.

ibnetdiscover says 4xQDR for all your links, so this looks right. That is 10 Gbps signaling (8 Gbps data) per lane, so 40 Gbps signaling and a 32 Gbps max data rate for the 4x link. What app are you running to determine 2 Gbps throughput? The switch/fabric looks OK to me unless there are errors being encountered. Try ibqueryerrors and see what it says.

Are you running a single NTttcp receiver/sender pair between servers, or are there multiple of these going on concurrently?

Is PortXmitWait increasing? It is indicative of congestion. Perhaps some machine is slow at receiving. Are the results the same independent of which server is used?
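One way to see whether PortXmitWait is actually climbing during a run is to sample the counter before and after a test. Below is a minimal sketch under the assumption that the infiniband-diags perfquery tool is in PATH and prints a PortXmitWait line for the port (counter support and tool availability on the Windows stack may differ):

import re
import subprocess
import sys
import time

def port_xmit_wait(lid, port):
    # Query the port's performance counters and extract PortXmitWait.
    out = subprocess.run(["perfquery", str(lid), str(port)],
                         capture_output=True, text=True, check=True).stdout
    m = re.search(r"PortXmitWait[.:\s]+(\d+)", out)
    return int(m.group(1)) if m else None

if __name__ == "__main__":
    lid, port = int(sys.argv[1]), int(sys.argv[2])  # e.g. 6 34 for 4036-SW2 port 34
    before = port_xmit_wait(lid, port)
    time.sleep(15)  # run the ntttcp test during this window
    after = port_xmit_wait(lid, port)
    if before is None or after is None:
        sys.exit("perfquery did not report PortXmitWait for this port")
    print(f"PortXmitWait increased by {after - before} during the interval")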

Here is the full output; throughput is in megabytes/sec. For comparison, a normal 1 GbE connection usually tests around 112 MB/s. These runs have been in the 1500-2000 MB/s range.

Copyright Version 5.31

Network activity progressing…

Thread Time(s) Throughput(KB/s) Avg B / Compl

====== ======= ================ =============

0 15.005 61078.307 65536.000

1 15.005 147897.368 65536.000

2 15.005 31699.300 65536.000

3 15.005 60741.353 65536.000

4 15.005 117652.516 65536.000

5 15.005 24687.238 65536.000

6 15.005 95541.486 65536.000

7 15.005 73238.520 65536.000

8 15.005 142587.138 65536.000

9 15.130 87679.577 65536.000

10 15.005 153727.957 65536.000

11 15.005 147705.432 65536.000

12 15.005 67552.949 65536.000

13 15.005 73229.990 65536.000

14 15.005 142220.327 65536.000

15 15.005 133327.291 65536.000

Totals:

Bytes(MEG) realtime(s) Avg Frame Size Throughput(MB/s)

================ =========== ============== ================

22878.187500 15.010 3913.049 1524.196

Throughput(Buffers/s) Cycles/Byte Buffers

===================== =========== =============

24387.142 0.821 366051.000

DPCs(count/s) Pkts(num/DPC) Intr(count/s) Pkts(num/intr)

============= ============= =============== ==============

43232.911 1.078 87842.039 0.530

Packets Sent Packets Received Retransmits Errors Avg. CPU %

============ ================ =========== ====== ==========

6130646 699305 4427 0 1.955

Indeed, I feel like 75-90% of line rate is what I should expect. My experience with InfiniBand is pretty much none though, so maybe this is the expected throughput for IPoIB? I kind of doubt it. Is there a better way to test this, or any other ideas?

In general, IPoIB does not show off IB performance, but this largely depends on the packet sizes used.

The fact that the throughput is independent of which machines are used indicates that it is either a host tuning issue or simply the expected performance, but I have no knowledge of what the Windows expectations or tuning knobs are, so hopefully someone else can help you here.

Thanks Hal, I appreciate the help. I will probably create a fresh question about speed benchmarks/tuning since it’s a bit off topic from what this thread started as.