Odd, asymmetric ib_send_lat results?

I have two small IB clusters set up for testing:

  • Each cluster has an SB7700 IB switch.
  • Two servers, each with an MCX455A-ECAT ConnectX-4 VPI adapter, are connected to each switch.

Essential system and software info:

[root@fs00 ~]# uname -a

Linux fs00 3.10.0-327.22.2.el7.x86_64 #1 SMP Thu Jun 23 17:05:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

[root@fs00 ~]# rpm -qa |grep ofed

ofed-scripts-3.3-OFED.3.3.1.0.0.x86_64

mlnx-ofed-all-3.3-1.0.0.0.noarch

I have been testing the two clusters using ib_send_lat and observed the following that I don’t understand:

Cluster A:

Server:

I got the latency numbers that I anticipated. Reversing the roles of client and server gave more or less the same results. Again, that’s what I anticipated.

[root@fs01 ~]# ib_send_lat -a -c UD


* Waiting for client to connect… *

Max msg size in UD is MTU 4096

Changing to this MTU


Send Latency Test

Dual-port : OFF Device : mlx5_0

Number of qps : 1 Transport type : IB

Connection type : UD Using SRQ : OFF

RX depth : 1000

Mtu : 4096[B]

Link type : IB

Max inline data : 188[B]

rdma_cm QPs : OFF

Data ex. method : Ethernet


local address: LID 0x02 QPN 0x0028 PSN 0xd0b6fe

remote address: LID 0x03 QPN 0x0028 PSN 0x91642e


#bytes #iterations t_min[usec] t_max[usec] t_typical[usec]

2 1000 0.81 4.47 0.83

4 1000 0.82 3.88 0.83

8 1000 0.81 2.95 0.83

16 1000 0.82 3.31 0.84

32 1000 0.88 3.40 0.90

64 1000 0.88 3.27 0.90

128 1000 0.91 3.54 0.93

256 1000 1.23 3.55 1.25

512 1000 1.29 4.17 1.32

1024 1000 1.49 3.15 1.51

2048 1000 1.72 4.32 1.74

4096 1000 2.15 4.32 2.20


continued…

Client:

[root@fs00 ~]# ib_send_lat -a -c UD 192.168.11.151

Max msg size in UD is MTU 4096

Changing to this MTU


Send Latency Test

Dual-port : OFF Device : mlx5_0

Number of qps : 1 Transport type : IB

Connection type : UD Using SRQ : OFF

TX depth : 1

Mtu : 4096[B]

Link type : IB

Max inline data : 188[B]

rdma_cm QPs : OFF

Data ex. method : Ethernet


local address: LID 0x03 QPN 0x0028 PSN 0x91642e

remote address: LID 0x02 QPN 0x0028 PSN 0xd0b6fe


#bytes #iterations t_min[usec] t_max[usec] t_typical[usec]

2 1000 0.81 8.37 0.83

4 1000 0.82 3.87 0.83

8 1000 0.81 2.97 0.83

16 1000 0.82 3.31 0.84

32 1000 0.88 3.41 0.89

64 1000 0.88 3.27 0.90

128 1000 0.91 3.55 0.93

256 1000 1.23 3.56 1.25

512 1000 1.30 4.15 1.32

1024 1000 1.48 3.17 1.51

2048 1000 1.72 4.32 1.74

4096 1000 2.16 4.32 2.20


continued…

local address: LID 0x03 QPN 0x002c PSN 0x8f46d

remote address: LID 0x02 QPN 0x002c PSN 0x9c2fe5


#bytes #iterations t_min[usec] t_max[usec] t_typical[usec]

2 1000 0.76 5.30 0.78

4 1000 0.78 4.56 0.79

8 1000 0.76 3.80 0.78

16 1000 0.77 3.39 0.79

32 1000 0.83 3.07 0.84

64 1000 0.84 5.82 0.86

128 1000 0.86 3.95 0.88

256 1000 1.17 4.01 1.19

512 1000 1.25 4.64 1.27

1024 1000 1.45 3.70 1.46

2048 1000 1.67 5.21 1.70

4096 1000 2.13 4.72 2.16


Client:

[root@fs11 ~]# ib_send_lat -a -c UD 192.168.12.150

Max msg size in UD is MTU 4096

Changing to this MTU


Send Latency Test

Dual-port : OFF Device : mlx5_0

Number of qps : 1 Transport type : IB

Connection type : UD Using SRQ : OFF

TX depth : 1

Mtu : 4096[B]

Link type : IB

Max inline data : 188[B]

rdma_cm QPs : OFF

Data ex. method : Ethernet


local address: LID 0x02 QPN 0x002c PSN 0x9c2fe5

remote address: LID 0x03 QPN 0x002c PSN 0x8f46d


#bytes #iterations t_min[usec] t_max[usec] t_typical[usec]

2 1000 0.76 5.29 0.78

4 1000 0.77 4.57 0.79

8 1000 0.77 3.80 0.78

16 1000 0.77 3.38 0.79

32 1000 0.83 3.06 0.84

64 1000 0.84 5.77 0.86

128 1000 0.86 3.95 0.88

256 1000 1.17 3.97 1.19

512 1000 1.25 4.65 1.27

1024 1000 1.44 3.69 1.47

2048 1000 1.67 5.18 1.70

4096 1000 2.13 4.68 2.16


I am very puzzled by the above outcome. I would appreciate any hints as to what I can do to figure out what’s causing the large maximum latencies.

continued…

Cluster B:

As shown below, in Direction I the client-side max latency is about 10X larger. What’s odd is that once I reversed the roles of client and server, both sides showed the latency numbers that I anticipated.

Server:

Direction I

[root@fs11 ~]# ib_send_lat -a -c UD


* Waiting for client to connect… *

Max msg size in UD is MTU 4096

Changing to this MTU


Send Latency Test

Dual-port : OFF Device : mlx5_0

Number of qps : 1 Transport type : IB

Connection type : UD Using SRQ : OFF

RX depth : 1000

Mtu : 4096[B]

Link type : IB

Max inline data : 188[B]

rdma_cm QPs : OFF

Data ex. method : Ethernet


local address: LID 0x02 QPN 0x002b PSN 0x79fb69

remote address: LID 0x03 QPN 0x002b PSN 0xfbae7e


#bytes #iterations t_min[usec] t_max[usec] t_typical[usec]

2 1000 0.76 4.93 0.78

4 1000 0.77 3.60 0.79

8 1000 0.76 4.16 0.78

16 1000 0.77 3.54 0.79

32 1000 0.83 3.60 0.85

64 1000 0.83 3.74 0.85

128 1000 0.86 3.52 0.88

256 1000 1.18 4.68 1.20

512 1000 1.25 3.88 1.27

1024 1000 1.44 4.71 1.46

2048 1000 1.68 4.20 1.70

4096 1000 2.13 3.91 2.16


Client:

[root@fs10 ~]# ib_send_lat -a -c UD 192.168.12.151

Max msg size in UD is MTU 4096

Changing to this MTU


reply…

I ran ib_send_lat on each host, so client and server were on the same host. The numbers all look reasonable. In that setup, neither the cables nor the switch for the subnet is involved. Any suggestions as to what to test next to narrow down the cause of the spikes?
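
For reference, a single-host loopback run can be set up roughly like this (a sketch; the device name mlx5_0 and the core pinning are illustrative):

# terminal 1: server side, pinned to one core
taskset -c 2 ib_send_lat -a -c UD -d mlx5_0

# terminal 2 on the same host: client side, pointed at the local host
taskset -c 4 ib_send_lat -a -c UD -d mlx5_0 localhost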

When you suggested a loopback test, did you mean to test the two cables? I.e., take each cable, loop it back between two ports, and see if the link comes up? I do have an SM running on the switch.

BTW, I have updated both MLNX_OFED and the HCA firmware to the latest versions. Here is an example from running hca_self_test.ofed:

---- Performing Adapter Device Self Test ----

Number of CAs Detected … 1

PCI Device Check … PASS

Kernel Arch … x86_64

Host Driver Version … MLNX_OFED_LINUX-3.3-1.0.4.0 (OFED-3.3-1.0.4): modules

Host Driver RPM Check … PASS

Firmware on CA #0 HCA … v12.16.1006

Firmware Check on CA #0 (HCA) … NA

REASON: NO required fw version

Host Driver Initialization … PASS

Number of CA Ports Active … 1

Port State of Port #1 on CA #0 (HCA)… UP 4X EDR (InfiniBand)

Error Counter Check on CA #0 (HCA)… PASS

Kernel Syslog Check … PASS

Node GUID on CA #0 (HCA) … 7c:fe:90:03:00:29:26:b6

------------------ DONE ---------------------
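
The driver and firmware versions can also be cross-checked directly; for example (assuming the device is mlx5_0):

ofed_info -s                                        # installed MLNX_OFED version string
ibv_devinfo -d mlx5_0 | grep -E 'fw_ver|board_id'   # HCA firmware version and board ID
ibstat mlx5_0                                       # port state, rate and LID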

I repeated the tests that I did before, and I still observed spikes.

Host fs10

Server:

[root@fs10 ~]# ib_send_lat -a -c UD


* Waiting for client to connect… *

Max msg size in UD is MTU 4096

Changing to this MTU


Send Latency Test

Dual-port : OFF Device : mlx5_0

Number of qps : 1 Transport type : IB

Connection type : UD Using SRQ : OFF

RX depth : 1000

Mtu : 4096[B]

Link type : IB

Max inline data : 188[B]

rdma_cm QPs : OFF

Data ex. method : Ethernet


local address: LID 0x03 QPN 0x0031 PSN 0xc0fb9e

remote address: LID 0x03 QPN 0x0030 PSN 0x3bd5ea


#bytes #iterations t_min[usec] t_max[usec] t_typical[usec]

2 1000 0.67 4.41 0.69

4 1000 0.67 4.77 0.69

8 1000 0.67 4.77 0.69

16 1000 0.67 4.28 0.69

32 1000 0.71 4.93 0.72

64 1000 0.71 5.22 0.72

128 1000 0.75 4.80 0.76

256 1000 1.06 4.20 1.08

512 1000 1.14 4.79 1.16

1024 1000 1.27 5.08 1.29

2048 1000 1.54 5.62 1.55

4096 1000 2.04 5.71 2.06


continued…

Send Latency Test

Dual-port : OFF Device : mlx5_0

Number of qps : 1 Transport type : IB

Connection type : UD Using SRQ : OFF

TX depth : 1

Mtu : 4096[B]

Link type : IB

Max inline data : 188[B]

rdma_cm QPs : OFF

Data ex. method : Ethernet


local address: LID 0x03 QPN 0x002a PSN 0x544e64

remote address: LID 0x02 QPN 0x002a PSN 0x7babed


#bytes #iterations t_min[usec] t_max[usec] t_typical[usec]

2 1000 0.76 45.78 0.78

4 1000 0.77 30.98 0.79

8 1000 0.76 37.99 0.78

16 1000 0.77 43.70 0.79

32 1000 0.83 47.34 0.85

64 1000 0.84 39.94 0.86

128 1000 0.86 41.16 0.88

256 1000 1.18 37.54 1.20

512 1000 1.24 42.94 1.26

1024 1000 1.43 39.50 1.45

2048 1000 1.66 42.06 1.69

4096 1000 2.11 40.37 2.15


Direction II

Server:

[root@fs10 ~]# ib_send_lat -a -c UD


* Waiting for client to connect… *

Max msg size in UD is MTU 4096

Changing to this MTU


Send Latency Test

Dual-port : OFF Device : mlx5_0

Number of qps : 1 Transport type : IB

Connection type : UD Using SRQ : OFF

RX depth : 1000

Mtu : 4096[B]

Link type : IB

Max inline data : 188[B]

rdma_cm QPs : OFF

Data ex. method : Ethernet


continued…

Client:

[root@fs10 ~]# ib_send_lat -a -c UD 192.168.12.150

Max msg size in UD is MTU 4096

Changing to this MTU


Send Latency Test

Dual-port : OFF Device : mlx5_0

Number of qps : 1 Transport type : IB

Connection type : UD Using SRQ : OFF

TX depth : 1

Mtu : 4096[B]

Link type : IB

Max inline data : 188[B]

rdma_cm QPs : OFF

Data ex. method : Ethernet


local address: LID 0x03 QPN 0x0030 PSN 0x3bd5ea

remote address: LID 0x03 QPN 0x0031 PSN 0xc0fb9e


#bytes #iterations t_min[usec] t_max[usec] t_typical[usec]

2 1000 0.67 5.92 0.68

4 1000 0.67 4.93 0.69

8 1000 0.67 4.80 0.68

16 1000 0.67 4.29 0.69

32 1000 0.70 4.95 0.72

64 1000 0.70 5.21 0.72

128 1000 0.75 4.81 0.76

256 1000 1.07 4.20 1.08

512 1000 1.14 4.80 1.16

1024 1000 1.27 5.08 1.29

2048 1000 1.53 5.63 1.55

4096 1000 2.04 5.70 2.06


Host fs11

Server:

[root@fs11 ~]# ib_send_lat -a -c UD


* Waiting for client to connect… *

Max msg size in UD is MTU 4096

Changing to this MTU


Send Latency Test

Dual-port : OFF Device : mlx5_0

Number of qps : 1 Transport type : IB

Connection type : UD Using SRQ : OFF

RX depth : 1000

Mtu : 4096[B]

Link type : IB

Max inline data : 188[B]

rdma_cm QPs : OFF

Data ex. method : Ethernet


local address: LID 0x02 QPN 0x002d PSN 0xf82dfe

remote address: LID 0x02 QPN 0x002c PSN 0x49619e


#bytes #iterations t_min[usec] t_max[usec] t_typical[usec]

2 1000 0.68 2.60 0.69

4 1000 0.67 2.34 0.69

8 1000 0.67 2.06 0.69

16 1000 0.68 1.95 0.69

32 1000 0.71 1.83 0.72

64 1000 0.71 1.82 0.72

128 1000 0.75 1.91 0.76

256 1000 1.07 3.26 1.09

512 1000 1.14 2.50 1.15

1024 1000 1.28 2.71 1.30

2048 1000 1.54 2.83 1.56

4096 1000 2.05 2.76 2.07


continued…

Client:

[root@fs11 ~]# ib_send_lat -a -c UD 192.168.12.151

Max msg size in UD is MTU 4096

Changing to this MTU


Send Latency Test

Dual-port : OFF Device : mlx5_0

Number of qps : 1 Transport type : IB

Connection type : UD Using SRQ : OFF

TX depth : 1

Mtu : 4096[B]

Link type : IB

Max inline data : 188[B]

rdma_cm QPs : OFF

Data ex. method : Ethernet


local address: LID 0x02 QPN 0x002c PSN 0x49619e

remote address: LID 0x02 QPN 0x002d PSN 0xf82dfe


#bytes #iterations t_min[usec] t_max[usec] t_typical[usec]

2 1000 0.67 5.75 0.69

4 1000 0.67 2.37 0.69

8 1000 0.67 5.52 0.69

16 1000 0.67 1.86 0.69

32 1000 0.70 2.01 0.72

64 1000 0.71 1.85 0.72

128 1000 0.75 1.90 0.76

256 1000 1.06 5.13 1.08

512 1000 1.13 2.27 1.15

1024 1000 1.28 2.74 1.30

2048 1000 1.53 2.86 1.56

4096 1000 2.04 6.10 2.07


continued…

I also ran ibdiagnet -pc -lw 4x -ls 25 -P all=1 --pm_pause_time 600 --get_cable_info on all four servers. All runs produced the following summary:


Summary

-I- Stage Warnings Errors Comment

-I- Discovery 0 0

-I- Lids Check 0 0

-I- Links Check 0 0

-I- Subnet Manager 0 0

-I- Port Counters 0 0

-I- Nodes Information 0 0

-I- Speed / Width checks 0 0

-I- Partition Keys 0 0

-I- Alias GUIDs 0 0

-I- Temperature Sensing 0 0

-I- Cable Diagnostic (Plugin) 0 0
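
For the record, per-port error counters can also be spot-checked with the standard InfiniBand diags; for example (the LID and port number below are only illustrative):

ibqueryerrors        # report ports whose error counters exceed thresholds
perfquery -x 3 1     # extended port counters for LID 3, port 1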

reply…

The last column is the most important, as it shows the typical value, and those values are close. One of the hosts probably has something different in its settings (BIOS, OS tuning, some daemon running, management interrupts) that causes spikes in some iterations.

Try to run loopback tests.
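
For example, a few generic things worth comparing between the two hosts (illustrative commands; exact paths and names vary by distro):

# CPU frequency governor: "performance" avoids frequency-ramp jitter
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c

# which cores service the HCA interrupts
grep mlx5 /proc/interrupts

# anything periodically waking up and stealing CPU
ps -eo pid,comm,pcpu --sort=-pcpu | head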

reply…

You should look at the typical value (the right column), which is good. There is always a chance that a single iteration has a higher value than the typical one.
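
One way to see how rare those outliers are is to dump every iteration instead of just the summary; for example, something like this (perftest's -H/--report-histogram option prints all results, and the address is just the one from the runs above):

# server
ib_send_lat -c UD -s 4096 -n 20000

# client: print per-iteration results so the spike frequency is visible
ib_send_lat -c UD -s 4096 -n 20000 -H 192.168.12.151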

reply…

6 days ago, I reported that I had fixed my IPoIB setup. I just recently found some time to revisit this issue. Indeed, as I suspected, the original, incorrect IPoIB setup was the cause of the jitter I observed, most likely because I had been using some inexpensive third-party SFP+ DACs for the 10G Ethernet ports on the servers. So, problem solved for now.