random delays on two back-to-back connected systems with ConnectX-5 2x100G cards

Dear Mellanox community

We observe random delays on two back-to-back connected systems with ConnectX-5 2x100G cards.

Example:

(a long stretch of normal latency; then excessive delays of 10-100 ms, sometimes lasting only a fraction of a second, sometimes for minutes; then normal latency again; no packet loss)

[1630922510.721460] 64 bytes from 192.168.1.2: icmp_seq=36406 ttl=64 time=0.025 ms

[1630922510.737460] 64 bytes from 192.168.1.2: icmp_seq=36407 ttl=64 time=0.024 ms

[1630922510.783855] 64 bytes from 192.168.1.2: icmp_seq=36408 ttl=64 time=30.4 ms

[1630922510.783861] 64 bytes from 192.168.1.2: icmp_seq=36409 ttl=64 time=14.4 ms

[1630922510.783887] 64 bytes from 192.168.1.2: icmp_seq=36410 ttl=64 time=0.018 ms

[1630922510.797453] 64 bytes from 192.168.1.2: icmp_seq=36411 ttl=64 time=0.021 ms

system: Supermicro AS-1113S-WN10RT, 256GB RAM

CPU: AMD EPYC 7702P 64-Core

NICs: MT27800 Family [ConnectX-5 2x100G], 2 cards

cable: 100G-CR4 MLX (DAC)

driver: mlx5_core

version: 5.4-1.0.3 (newest as of this writing)

firmware-version: 16.31.1014 (MT_0000000012) (newest as of this writing)

kernel/OS: 5.4.0-81-generic #91~18.04.1-Ubuntu SMP (newest as of this writing for Ubuntu 18.04 LTS HWE)

CPU: power states limited to no lower than C1 (a quick way to verify this is sketched after the system details below)

storage: 2x 8TB Samsung PM1733 NVMe (2.5") + 2x 256GB Micron_2200 (RAID1, boot)
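The C-state limit can be confirmed with checks along these lines (a generic sketch, not the exact commands from our notes; cpupower comes from the linux-tools packages):

# list the idle states the kernel exposes and whether the deeper ones are disabled
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/disable
# cpupower shows the same information in a friendlier form
cpupower idle-info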

The systems host OSDs for a very lightly loaded test Ceph cluster (storage access + replication traffic is currently on the order of 50-100 Mb/s, CPU usage 0-1%, with 10 mostly idle VMs connecting from external hypervisor hosts).

For the purpose of this test, one NIC is connected back-to-back to the other host as a plain L2/L3 link (no VLAN, no bonding, default parameters), and we see the delays there as well.
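The back-to-back link uses plain static addressing; a minimal sketch (only 192.168.1.2 appears in the ping output above, the .1 peer address and the /24 mask are assumptions for illustration):

# on nvme01
ip addr add 192.168.1.1/24 dev enp129s0f0
ip link set enp129s0f0 up
# on nvme02
ip addr add 192.168.1.2/24 dev enp129s0f0
ip link set enp129s0f0 up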

The other ConnectX-5 card (the one actually used for Ceph) is connected to a pair of Dell S4048 switches in VLT/LACP mode (which then connect via eVLT to another rack with the same configuration). The NICs negotiate 40G, because the Dell S4048-ON offers 40G ports only.

We observe the excessive delays on both cards (Ceph and back-to-back), more or less at the same time, so we believe we can take the switches and the bonding out of the equation for now.

kern.log, syslog, and journalctl show nothing around the time of the events (and no other errors during boot, etc.).
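The log checks were roughly of this kind (a sketch; the timestamps are placeholders around a delay event):

# kernel messages around the time of a delay event
journalctl -k --since "2021-09-06 09:00" --until "2021-09-06 10:00"
# anything from the mlx5 driver in the kernel log
grep -i mlx5 /var/log/kern.log | tail
# full journal for the same window
journalctl --since "2021-09-06 09:00" --until "2021-09-06 10:00"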

Topology (fixed font):

host nvme01                              host nvme02

enp129s0f0 ----------------------------- enp129s0f0   (back-to-back connection, we also see excessive delay here)

enp129s0f1 ----------------------------- enp129s0f1

host nvme01                                                                 host nvme02

enp129s0f0 (bond0) -- (vlt) dell4048-01 --evlt-- dell4048-03 (vlt) -- (bond0) enp129s0f0

enp129s0f1 (bond0) -- (vlt) dell4048-02 --evlt-- dell4048-04 (vlt) -- (bond0) enp129s0f1

Thank you and regards

Piotrek Z

Hi Piotrek Z,

There are many issues that can cause this problem.

It would be best to collect a sysinfo-snapshot and send it by mail to

Nvidia Support Admin networking-support@nvidia.com

Thanks,

Suo

Hi Suo,

We spent quite some time researching this, trying system power options, card power options, buffers, offloads, etc. Ultimately, after several sessions with perf and stap, one of our colleagues narrowed it down to long times spent in IO virtual address allocation (alloc_iova; a side effect was also slow writes on NVMe). The suggestion was to disable the IOMMU:

GRUB_CMDLINE_LINUX_DEFAULT="amd_iommu=off"
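On Ubuntu the change is applied roughly like this (a sketch; merge the option into whatever GRUB_CMDLINE_LINUX_DEFAULT already contains in /etc/default/grub):

# after adding amd_iommu=off to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub:
sudo update-grub
sudo reboot
# after the reboot, confirm the option took effect
cat /proc/cmdline
dmesg | grep -i AMD-Vi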

After this change the systems behave much better: after several days of pinging every 0.01 s we found literally one packet with a delay >1 ms, and the rest of the statistics look OK:

31502327 packets transmitted, 31502326 received, 0% packet loss, time 505214691ms

rtt min/avg/max/mdev = 0.021/0.054/3.989/0.018 ms
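For reference, the monitoring was essentially a long-running timestamped ping at a 0.01 s interval, along these lines (iputils ping; -D produces the bracketed Unix timestamps shown above, intervals below 0.2 s require root, and the log file name is just an example):

# one probe every 10 ms, with a Unix timestamp per reply
sudo ping -D -i 0.01 192.168.1.2 | tee ping-nvme02.log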

We have also tested an almost identical system, but with Ubuntu 20.04.2 LTS, and saw no problems.

Best

Piotrek Z

Hi Suo,

For reference, find the perf flamegraph below:

As is evident from the flamegraph, alloc_iova is eating a lot of CPU time spinning on a lock.
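A flamegraph like this is typically generated with perf plus the stackcollapse/flamegraph scripts; the exact invocation is not recorded here, so the commands below are an illustrative reconstruction:

# sample all CPUs with call stacks for 60 seconds while the delays are happening
sudo perf record -a -g -- sleep 60
# fold the stacks and render the SVG (stackcollapse-perf.pl and flamegraph.pl
# come from https://github.com/brendangregg/FlameGraph)
sudo perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > alloc_iova.svg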

A systemtap histogram reveals that some IO virtual address allocations would take up to 17ms(!).

We have observed round trip delays of over 200ms.

/usr/local/bin/stap -DMAXSKIPPED=300 -v histogram-kernel.stp 'kernel.function("alloc_iova")' 2

Pass 1: parsed user script and 480 library scripts using 128112virt/98420res/5960shr/92812data kb, in 240usr/20sys/250real ms.

Pass 2: analyzed script: 4 probes, 3 functions, 1 embed, 3 globals using 198212virt/169484res/6848shr/162912data kb, in 810usr/100sys/910real ms.

Pass 3: translated to C into "/tmp/stapU5aTxv/stap_8402ca207afb505112b5a6b5969166c8_2884_src.c" using 198212virt/169812res/7168shr/162912data kb, in 10usr/80sys/87real ms.

Pass 4: compiled C into "stap_8402ca207afb505112b5a6b5969166c8_2884.ko" in 2030usr/280sys/2190real ms.

Pass 5: starting run.

Cpu 110 was already in that function?

Cpu 112 was already in that function?

Duration min:0us avg:57us max:17490us count:3920

Duration (us):

value |-------------------------------------------------- count

0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 2342

1 |@@@@@@@@@@ 477

2 |@@@@@@@ 359

4 |@@@@@@ 286

8 |@@@@ 195

16 |@@ 119

32 |@ 80

64 | 33

128 | 10

256 | 2

512 | 0

1024 | 1

2048 | 1

4096 | 3

8192 | 8

16384 | 4

32768 | 0

65536 | 0

WARNING: Number of errors: 0, skipped probes: 123

Pass 5: run completed in 20usr/110sys/2646real ms.

As mentioned by Piotr, disabling amd_iommu resolved this behaviour.

In our other AMD EPYC setup, running kernel 5.11.0-25-generic #27~20.04.1-Ubuntu SMP, we have not observed this issue.

Regards,

Borgert van der Kluit