How to use CPU isolation to reduce network latency.

We are using the Mellanox ConnectX-5 MT2780. We are trying to reduce network latency by using CPU isolation. We ran mlnx_tune and then passed "isolcpus=7-11" as a kernel parameter (our CPU has 12 cores). From the output of "taskset -cp 1" we confirmed that the CPU isolation was in effect. When we started our application we used taskset to pin it to CPU cores 7-11. But the network latency of our program ended up much worse than without CPU isolation. Could you suggest what we are probably doing wrong? Thanks.
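
For reference, the setup described above amounts to something like the following sketch (the application name here is a placeholder):

$ cat /proc/cmdline           # should contain isolcpus=7-11
$ taskset -cp 1               # pid 1's affinity list should now be 0-6
$ taskset -c 7-11 ./our_app   # pin the latency-sensitive application to the isolated cores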

Did you check whether other applications, like 'sockperf' for example, show the same behaviour? What about RDMA applications, like ib_read_lat or ib_send_lat? Did you try the 'perf' tool to compare behaviour with/without isolcpus and figure out where the application spends more time? Is the issue limited to the Mellanox card only, or can it be reproduced with another vendor too? Is it TCP/IP or RDMA? Are you using Mellanox OFED (ofed_info -s should return valid output) or the inbox driver? Why do you suspect the Mellanox card and not the TCP/IP stack, OS, system, or other settings?
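
For example, the checks suggested here could look like the following (the device name and server IP are placeholders):

$ ofed_info -s                        # prints the MLNX_OFED version if it is installed
$ ib_send_lat -d mlx5_0               # RDMA latency test, server side
$ ib_send_lat -d mlx5_0 192.168.1.10  # RDMA latency test, client side
$ sockperf pp -i 192.168.1.10 -t 10   # sockperf ping-pong latency for 10 seconds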

Yes, I tried sockperf and got similar results.

It looks like when I enable both VMA and CPU isolation the latency is at its worst (about 6.8 millisec),

while the best latency (5.28 microsec) occurs when I don't have CPU isolation and VMA is disabled.

This result is weird, exactly the opposite of my expectation.

Could you tell me what I am doing wrong? Thanks.

##################################################

trial 1

with CPU isolation

with VMA enabled

##################################################

$ cat /proc/cmdline

BOOT_IMAGE=/vmlinuz-3.10.0-957.el7.x86_64 root=UUID=8cd89e3c-79df-456b-b517-a628b744692d ro crashkernel=auto rhgb isolcpus=7-11 quiet skew_tick=1

$ taskset -cp 1

pid 1’s current affinity list: 0-6

$ VMA_SPEC=latency LD_PRELOAD=libvma.so taskset -c 7 sockperf sr

$ VMA_SPEC=latency LD_PRELOAD=libvma.so taskset -c 8 sockperf ul

[ 0] IP = 0.0.0.0 PORT = 11111 # UDP

sockperf: Warmup stage (sending a few dummy messages)…

sockperf: Starting test…

sockperf: Test end (interrupted by timer)

sockperf: Test ended

sockperf: [Total Run] RunTime=1.003 sec; Warm up time=400 msec; SentMessages=9898; ReceivedMessages=98

sockperf: ========= Printing statistics for Server No: 0

sockperf: [Valid Duration] RunTime=0.457 sec; SentMessages=4501; ReceivedMessages=46

sockperf: ====> avg-latency=6817.080 (std-dev=1815.039)

sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0

sockperf: Summary: Latency is 6817.080 usec

sockperf: Total 46 observations; each percentile contains 0.46 observations

sockperf: ---> observation = 8501.150

sockperf: ---> percentile 99.999 = 8501.150

sockperf: ---> percentile 99.990 = 8501.150

sockperf: ---> percentile 99.900 = 8501.150

sockperf: ---> percentile 99.000 = 8501.150

sockperf: ---> percentile 90.000 = 8462.867

sockperf: ---> percentile 75.000 = 8337.827

sockperf: ---> percentile 50.000 = 7489.708

sockperf: ---> percentile 25.000 = 4864.749

sockperf: ---> observation = 3363.551

##################################################

trial 2

without CPU isolation

with VMA enabled

##################################################

$ cat /proc/cmdline

BOOT_IMAGE=/vmlinuz-3.10.0-957.el7.x86_64 root=UUID=8cd89e3c-79df-456b-b517-a628b744692d ro crashkernel=auto rhgb quiet skew_tick=1

$ taskset -cp 1

pid 1’s current affinity list: 0-11

$ VMA_SPEC=latency LD_PRELOAD=libvma.so sockperf sr

$ VMA_SPEC=latency LD_PRELOAD=libvma.so sockperf ul

[ 0] IP = 0.0.0.0 PORT = 11111 # UDP

sockperf: Warmup stage (sending a few dummy messages)…

sockperf: Starting test…

sockperf: Test end (interrupted by timer)

sockperf: Test ended

sockperf: [Total Run] RunTime=1.000 sec; Warm up time=400 msec; SentMessages=10006; ReceivedMessages=100

sockperf: ========= Printing statistics for Server No: 0

sockperf: [Valid Duration] RunTime=0.510 sec; SentMessages=5101; ReceivedMessages=52

sockperf: ====> avg-latency=8.525 (std-dev=1.558)

sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0

sockperf: Summary: Latency is 8.525 usec

sockperf: Total 52 observations; each percentile contains 0.52 observations

sockperf: ---> observation = 11.204

sockperf: ---> percentile 99.999 = 11.204

sockperf: ---> percentile 99.990 = 11.204

sockperf: ---> percentile 99.900 = 11.204

sockperf: ---> percentile 99.000 = 11.175

sockperf: ---> percentile 90.000 = 10.616

sockperf: ---> percentile 75.000 = 9.575

sockperf: ---> percentile 50.000 = 8.494

sockperf: ---> percentile 25.000 = 7.460

sockperf: ---> observation = 5.237

##################################################

trial 3

without CPU isolation

without VMA

##################################################

$ sockperf sr

$ sockperf ul

[ 0] IP = 0.0.0.0 PORT = 11111 # UDP

sockperf: Warmup stage (sending a few dummy messages)…

sockperf: Starting test…

sockperf: Test end (interrupted by timer)

sockperf: Test ended

sockperf: [Total Run] RunTime=1.000 sec; Warm up time=400 msec; SentMessages=10006; ReceivedMessages=100

sockperf: ========= Printing statistics for Server No: 0

sockperf: Test end (interrupted by signal 2)

sockperf: [Valid Duration] RunTime=0.520 sec; SentMessages=5201; ReceivedMessages=53

sockperf: ====> avg-latency=5.280 (std-dev=0.099)

sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0

sockperf: Summary: Latency is 5.280 usec

sockperf: Total 53 observations; each percentile contains 0.53 observations

sockperf: ---> observation = 5.480

sockperf: ---> percentile 99.999 = 5.480

sockperf: ---> percentile 99.990 = 5.480

sockperf: ---> percentile 99.900 = 5.480

sockperf: ---> percentile 99.000 = 5.459

sockperf: ---> percentile 90.000 = 5.420

sockperf: ---> percentile 75.000 = 5.350

sockperf: ---> percentile 50.000 = 5.275

sockperf: ---> percentile 25.000 = 5.219

sockperf: ---> observation = 5.079

What about running sockperf without VMA, with and without isolcpus?

Did you try the "raw_ethernet_lat" application? It uses the same RAW QP type as VMA and doesn't require VMA in the middle.

Try running perf record on a specific CPU and see the difference.
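
For example, something along these lines (the CPU number matches the pinning used above; output file names are arbitrary):

$ perf record -C 7 -g -o perf.isol.data -- sleep 10     # profile everything on CPU 7 for 10 seconds (booted with isolcpus)
$ perf record -C 7 -g -o perf.noisol.data -- sleep 10   # repeat after booting without isolcpus
$ perf report -i perf.isol.data                         # compare where the time is spent in each case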

Add "intel_idle.max_cstate=0 processor.max_cstate=1" to the grub configuration, following the tuning guide:

https://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters_Archive.pdf

Try removing "skew_tick=1" and see if it is related. Mellanox doesn't use this parameter in its testing.
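
On CentOS 7 these kernel command line changes would typically be made roughly as follows (a sketch; back up the grub config first):

$ vi /etc/default/grub                      # add intel_idle.max_cstate=0 processor.max_cstate=1 to GRUB_CMDLINE_LINUX, drop skew_tick=1 while testing
$ grub2-mkconfig -o /boot/grub2/grub.cfg    # regenerate the grub config
$ reboot
$ cat /proc/cmdline                         # confirm the new parameters took effect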

Here are the test results:

with isolcpus and with VMA

sockperf: Summary: Latency is 6817.080 usec

without isolcpus and with VMA

sockperf: Summary: Latency is 8.525 usec

without isolcpus and without VMA

sockperf: Summary: Latency is 5.280 usec

I don't have raw_ethernet_lat or perf installed; how do I install them?

I tried adding "intel_idle.max_cstate=0 processor.max_cstate=1" to the grub configuration, but it only improved latency a little; sockperf with VMA + isolcpus still takes 6xxx usec.

Hi, are you using Mellanox OFED? Is it a standard Red Hat kernel or a real-time one?

We are not using Mellanox OFED.

It is standard CentOS, not a real-time kernel.

We are using the Mellanox ConnectX-5 MT2780.

Could you check whether it happens with the latest Mellanox OFED v4.7 and the latest VMA v8.9.4?

From the output of mlnx_tune, it looks like we are using mlnx-en-4.6-1.0.1.1 (OFED-4.6-1.0.1):

Mellanox Technologies - System Report

Operation System Status

UNKNOWN

3.10.0-957.el7.x86_64

CPU Status

GenuineIntel Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz Skylake

Warning: Frequency 3400.0MHz

Memory Status

Total: 30.92 GB

Free: 28.00 GB

Hugepages Status

On NUMA 1:

Transparent enabled: never

Transparent defrag: always

Hyper Threading Status

INACTIVE

IRQ Balancer Status

NOT PRESENT

Firewall Status

NOT PRESENT

IP table Status

NOT PRESENT

IPv6 table Status

NOT PRESENT

Driver Status

OK: mlnx-en-4.6-1.0.1.1 (OFED-4.6-1.0.1)

ConnectX-5 Device Status on PCI 12:00.0

FW version 16.25.1020

OK: PCI Width x16

OK: PCI Speed 8GT/s

PCI Max Payload Size 256

PCI Max Read Request 4096

Local CPUs list [0, 1, 2, 3, 4, 5]

ens1f0 (Port 1) Status

Link Type eth

OK: Link status Up

Speed 10GbE

MTU 1500

OK: TX nocache copy ‘off’

ConnectX-5 Device Status on PCI 12:00.1

FW version 16.25.1020

OK: PCI Width x16

OK: PCI Speed 8GT/s

PCI Max Payload Size 256

PCI Max Read Request 4096

Local CPUs list [0, 1, 2, 3, 4, 5]

ens1f1 (Port 1) Status

Link Type eth

OK: Link status Up

Speed 10GbE

MTU 1500

OK: TX nocache copy ‘off’

2019-10-29 10:13:46,493 INFO System info file: /tmp/mlnx_tune_191029_101343.log
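
Note that mlnx_tune reports "Local CPUs list [0, 1, 2, 3, 4, 5]" for both ports; the same NUMA locality information can be read directly from sysfs, e.g. for the interface reported above:

$ cat /sys/class/net/ens1f0/device/numa_node       # NUMA node the NIC is attached to
$ cat /sys/class/net/ens1f0/device/local_cpulist   # CPUs local to that NUMA node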

The issue is not reproducible on my setup. What are the sockperf results with isolcpus and without VMA?

Try the 'perf' tool to analyze the application and find the bottleneck.

Here is the result with isolcpus and without VMA. And could you elaborate on "Try the 'perf' tool to analyze the application and find the bottleneck"? What specifically should I do? Thanks.

##################################################

trial 4

with CPU isolation

without VMA

##################################################

$ cat /proc/cmdline

BOOT_IMAGE=/vmlinuz-3.10.0-957.el7.x86_64 root=UUID=8cd89e3c-79df-456b-b517-a628b744692d ro crashkernel=auto rhgb quiet intel_idle.max_cstate=0 processor.max_cstate=1 isolcpus=7-11 skew_tick=1

$ taskset -cp 1

pid 1’s current affinity list: 0-6

$ taskset -c 7 sockperf sr


$ taskset -c 8 sockperf ul

sockperf: == version #3.6-no.git ==

sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)

[ 0] IP = 0.0.0.0 PORT = 11111 # UDP

sockperf: Warmup stage (sending a few dummy messages)…

sockperf: Starting test…

sockperf: Test end (interrupted by timer)

sockperf: Test end (interrupted by signal 2)

sockperf: Test ended

sockperf: [Total Run] RunTime=1.000 sec; Warm up time=400 msec; SentMessages=10006; ReceivedMessages=100

sockperf: ========= Printing statistics for Server No: 0

sockperf: [Valid Duration] RunTime=0.520 sec; SentMessages=5201; ReceivedMessages=53

sockperf: ====> avg-latency=5.171 (std-dev=0.058)

sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0

sockperf: Summary: Latency is 5.171 usec

sockperf: Total 53 observations; each percentile contains 0.53 observations

sockperf: ---> observation = 5.302

sockperf: ---> percentile 99.999 = 5.302

sockperf: ---> percentile 99.990 = 5.302

sockperf: ---> percentile 99.900 = 5.302

sockperf: ---> percentile 99.000 = 5.287

sockperf: ---> percentile 90.000 = 5.256

sockperf: ---> percentile 75.000 = 5.204

sockperf: ---> percentile 50.000 = 5.159

sockperf: ---> percentile 25.000 = 5.136

sockperf: ---> observation = 5.056