We are using the Mellanox ConnectX-5 MT2780 and are trying to reduce network latency by using CPU isolation. We ran mlnx_tune, then passed “isolcpus=7-11” as a kernel parameter (our CPU has 12 cores). From the output of "taskset -cp 1" we confirmed that the CPU isolation was successful. When we started our application, we used taskset to pin it to CPU cores 7-11. But the network latency of our program ended up much worse than without CPU isolation. Could you suggest what we are probably doing wrong? Thanks.
Did you check whether another application, like ‘sockperf’ for example, shows the same behaviour? What about RDMA applications like ib_read_lat and ib_send_lat? Did you try using the ‘perf’ tool to compare behaviour with and without isolcpus and figure out where the application spends more time? Is the issue limited to the Mellanox card only, or can it be reproduced with another vendor too? Is it TCP/IP or RDMA? Are you using Mellanox OFED (ofed_info -s should return valid output) or the inbox driver? Why do you suspect the Mellanox card and not the TCP/IP stack, OS, system, or other settings?
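For reference, a minimal way to run those checks between two hosts could look like the following (a hedged sketch; the server address 192.168.1.10 and the pinned cores are placeholders):

# kernel-stack latency with sockperf (ping-pong mode)
$ taskset -c 7 sockperf sr                         # on the server host
$ taskset -c 7 sockperf pp -i 192.168.1.10 -t 10   # on the client host

# RDMA latency with the perftest tools (no sockets, no VMA)
$ ib_send_lat                                      # on the server host
$ ib_send_lat 192.168.1.10                         # on the client host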
Yes, I tried using sockperf and got similar results.
It looks like when I enable both VMA and CPU isolation, the latency is at its worst (about 6.8 millisec),
while the best latency (5.28 microsec) occurs when I have neither CPU isolation nor VMA.
This result is weird, and exactly the opposite of my expectation.
Could you tell me what I am doing wrong? Thanks.
##################################################
trial 1
with CPU isolation
with VMA enabled
##################################################
$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-957.el7.x86_64 root=UUID=8cd89e3c-79df-456b-b517-a628b744692d ro crashkernel=auto rhgb isolcpus=7-11 quiet skew_tick=1
$ taskset -cp 1
pid 1’s current affinity list: 0-6
$ VMA_SPEC=latency LD_PRELOAD=libvma.so taskset -c 7 sockperf sr
…
$ VMA_SPEC=latency LD_PRELOAD=libvma.so taskset -c 8 sockperf ul
[ 0] IP = 0.0.0.0 PORT = 11111 # UDP
sockperf: Warmup stage (sending a few dummy messages)…
sockperf: Starting test…
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=1.003 sec; Warm up time=400 msec; SentMessages=9898; ReceivedMessages=98
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=0.457 sec; SentMessages=4501; ReceivedMessages=46
sockperf: ====> avg-latency=6817.080 (std-dev=1815.039)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 6817.080 usec
sockperf: Total 46 observations; each percentile contains 0.46 observations
sockperf: ---> <MAX> observation = 8501.150
sockperf: ---> percentile 99.999 = 8501.150
sockperf: ---> percentile 99.990 = 8501.150
sockperf: ---> percentile 99.900 = 8501.150
sockperf: ---> percentile 99.000 = 8501.150
sockperf: ---> percentile 90.000 = 8462.867
sockperf: ---> percentile 75.000 = 8337.827
sockperf: ---> percentile 50.000 = 7489.708
sockperf: ---> percentile 25.000 = 4864.749
sockperf: ---> <MIN> observation = 3363.551
##################################################
trial 2
without CPU isolation
with VMA enabled
##################################################
$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-957.el7.x86_64 root=UUID=8cd89e3c-79df-456b-b517-a628b744692d ro crashkernel=auto rhgb quiet skew_tick=1
$ taskset -cp 1
pid 1’s current affinity list: 0-11
$ VMA_SPEC=latency LD_PRELOAD=libvma.so sockperf sr
…
$ VMA_SPEC=latency LD_PRELOAD=libvma.so sockperf ul
[ 0] IP = 0.0.0.0 PORT = 11111 # UDP
sockperf: Warmup stage (sending a few dummy messages)…
sockperf: Starting test…
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=1.000 sec; Warm up time=400 msec; SentMessages=10006; ReceivedMessages=100
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=0.510 sec; SentMessages=5101; ReceivedMessages=52
sockperf: ====> avg-latency=8.525 (std-dev=1.558)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 8.525 usec
sockperf: Total 52 observations; each percentile contains 0.52 observations
sockperf: ---> <MAX> observation = 11.204
sockperf: ---> percentile 99.999 = 11.204
sockperf: ---> percentile 99.990 = 11.204
sockperf: ---> percentile 99.900 = 11.204
sockperf: ---> percentile 99.000 = 11.175
sockperf: ---> percentile 90.000 = 10.616
sockperf: ---> percentile 75.000 = 9.575
sockperf: ---> percentile 50.000 = 8.494
sockperf: ---> percentile 25.000 = 7.460
sockperf: ---> <MIN> observation = 5.237
##################################################
trial 3
without CPU isolation
without VMA
##################################################
$ sockperf sr
…
$ sockperf ul
[ 0] IP = 0.0.0.0 PORT = 11111 # UDP
sockperf: Warmup stage (sending a few dummy messages)…
sockperf: Starting test…
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=1.000 sec; Warm up time=400 msec; SentMessages=10006; ReceivedMessages=100
sockperf: ========= Printing statistics for Server No: 0
sockperf: Test end (interrupted by signal 2)
sockperf: [Valid Duration] RunTime=0.520 sec; SentMessages=5201; ReceivedMessages=53
sockperf: ====> avg-latency=5.280 (std-dev=0.099)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 5.280 usec
sockperf: Total 53 observations; each percentile contains 0.53 observations
sockperf: ---> <MAX> observation = 5.480
sockperf: ---> percentile 99.999 = 5.480
sockperf: ---> percentile 99.990 = 5.480
sockperf: ---> percentile 99.900 = 5.480
sockperf: ---> percentile 99.000 = 5.459
sockperf: ---> percentile 90.000 = 5.420
sockperf: ---> percentile 75.000 = 5.350
sockperf: ---> percentile 50.000 = 5.275
sockperf: ---> percentile 25.000 = 5.219
sockperf: ---> <MIN> observation = 5.079
What about running sockperf without VMA, with and without isolcpus?
Did you try the “raw_ethernet_lat” application? It uses the same RAW QP type as VMA and doesn’t require VMA in the middle.
Try running perf record on the specific CPU and see the difference.
Add “intel_idle.max_cstate=0 processor.max_cstate=1” to the grub configuration, following the tuning guide.
Try removing “skew_tick=1” and see if it can be related; Mellanox doesn’t use this parameter in its testing. The kernel command line changes can be applied as sketched below.
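A minimal way to apply those command line changes on CentOS 7 (a hedged sketch, assuming a BIOS-booted grub2 layout):

# edit GRUB_CMDLINE_LINUX in /etc/default/grub: add the max_cstate options, drop skew_tick=1
$ sudo vi /etc/default/grub
$ sudo grub2-mkconfig -o /boot/grub2/grub.cfg
$ sudo reboot
# after the reboot, verify the new parameters took effect
$ cat /proc/cmdline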
Here are the test results:
with isolcpus and with VMA
sockperf: Summary: Latency is 6817.080 usec
without isolcpus and with VMA
sockperf: Summary: Latency is 8.525 usec
without isolcpus and without VMA
sockperf: Summary: Latency is 5.280 usec
I don’t have raw_ethernet_lat or perf installed; how do I install them?
I tried adding “intel_idle.max_cstate=0 processor.max_cstate=1” to the grub configuration; it only improved latency a little bit, and sockperf with VMA + isolcpus still takes 6xxx usec.
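For reference, on CentOS 7 both tools are normally available from the standard repositories, and raw_ethernet_lat ships as part of the perftest suite (a hedged sketch, assuming the default repos):

$ sudo yum install -y perf perftest
# perftest also provides ib_send_lat, ib_read_lat, raw_ethernet_lat, etc.
$ raw_ethernet_lat -h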
Hi, are you using Mellanox OFED? Is it a standard Red Hat kernel or a real-time one?
We are not using Mellanox OFED.
It is standard CentOS, not using a real-time kernel.
We are using the Mellanox ConnectX-5 MT2780.
Could you check if it happens with the latest Mellanox OFED v4.7 and the latest VMA v8.9.4? A typical upgrade path is sketched below.
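Upgrading usually means installing the MLNX_OFED bundle with VMA support (a hedged sketch; the exact tarball name depends on the release and distro):

$ ofed_info -s                      # valid output means Mellanox OFED is already installed
$ tar xzf MLNX_OFED_LINUX-4.7-*-rhel7.6-x86_64.tgz
$ cd MLNX_OFED_LINUX-4.7-*-rhel7.6-x86_64
$ sudo ./mlnxofedinstall --vma      # --vma pulls in the VMA-enabled stack
$ sudo /etc/init.d/openibd restart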
From the output of mlnx_tune, it looks like we are using mlnx-en-4.6-1.0.1.1 (OFED-4.6-1.0.1)?
Mellanox Technologies - System Report
Operation System Status
UNKNOWN
3.10.0-957.el7.x86_64
CPU Status
GenuineIntel Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz Skylake
Warning: Frequency 3400.0MHz
Memory Status
Total: 30.92 GB
Free: 28.00 GB
Hugepages Status
On NUMA 1:
Transparent enabled: never
Transparent defrag: always
Hyper Threading Status
INACTIVE
IRQ Balancer Status
NOT PRESENT
Firewall Status
NOT PRESENT
IP table Status
NOT PRESENT
IPv6 table Status
NOT PRESENT
Driver Status
OK: mlnx-en-4.6-1.0.1.1 (OFED-4.6-1.0.1)
ConnectX-5 Device Status on PCI 12:00.0
FW version 16.25.1020
OK: PCI Width x16
OK: PCI Speed 8GT/s
PCI Max Payload Size 256
PCI Max Read Request 4096
Local CPUs list [0, 1, 2, 3, 4, 5]
ens1f0 (Port 1) Status
Link Type eth
OK: Link status Up
Speed 10GbE
MTU 1500
OK: TX nocache copy ‘off’
ConnectX-5 Device Status on PCI 12:00.1
FW version 16.25.1020
OK: PCI Width x16
OK: PCI Speed 8GT/s
PCI Max Payload Size 256
PCI Max Read Request 4096
Local CPUs list [0, 1, 2, 3, 4, 5]
ens1f1 (Port 1) Status
Link Type eth
OK: Link status Up
Speed 10GbE
MTU 1500
OK: TX nocache copy ‘off’
2019-10-29 10:13:46,493 INFO System info file: /tmp/mlnx_tune_191029_101343.log
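As an aside, mlnx_tune can also apply a tuning profile rather than just report status (a hedged sketch; check mlnx_tune -h first, since the LOW_LATENCY_VMA profile may not exist in every release):

$ mlnx_tune -h                       # list the tuning profiles available in this version
$ sudo mlnx_tune -p LOW_LATENCY_VMA  # apply a latency-oriented profile, if available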
The issue is not reproducible on my setup. What are the sockperf results with isolcpus and without VMA?
Try the ‘perf’ tool to analyze the application and find the bottleneck, for example as sketched below.
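A minimal perf workflow for this comparison could be (a hedged sketch; CPU 7 and the 10-second window are placeholders, and sockperf should be running during the capture):

# record everything scheduled on the isolated CPU while the test runs
$ sudo perf record -C 7 -g -- sleep 10
$ sudo perf report --stdio | head -40
# or look at counters for the sockperf process itself
$ sudo perf stat -p $(pidof sockperf) -- sleep 10

Repeat with and without isolcpus and compare where the time goes.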
Here is the result with isolcpus and without VMA. Also, could you elaborate on “Try ‘perf’ tool to analyze the application and find a bottleneck”? What specifically should I do? Thanks.
##################################################
trial 4
with CPU isolation
without VMA
##################################################
$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-957.el7.x86_64 root=UUID=8cd89e3c-79df-456b-b517-a628b744692d ro crashkernel=auto rhgb quiet intel_idle.max_cstate=0 processor.max_cstate=1 isolcpus=7-11 skew_tick=1
$ taskset -cp 1
pid 1’s current affinity list: 0-6
$ taskset -c 7 sockperf sr
$ taskset -c 8 sockperf ul
sockperf: == version #3.6-no.git ==
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)
[ 0] IP = 0.0.0.0 PORT = 11111 # UDP
sockperf: Warmup stage (sending a few dummy messages)…
sockperf: Starting test…
sockperf: Test end (interrupted by timer)
sockperf: Test end (interrupted by signal 2)
sockperf: Test ended
sockperf: [Total Run] RunTime=1.000 sec; Warm up time=400 msec; SentMessages=10006; ReceivedMessages=100
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=0.520 sec; SentMessages=5201; ReceivedMessages=53
sockperf: ====> avg-latency=5.171 (std-dev=0.058)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 5.171 usec
sockperf: Total 53 observations; each percentile contains 0.53 observations
sockperf: ---> <MAX> observation = 5.302
sockperf: ---> percentile 99.999 = 5.302
sockperf: ---> percentile 99.990 = 5.302
sockperf: ---> percentile 99.900 = 5.302
sockperf: ---> percentile 99.000 = 5.287
sockperf: ---> percentile 90.000 = 5.256
sockperf: ---> percentile 75.000 = 5.204
sockperf: ---> percentile 50.000 = 5.159
sockperf: ---> percentile 25.000 = 5.136
sockperf: ---> <MIN> observation = 5.056