DPDK rte_flow is degrading performance when testing on ConnectX-5 100G EN @ 100G

Hi,

I am using DPDK 18.11 on Ubuntu 18.04, with a Mellanox ConnectX-5 100G EN (MLNX_OFED_LINUX-4.5-1.0.1.0-ubuntu18.04-x86_64).

Packet generator: t-rex 2.49 running on another machine.

I am able to achieve 100G line rate with l3fwd application (frame size 64B) using the parameters suggested in their performance report.

(https://fast.dpdk.org/doc/perf/DPDK_18_11_Mellanox_NIC_performance_report.pdf)

However, as soon as I install rte_flow rules to steer packets to different queues and/or use rte_flow’s mark action, the throughput reduces to ~41G. I also modified DPDK’s flow_filtering example application, and am getting the same reduced throughput of around 41G out of 100G. But without rte_flow, it goes to 100G.
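For reference, the rules I install are of this general shape; here is a minimal sketch of a queue-steering rule with the mark action (the port, queue and mark id values are just illustrative, not my exact ones):

#include <rte_flow.h>

/* Minimal sketch (illustrative values): match any IPv4 packet, tag it with a
 * mark and steer it to a given RX queue. */
static struct rte_flow *
create_mark_queue_rule(uint16_t port_id, uint16_t rx_q, uint32_t mark_id,
                       struct rte_flow_error *error)
{
        struct rte_flow_attr attr = { .ingress = 1 };
        struct rte_flow_action_mark mark = { .id = mark_id };
        struct rte_flow_action_queue queue = { .index = rx_q };
        struct rte_flow_item pattern[] = {
                { .type = RTE_FLOW_ITEM_TYPE_ETH },
                { .type = RTE_FLOW_ITEM_TYPE_IPV4 },
                { .type = RTE_FLOW_ITEM_TYPE_END },
        };
        struct rte_flow_action actions[] = {
                { .type = RTE_FLOW_ACTION_TYPE_MARK,  .conf = &mark },
                { .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = &queue },
                { .type = RTE_FLOW_ACTION_TYPE_END },
        };

        if (rte_flow_validate(port_id, &attr, pattern, actions, error) != 0)
                return NULL;
        return rte_flow_create(port_id, &attr, pattern, actions, error);
}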

I didn't change any OS/kernel parameters between testing l3fwd and the application that uses rte_flow. I also ensured the application is NUMA-aware and used 20 cores to handle the 100G traffic.

Upon further investigation (using Mellanox NIC counters), the drop in throughput is due to mbuf allocation errors.
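Besides the NIC counters, the mbuf allocation failures are also visible in the standard ethdev stats; a minimal sketch of checking them from inside the application (port id illustrative):

#include <inttypes.h>
#include <stdio.h>
#include <rte_ethdev.h>

/* Sketch: rx_nombuf counts RX mbuf allocation failures, imissed counts packets
 * the NIC dropped because the RX queues were not drained fast enough. */
static void
print_drop_counters(uint16_t port_id)
{
        struct rte_eth_stats stats;

        if (rte_eth_stats_get(port_id, &stats) == 0)
                printf("port %u: imissed=%" PRIu64 " rx_nombuf=%" PRIu64 "\n",
                       port_id, stats.imissed, stats.rx_nombuf);
}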

Is such performance degradation normal when performing HW acceleration using rte_flow on this NIC?

Has anyone tested throughput performance using rte_flow @ 100G?

It's surprising to see hardware offloading degrade performance, unless I am doing something wrong.

Thanks,

Arvind

P.S. I have posted the same question on the DPDK users community. If any Mellanox devs are there, I'd request them to respond there as well.

Hi,

Can you share the different tunings you applied?

Did you enable CQE_COMPRESSION? Is flow control disabled?

Did you try to increase the number of hugepages used? How many buffer descriptors and RX/TX queues?

Can you provide the rte_flow rules you added?

Thanks

Marc

Sure. I mostly used the same settings as your DPDK 18.11 performance test configuration.

Yes, CQE_COMPRESSION is enabled and flow control is disabled.

Lots of hugepages available, almost 56x1G free.

Kernel Parameters

aum:~$ sudo cat /proc/cmdline

BOOT_IMAGE=/vmlinuz-4.15.0-43-generic ro quiet splash isolcpus=24-47 intel_idle.max_cstate=0 processor.max_cstate=0 intel_pstate=disable nohz_full=24-47 rcu_nocbs=24-47 rcu_nocb_poll default_hugepagesz=1G hugepagesz=1G hugepages=64 audit=0 nosoftlockup vt.handoff=1

Hugepages

aum:~$ grep -i huge /proc/meminfo

AnonHugePages: 0 kB

ShmemHugePages: 0 kB

HugePages_Total: 64

HugePages_Free: 56

HugePages_Rsvd: 0

HugePages_Surp: 0

Hugepagesize: 1048576 kB

MLNX TUNE OUTPUT

aum:~$ sudo mlnx_tune

Mellanox Technologies - System Report

Operation System Status

UBUNTU18.04

4.15.0-43-generic

CPU Status

GenuineIntel Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz Skylake

Warning: Frequency 3408.62MHz

Memory Status

Total: 376.62 GB

Free: 296.17 GB

Hugepages Status

On NUMA 1:

Transparent enabled: madvise

Transparent defrag: madvise

Hyper Threading Status

INACTIVE

IRQ Balancer Status

INACTIVE

Firewall Status

NOT PRESENT

IP table Status

NOT PRESENT

IPv6 table Status

NOT PRESENT

Driver Status

OK: MLNX_OFED_LINUX-4.5-1.0.1.0 (OFED-4.5-1.0.1)

ConnectX-5EX Device Status on PCI af:00.0

FW version 16.24.1000

OK: PCI Width x16

Warning: PCI Speed 8GT/s >>> PCI width status is below PCI capabilities. Check PCI configuration in BIOS.

PCI Max Payload Size 256

PCI Max Read Request 1024

Local CPUs list [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47]

enp175s0f0 (Port 1) Status

Link Type eth

OK: Link status Up

Speed 100GbE

MTU 1500

OK: TX nocache copy ‘off’

ConnectX-5EX Device Status on PCI af:00.1

FW version 16.24.1000

OK: PCI Width x16

Warning: PCI Speed 8GT/s >>> PCI width status is below PCI capabilities. Check PCI configuration in BIOS.

PCI Max Payload Size 256

PCI Max Read Request 1024

Local CPUs list [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47]

enp175s0f1 (Port 1) Status

Link Type eth

OK: Link status Up

Speed 100GbE

MTU 1500

OK: TX nocache copy ‘off’

RING PARAMETERS

aum:~$ sudo ethtool -g enp175s0f0

Ring parameters for enp175s0f0:

Pre-set maximums:

RX: 8192

RX Mini: 0

RX Jumbo: 0

TX: 8192

Current hardware settings:

RX: 8192

RX Mini: 0

RX Jumbo: 0

TX: 8192

aum:~$ sudo ethtool -g enp175s0f1

Ring parameters for enp175s0f1:

Pre-set maximums:

RX: 8192

RX Mini: 0

RX Jumbo: 0

TX: 8192

Current hardware settings:

RX: 8192

RX Mini: 0

RX Jumbo: 0

TX: 8192

RX/TX Queues

20 RX and 20 TX queues

Buffer Descriptors on Socket 1

50k+
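The mbuf pool is created on socket 1, where the NIC sits; a sketch of the kind of call used (the sizes here are illustrative, not my exact values):

#include <rte_mbuf.h>

/* Illustrative sketch: a single mbuf pool on the NIC's NUMA socket (socket 1),
 * sized to cover the 20 RX queues plus TX rings and per-lcore caches. */
static struct rte_mempool *
create_pool_on_socket1(void)
{
        return rte_pktmbuf_pool_create("mbuf_pool_s1",
                        512 * 1024,                /* total mbufs (illustrative) */
                        512,                       /* per-lcore cache size */
                        0,                         /* private area size */
                        RTE_MBUF_DEFAULT_BUF_SIZE, /* data room size */
                        1);                        /* socket id of the ConnectX-5 */
}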

rte_flow

The rte_flow rules are similar to those in DPDK's flow_filtering example.

I generate packets with dst_ip = 192.168.1.x, where x ranges from 1 to 80.

I call the generate_ipv4_flow() function in this file to install the rules.

I steer each of the 80 dst_ips to one of the 20 RX queues.

Snippet below.

/* create flow rules to steer packets to RX queues */
uint32_t tempdst, tempsrc;
struct rte_flow *flow;
struct rte_flow_error error;

tempsrc = IPv4(192, 168, 0, 1);
for (int i = 1; i < 81; i++) {
        tempdst = IPv4(192, 168, 1, i);
        flow = generate_ipv4_flow(0, i % 20,
                        tempsrc, 0,
                        tempdst, 32, &error);
        if (!flow) {
                RTE_LOG(ERR, CALF, "Flow can't be created %d message: %s\n",
                                error.type,
                                error.message ? error.message : "(no stated reason)");
                rte_exit(EXIT_FAILURE, "Error in creating flow");
        }
        RTE_LOG(INFO, CALF, "Flow %d created.\n", i);
}
RTE_LOG(INFO, CALF, "All flows created.\n");
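For anyone who hasn't seen the example, generate_ipv4_flow() is essentially the helper from DPDK's flow_filtering sample: it builds an ingress ETH + IPv4 pattern with the given source/destination addresses and masks plus a single QUEUE action. A rough sketch from memory (note that, as far as I recall, the example passes full 32-bit masks such as 0xffffffff here, not prefix lengths):

#include <string.h>
#include <arpa/inet.h>
#include <rte_flow.h>

/* Rough sketch of the flow_filtering helper: steer ingress IPv4 packets that
 * match src/dst (under the given masks) to RX queue rx_q. */
static struct rte_flow *
generate_ipv4_flow(uint16_t port_id, uint16_t rx_q,
                   uint32_t src_ip, uint32_t src_mask,
                   uint32_t dest_ip, uint32_t dest_mask,
                   struct rte_flow_error *error)
{
        struct rte_flow_attr attr = { .ingress = 1 };
        struct rte_flow_item pattern[3];
        struct rte_flow_action action[2];
        struct rte_flow_action_queue queue = { .index = rx_q };
        struct rte_flow_item_ipv4 ip_spec;
        struct rte_flow_item_ipv4 ip_mask;
        struct rte_flow *flow = NULL;

        memset(pattern, 0, sizeof(pattern));
        memset(action, 0, sizeof(action));
        memset(&ip_spec, 0, sizeof(ip_spec));
        memset(&ip_mask, 0, sizeof(ip_mask));

        /* Single action: send matching packets to rx_q. */
        action[0].type = RTE_FLOW_ACTION_TYPE_QUEUE;
        action[0].conf = &queue;
        action[1].type = RTE_FLOW_ACTION_TYPE_END;

        /* Pattern: any Ethernet header, then IPv4 with the given addresses/masks. */
        pattern[0].type = RTE_FLOW_ITEM_TYPE_ETH;

        ip_spec.hdr.src_addr = htonl(src_ip);
        ip_mask.hdr.src_addr = src_mask;
        ip_spec.hdr.dst_addr = htonl(dest_ip);
        ip_mask.hdr.dst_addr = dest_mask;
        pattern[1].type = RTE_FLOW_ITEM_TYPE_IPV4;
        pattern[1].spec = &ip_spec;
        pattern[1].mask = &ip_mask;

        pattern[2].type = RTE_FLOW_ITEM_TYPE_END;

        if (rte_flow_validate(port_id, &attr, pattern, action, error) == 0)
                flow = rte_flow_create(port_id, &attr, pattern, action, error);

        return flow;
}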

If I do not install any flow rules, I am able to achieve line rate. :\

Troubleshooting this was rather painful and not obvious.

DEV_TX_OFFLOAD_VLAN_INSERT

DEV_TX_OFFLOAD_TCP_TSO

Configuring the ports with the above TX offload settings was causing this unusual drop.

Both of these settings are part of DPDK's flow_filtering example source code, which made this less obvious.

I am not sure whether this drop in throughput is expected; if not, there could be a bug in the implementation.
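For reference, the fix on my side was simply to stop requesting those offloads when configuring the port. A rough sketch of the change (function name and queue counts are illustrative):

#include <string.h>
#include <rte_ethdev.h>

/* Sketch: request only the TX offloads the application actually needs, instead
 * of copying the example's list (which includes DEV_TX_OFFLOAD_VLAN_INSERT and
 * DEV_TX_OFFLOAD_TCP_TSO). */
static int
configure_port(uint16_t port_id, uint16_t nb_rxq, uint16_t nb_txq)
{
        struct rte_eth_conf port_conf;

        memset(&port_conf, 0, sizeof(port_conf));
        /* No VLAN insertion or TSO requested here; if an offload is needed,
         * add it only after checking dev_info.tx_offload_capa. */
        port_conf.txmode.offloads = 0;

        return rte_eth_dev_configure(port_id, nb_rxq, nb_txq, &port_conf);
}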

Arvind or Marc, was this degradation due to any type of rte_flow action, or only those specific offloads? Is there a solution if you need to use them?

What are the specs of the machine?

And have you also tried dpdk-pktgen to see what kind of results you get with that?

With CX5/CX6 and DPDK:

Setting dv_flow_en=1 (the default) will affect RX performance.

With dv_flow_en=0 (Verbs flow engine), this problem does not occur!
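In case it helps anyone trying this: dv_flow_en is an mlx5 PMD device argument, so it is set per device on the EAL command line, for example (PCI address taken from this thread; the option is -w/--pci-whitelist on older DPDK releases and -a/--allow on newer ones):

./testpmd -l 1-21 -n 4 -w 0000:af:00.0,dv_flow_en=0 -- --rxq=20 --txq=20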