`rte_flow_async_create` takes around 1 million cycles per call on ConnectX-6 Dx

Hi everyone,

I’ve been using the DPDK rte_flow asynchronous API to create a large number of rules at run-time, and I’ve been stunned at how long each rte_flow_async_create() call takes. For context, I timed it like this:

uint64_t start_cycles = rte_get_tsc_cycles();

entry->src_rule_handle = rte_flow_async_create(dpdk_port_id, ctx->flow_queue_id,
        &async_op_params,
        Ptrs->offload.src_template_tables[dpdk_port_id],
        src_pattern, pattern_index,
        src_actions, 0,
        entry, &error);

uint64_t end_cycles = rte_get_tsc_cycles();

The difference averages around 1 million cycles.
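
For reference, a rough conversion of that cycle count to wall-clock time looks like the sketch below (rte_get_tsc_hz() gives the TSC frequency; at an assumed 2.5 GHz TSC, ~1 million cycles would be about 400 us per rule, but the exact figure depends on the machine):

/* Rough conversion of measured TSC cycles to microseconds. */
uint64_t tsc_hz = rte_get_tsc_hz();
double usecs = (double)(end_cycles - start_cycles) * 1e6 / (double)tsc_hz;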

My specs:

  Device Type:      ConnectX6DX
  Part Number:      MCX623106AN-CDA_Ax
  Description:      ConnectX-6 Dx EN adapter card; 100GbE; Dual-port QSFP56; PCIe 4.0/3.0 x16;
  PSID:             MT_0000000359
  Versions:         Current        Available     
     FW             22.46.1006     N/A           
     PXE            3.8.0100       N/A           
     UEFI           14.39.0013     N/A           

And I’m using DPDK version 24.11.2.

I have tried the following mitigation strategies:

  • I tried toggling the postpone flag, with little effect (see the sketch after this list).
  • I made sure to avoid contention by running the test with one thread only.
  • I used four different NICs to verify my results.
  • Always made sure to run dv_flow_en=2.
  • Tried running with different groups, actions, and patterns. None of these had any effect on performance, except setting src_pattern to NULL, although that matches everything.
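
For reference, this is a minimal sketch of what I mean by toggling the postpone flag, with rte_flow_push()/rte_flow_pull() flushing and draining the queue afterwards; BURST_SIZE, rules, patterns, actions, and user_data are placeholders from my test harness, not DPDK symbols:

struct rte_flow_op_attr op_attr = { .postpone = 1 };    /* defer the doorbell */
struct rte_flow_error error;
struct rte_flow_op_result results[BURST_SIZE];
int done = 0;

for (int i = 0; i < BURST_SIZE; i++)
    rules[i] = rte_flow_async_create(dpdk_port_id, ctx->flow_queue_id,
            &op_attr, Ptrs->offload.src_template_tables[dpdk_port_id],
            patterns[i], 0, actions[i], 0, &user_data[i], &error);

/* Flush all postponed operations to the hardware in one go... */
rte_flow_push(dpdk_port_id, ctx->flow_queue_id, &error);

/* ...and drain the completions. */
while (done < BURST_SIZE) {
    int n = rte_flow_pull(dpdk_port_id, ctx->flow_queue_id, results,
            BURST_SIZE - done, &error);
    if (n < 0)
        break;
    done += n;
}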

Am I doing something wrong? I’m nowhere near the insertion throughput advertised.

Hey mrasheed,

How many rules are we working with here, and what is the rate you are seeing?

Does using multiple cores make any difference in your test?

What are the matches and actions for the flow rules?

Are you calling rte_flow_configure() prior to creating the flow rules?
For more information on configuring HW Steering:

https://doc.dpdk.org/guides-24.11/nics/mlx5.html

https://github.com/DPDK/dpdk/blob/main/doc/guides/nics/mlx5.rst#hardware-steering
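
For reference, a minimal sketch of that configure step is below; it has to happen once per port before any pattern/actions templates, template tables, or async rules are created. The queue depth and object counts here are arbitrary placeholders, not recommendations:

struct rte_flow_port_attr port_attr = {
    .nb_counters = 0,   /* pre-allocate indirect objects here if the rules need them */
};
struct rte_flow_queue_attr queue_attr = { .size = 1024 };   /* per-queue op depth */
const struct rte_flow_queue_attr *queue_attrs[] = { &queue_attr };
struct rte_flow_error error;

if (rte_flow_configure(port_id, &port_attr, 1, queue_attrs, &error) != 0)
    rte_exit(EXIT_FAILURE, "rte_flow_configure: %s\n",
            error.message ? error.message : "(no message)");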

Thanks.
Eric

Hi Eric,

I’ve debugged this issue with Dariusz Sosnowski on DPDK’s Slack channel.

Essentially, the issue was writing rules to group 0. Apparently the overhead of creating a rule in group 0 is extremely large; installing a jump from group 0 to group 1 and then creating the rules in group 1 eliminated the problem.
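
Roughly, the one-time setup looks like the sketch below. I’m showing the single group 0 jump rule with the synchronous rte_flow_create() (that part is just one way of doing it), and port_id is a placeholder:

/* One-time setup: steer all ingress traffic from group 0 (the root table,
 * which is slow to insert into) to group 1, where the real rules live. */
struct rte_flow_attr attr = { .group = 0, .ingress = 1 };
struct rte_flow_item pattern[] = {
    { .type = RTE_FLOW_ITEM_TYPE_ETH },
    { .type = RTE_FLOW_ITEM_TYPE_END },
};
struct rte_flow_action_jump jump = { .group = 1 };
struct rte_flow_action actions[] = {
    { .type = RTE_FLOW_ACTION_TYPE_JUMP, .conf = &jump },
    { .type = RTE_FLOW_ACTION_TYPE_END },
};
struct rte_flow_error error;
struct rte_flow *jump_rule = rte_flow_create(port_id, &attr, pattern, actions, &error);

/* The template tables used for rte_flow_async_create() are then created
 * with .group = 1 in their flow attribute instead of 0. */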

I struggled to find this in the documentation, so I’ll try adding it myself via a patch.

Thanks.
Mohand