ConnectX-4 RX performance issues on DPDK


Are there any RX-side pps performance tips for the ConnectX-4 family / mlx5 PMD?

Our use case requires optimising RX pps; I don’t care about TX. Adding more receiving lcores actually decreases RX performance.

After applying the usual performance tips I am able to achieve 107 Mpps on the TX side (no RX) using one 5-tuple, or around 92 Mpps using 16 different 5-tuples for better RSS hashing.

However, I am not able to exceed 60 Mpps on the RX side even in a very favourable case, and I get around 18–37 Mpps in more typical cases. (Performance drops noticeably once the number of queues goes above 4.)

Running our DPDK application on 2x10G and 4x10G cards with other PMDs we see much more predictable performance scaling. I would have expected that with 8 RX lcores I would be close to 100 Mpps RX.

Test setup details:

  • testpmd + dpdk-pktgen or dpdk-pktgen alone
  • DPDK 17.11
  • one 2x100G OEM card Mellanox Technologies MT27700 Family [ConnectX-4], mt4115, FW upgraded to 12.21
  • two ports connected to each other via a 1 m MCP1600 copper DAC
  • PCIe 3.0 x16 slot, DevCtl MaxPayload 256 bytes, MaxReadReq 1024 bytes
  • E5-2650 v4 @ 2.20GHz CPU (12 cores), turbo disabled

I’m not expecting 148 Mpps here, but according to performance results from , the card should be able to do >90 Mpps full duplex using a single port.

I do use two ports, though: one for RX, one for TX.

Example commands:

./testpmd --file-prefix=820 --socket-mem=8192,8192 -l 12-23 -n 2 -w 0000:82:00.0,txq_inline=256 -- --port-topology=chained --forward-mode=rxonly --rss-udp --rxq=2 --txq=2 --nb-cores=8 --socket-num=1 --stats-period=1 --burst=128 --rxd=2048 --txd=512

./testpmd --file-prefix=820 --socket-mem=8192,8192 -l 12-23 -n 2 -w 0000:82:00.0,txq_inline=256 -- --port-topology=chained --forward-mode=rxonly --rss-udp --rxq=8 --txq=8 --nb-cores=8 --socket-num=1 --stats-period=1 --burst=128 --rxd=2048 --txd=512

./pktgen --file-prefix=both --socket-mem=28672,28672 -w 0000:82:00.0,txq_inline=256,txqs_min_inline=4 -w 0000:82:00.1,txq_inline=256,txqs_min_inline=4 -l 0-11,12-23 -n 4 -- -P -N -T -m "[1:12-15].0, [16-23:1].1"

Replying to myself; I hope somebody will find this useful.

  • moving the traffic path from card0/port0 -> card0/port1 to card0/port0 -> card1/port0 helped a lot
  • dpdk-pktgen requires some code tuning: more mbufs, larger bursts, etc.
  • dpdk-pktgen range traffic sometimes seems skewed / is not distributed evenly by the RX side’s RSS
  • I have had a better experience generating with testpmd in txonly mode, though it does not randomise IP addresses, and flowgen mode is very slow
  • be careful: testpmd requires #RX cores = #TX cores (it silently uses the MIN of the two numbers); in pktgen one can assign just a single core to the do-nothing RX side, which was better for txonly performance: ./pktgen --file-prefix=second --socket-mem=128,16384 -w 0000:82:00.0,txq_inline=128 -l 0,12-23 -n 2 -- -N -T -m "[12:13-23].0"
  • all in all, I was able to reach
    • around 85 Mpps rxonly using 8 cores (2.1 GHz, turbo off), and probably a little more (there were spare CPU cycles), as it absorbed 100% of the generator’s output;
    • up to 107 Mpps txonly using 11 local-NUMA cores plus some borrowed remote-NUMA cores

I am facing a similar issue with ConnectX-5 (dual-port 100G, PCIe 4).

I am running pktgen on another server, connected to the RX server via a 100G DAC (only one 100G port is used for testing). pktgen is generating 25 Mpps.

But the RX server is receiving at a rate of only 12–14 Mpps. I tried RSS, spreading traffic across 4 RX queues with a dedicated lcore reading from each queue, but the collective RX rate still remains 12–14 Mpps. No matter how many more queues I add, the total RX rate stays at 12–14 Mpps.

Any help would be highly appreciated. The RSS conf I used at the receiver side is given below:

.rx_adv_conf = {
    .rss_conf = {
        .rss_hf = ETH_RSS_IP | ETH_RSS_UDP |
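The snippet above is cut off after `ETH_RSS_UDP |`. For reference, a complete `rx_adv_conf` block in the DPDK 17.11-era API would look roughly like the sketch below — the trailing `ETH_RSS_TCP` flag and the `NULL` key are my assumptions, not the poster’s actual values:

```c
/* Sketch only: field names are from DPDK 17.11 rte_ethdev.h; the flag set
 * after ETH_RSS_UDP and the NULL key are assumptions, not the poster's conf. */
static const struct rte_eth_conf port_conf = {
    .rxmode = {
        .mq_mode = ETH_MQ_RX_RSS,   /* enable RSS spreading across RX queues */
    },
    .rx_adv_conf = {
        .rss_conf = {
            .rss_key = NULL,        /* NULL => use the PMD's (mlx5) default key */
            .rss_hf  = ETH_RSS_IP | ETH_RSS_UDP | ETH_RSS_TCP,
        },
    },
};
```

One thing worth checking in a setup like this: `rss_hf` decides which header fields feed the hash. If the generator varies only fields that are not covered (e.g. UDP ports while only IP hashing is effective), every packet hashes to the same queue and adding queues cannot raise the total RX rate.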