Hello guys,
I am playing around with a BlueField-2 DPU and I am trying to write a simple load generator using DPDK. My BlueField-2 is in embedded mode. My main function is just a loop sending packets from p0 of the DPU to another port of the server the DPU is installed in. I use the rte_eth_tx_burst API to send the packets, and I launch multiple threads on different cores with rte_eal_mp_remote_launch. But my peak rate with 64-byte packets is about 2 Mpps no matter how many cores I use. Can you give me some insight into how I can improve performance with multiple cores?
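For context, the structure of my generator is roughly the sketch below (simplified, not my exact code: the port id, queue and descriptor counts, mempool sizing and the header fill are placeholders). Each worker lcore is supposed to burst into its own TX queue, since rte_eth_tx_burst() is not thread-safe for a single port/queue pair.

#include <stdint.h>
#include <stdlib.h>
#include <rte_eal.h>
#include <rte_debug.h>
#include <rte_ethdev.h>
#include <rte_lcore.h>
#include <rte_mbuf.h>

#define PORT_ID 0      /* assumed: p0 of the DPU */
#define NB_TXQ  3      /* one TX queue per worker lcore (assumed) */
#define BURST   32
#define PKT_LEN 64

static struct rte_mempool *tx_pool;

/* Each worker lcore bursts into its own TX queue (queue id passed as arg).
 * rte_eth_tx_burst() is not thread-safe for a single (port, queue) pair,
 * so sharing one queue across cores would serialize or corrupt the TX path. */
static int
tx_loop(void *arg)
{
    uint16_t qid = (uint16_t)(uintptr_t)arg;
    struct rte_mbuf *bufs[BURST];

    for (;;) {
        if (rte_pktmbuf_alloc_bulk(tx_pool, bufs, BURST) != 0)
            continue;
        for (int i = 0; i < BURST; i++) {
            bufs[i]->data_len = PKT_LEN;
            bufs[i]->pkt_len  = PKT_LEN;
            /* ... write Ethernet/IP headers at rte_pktmbuf_mtod(bufs[i], char *) ... */
        }
        uint16_t sent = rte_eth_tx_burst(PORT_ID, qid, bufs, BURST);
        for (uint16_t i = sent; i < BURST; i++)  /* free what the PMD refused */
            rte_pktmbuf_free(bufs[i]);
    }
    return 0;
}

int
main(int argc, char **argv)
{
    struct rte_eth_conf conf = {0};
    unsigned int lcore;
    uintptr_t qid = 0;
    uint16_t q;

    if (rte_eal_init(argc, argv) < 0)
        rte_exit(EXIT_FAILURE, "EAL init failed\n");

    tx_pool = rte_pktmbuf_pool_create("tx_pool", 8192 * NB_TXQ, 512, 0,
                                      RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
    if (tx_pool == NULL)
        rte_exit(EXIT_FAILURE, "mbuf pool creation failed\n");

    /* 1 RX queue (unused), NB_TXQ TX queues */
    if (rte_eth_dev_configure(PORT_ID, 1, NB_TXQ, &conf) < 0)
        rte_exit(EXIT_FAILURE, "port configure failed\n");
    if (rte_eth_rx_queue_setup(PORT_ID, 0, 1024, rte_socket_id(), NULL, tx_pool) < 0)
        rte_exit(EXIT_FAILURE, "rxq setup failed\n");
    for (q = 0; q < NB_TXQ; q++)
        if (rte_eth_tx_queue_setup(PORT_ID, q, 1024, rte_socket_id(), NULL) < 0)
            rte_exit(EXIT_FAILURE, "txq setup failed\n");
    if (rte_eth_dev_start(PORT_ID) < 0)
        rte_exit(EXIT_FAILURE, "port start failed\n");

    /* one worker lcore per TX queue */
    RTE_LCORE_FOREACH_WORKER(lcore) {
        if (qid >= NB_TXQ)
            break;
        rte_eal_remote_launch(tx_loop, (void *)qid, lcore);
        qid++;
    }
    rte_eal_mp_wait_lcore();
    return 0;
}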
Maybe you could use the dpdk-testpmd tool to send packets in 'txonly' mode.
Thanks for your response. I am now using DPDK 24.11.0-rc0 and I ran testpmd with: -l 0-4 -n 4 -m 2048 -a 03:00.0 -- --portmask=0x1 --txd=8192 --rxd=8192 --mbcache=512 --rxq=1 --txq=2 --burst=32 --nb-cores=2 --forward-mode=txonly -i
According to top it only utilizes one core, even though I specified 2 TX queues and 2 cores.
testpmd> show config fwd
txonly packet forwarding - ports=1 - cores=1 - streams=1 - NUMA support enabled, MP allocation mode: native
Logical Core 1 (socket 0) forwards packets on 1 streams:
RX P=0/Q=0 (socket 0) -> TX P=0/Q=0 (socket 0) peer=02:00:00:00:00:00
testpmd> start
txonly packet forwarding - ports=1 - cores=1 - streams=1 - NUMA support enabled, MP allocation mode: native
Logical Core 1 (socket 0) forwards packets on 1 streams:
RX P=0/Q=0 (socket 0) -> TX P=0/Q=0 (socket 0) peer=02:00:00:00:00:00
txonly packet forwarding packets/burst=32
packet len=64 - nb packet segments=1
nb forwarding cores=2 - nb forwarding ports=1
port 0: RX queue number: 1 Tx queue number: 2
Rx offloads=0x0 Tx offloads=0x10000
RX queue: 0
RX desc=8192 - RX free threshold=64
RX threshold registers: pthresh=0 hthresh=0 wthresh=0
RX Offloads=0x0
TX queue: 0
TX desc=8192 - TX free threshold=0
TX threshold registers: pthresh=0 hthresh=0 wthresh=0
TX offloads=0x10000 - TX RS bit threshold=0
It seems that my configuration does not really affect its behavior. Can you tell me how to correctly enable multi-core packet generation with testpmd?
There is another problem: I measure the packets on the receiver side with pfcount, but the number received is much lower than the amount shown by testpmd. Is that normal?
1. Run dpdk-testpmd in interactive mode; both uplink ports can be used. The command looks like this:
/opt/mellanox/dpdk/bin/dpdk-testpmd -a 03:00.0 -a 03:00.1 --socket-mem=1024 -- --total-num-mbufs=13100 -i -a --nb-cores=2 --forward-mode=txonly
If everything is OK, the output will look like this.
Then you can type 'show port stats all' to show the traffic statistics for the two ports in real time.
If packet generation has too much work to do per packet, too much computation or too many main-memory accesses (cache misses), the TX speed will not reach line rate.
Obviously, a single Arm CPU core cannot send 64-byte packets at line rate!
The line rate is 37.2 Mpps if the uplink port's physical bandwidth is 25 Gb/s.
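For reference, that 37.2 Mpps comes from the per-frame cost on the wire (64-byte frame plus 8 bytes of preamble/SFD and 12 bytes of inter-frame gap):

(64 + 20) bytes x 8 = 672 bits per frame
25 x 10^9 bit/s / 672 bit = approx. 37.2 Mpps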
I understand I can use two ports, but what I want is multiple TX queues mapped to a single port, with one core per TX queue generating packets. For example, for port 0 I want two TX queues and a core mapped to each queue. I only want to see whether multiple cores give a performance increase over a single core. Can I achieve this with testpmd?
Regarding my own simple application: I tried a hugepage size of 1G, but there is no performance increase over 2M hugepages. With top I can see that multiple cores are utilized, but on the receiver side there is no performance increase.
I also cannot match the total tx-packets number shown in testpmd's port statistics on the receiver side. Is there a better tool than pfcount for measuring the real performance? I get the feeling that either not all packets are really sent, or not all received packets are counted.
I got what you want!
The main point is that multiple cores drive multiple queues to improve the sending performance!
Use testpmd like this:
/opt/mellanox/dpdk/bin/dpdk-testpmd -a 00:03:00.1 --socket-mem=1024 -- --total-num-mbufs=13100 -i --rxq=2 --txq=2 --nb-cores=2 --forward-mode=txonly
With 2 lcores and 2 TX queues, the TX speed is twice that of a single lcore! (Note that --rxq has to be raised together with --txq: as far as I know, testpmd creates one forwarding stream per RX/TX queue pair, which is why your earlier run with --rxq=1 only used one core even in txonly mode.)
TX side: BF2 DPU
RX side: ConnectX-5
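About the pfcount mismatch: a capture tool only counts what actually reaches it, so packets dropped before capture are invisible to it. If the receiving port is bound to DPDK, you can read the NIC's own counters and compare them with testpmd's tx-packets total. A minimal sketch (the port id is an assumption; call this after rte_eal_init() once the port has been started):

#include <inttypes.h>
#include <stdio.h>
#include <rte_ethdev.h>

/* Read the receiver port's hardware counters so they can be compared with
 * the TX total reported by testpmd. imissed counts packets the NIC dropped
 * because the RX queues were not drained fast enough; rx_nombuf counts
 * mbuf allocation failures. */
static void
print_rx_counters(uint16_t port_id)
{
    struct rte_eth_stats st;

    if (rte_eth_stats_get(port_id, &st) != 0) {
        printf("stats_get failed for port %u\n", port_id);
        return;
    }
    printf("port %u: ipackets=%" PRIu64 " imissed=%" PRIu64
           " ierrors=%" PRIu64 " rx_nombuf=%" PRIu64 "\n",
           port_id, st.ipackets, st.imissed, st.ierrors, st.rx_nombuf);
}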