The raw throughput of BlueField-1 cannot reach line rate

Hi, I am testing the raw throughput of a BlueField DPU (model MBF1L516A-CSNAT). I find that a 64B-packet flow cannot reach the line rate of 100 Gbps (~148 Mpps). I am not sure whether my measurement strategy is correct.
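
For reference, a quick back-of-the-envelope check of that number (a sketch; it only accounts for the standard 20B of per-frame Ethernet overhead):

    # A 64B frame occupies 64B + 20B on the wire (preamble, start-of-frame
    # delimiter, and inter-frame gap), so the 100 GbE line rate in packets/s is:
    echo $(( 100 * 10**9 / ((64 + 20) * 8) ))   # ~148809523 pps, i.e. ~148.8 Mpps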

Target - [SmartNIC/EMBEDDED_CPU mode]
  • The throughput when sending packets from the local BlueField OS to the local host
  • The throughput when sending packets from the local host to the local BlueField OS
Environment and Tools
  • The host runs Ubuntu 20.04 with the Linux 5.4.0 kernel; the BlueField OS runs Ubuntu 20.04 with the Linux 5.4.0-1008-bluefield kernel
  • DPDK-21.11, pktgen-21.11.0
  • ‘ethtool -A $NIC rx off tx off’ to disable pause frames (see the setup sketch below)
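The NIC and hugepage setup on both sides was roughly the following (a sketch; $NIC and the hugepage amount are placeholders, not the exact values from my runs):

    # Disable Ethernet flow control (pause frames) on the port under test
    ethtool -A $NIC rx off tx off
    # Confirm the current pause settings
    ethtool -a $NIC
    # Reserve hugepages for DPDK/pktgen (amount is an example)
    dpdk-hugepages.py --setup 8G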
Results
  • We use 14 cores for pktgen TX on the local host and 14 cores for pktgen RX on the BlueField. When sending 64B packets from the local host to the local BlueField, the maximum send throughput is ~44 Mpps, and the corresponding receive throughput is ~31 Mpps. When the packet size is changed to 1500B, the throughput reaches ~110 Gbps, which is roughly the PCIe bandwidth limit
  • We use 14 cores for pktgen RX on the local host and 14 cores for pktgen TX on the BlueField. When sending 64B packets from the local BlueField to the local host, the maximum send throughput is ~32 Mpps, while the receive throughput is ~19 Mpps. When the packet size is changed to 1500B, the throughput also reaches ~110 Gbps (roughly the PCIe bandwidth limit). A sketch of the pktgen invocation is below
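The pktgen invocation on each side was along these lines (a sketch, not the exact command; the PCI address, memory-channel count, and core-to-port mapping are placeholders):

    # EAL: cores 0-14, 4 memory channels, device selected by PCI address (placeholder);
    # pktgen: -P enables promiscuous mode, -m maps cores 1-14 to port 0 for RX/TX
    pktgen -l 0-14 -n 4 -a 0000:03:00.0 -- -P -m "[1-14].0"
    # At the pktgen prompt, for the 64B test:
    #   set 0 size 64
    #   set 0 rate 100
    #   start 0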
Questions
  • Can this card reach line rate for 64B packets using DPDK? Or does this result mean that BlueField’s Arm cores cannot generate or receive packets at line rate?
  • When sending from the local host to the local BlueField, why can’t the host generate packets at line rate? Is there back pressure from the BlueField?
  • Are there better ways to measure the throughput of the BlueField card?

Thanks. :)

Hi user52115,

Thank you for posting your inquiry to the NVIDIA developer forums.

In your results, you state that you can achieve line rate using a larger packet size (1500B), as opposed to a smaller packet size (64B). This is expected - smaller packets achieve lower latency, whereas larger packets enable higher throughput (at the cost of latency).

You may be able to realize better results via system tuning:
https://support.mellanox.com/s/article/performance-tuning-for-mellanox-adapters
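
Two generic starting points from that guide, as a sketch (mlnx_tune ships with MLNX_OFED; profile availability may vary by version):

    # Pin the CPU frequency governor to performance (cpupower from linux-tools)
    cpupower frequency-set -g performance
    # Apply a throughput-oriented profile with the Mellanox tuning tool, if installed
    mlnx_tune -p HIGH_THROUGHPUT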

If you require further assistance with system tuning or benchmarking, please open a support case at: https://support.mellanox.com/s/

Best,
NVIDIA Networking Support
