Azure + Mellanox (DPDK 19.11.11, 21.11.0, 22.03.0) - losing packets after minutes to hours of fine operation

I have a VM in Microsoft Azure. There we run a DPDK application which reads traffic and duplicates it to a bunch of hosts. If the server has been idle for a longer time and we start the application, it runs fine for 1 to 5 hours: it successfully receives an average of about 25'000 packets per second and sends out about 42'000 packets per second.

After that we see a sudden increase in the time spent in the function call rte_eth_rx_burst: it increases from an average of 300 ns to 150 us. At the same time we start losing/not receiving a lot of packets, while the interface shows no increase in imissed, ierrors or rx_nombuf.

 RX Port Information
     port           driver      packets       Mbytes       missed       errors   mbuf fails
        1     net_failsafe    398669949   47891.6982         7151            0            0
        0         mlx5_pci    398657879   47890.0013         7151            0            0
        2          net_tap        12209       1.7128            0            0            0

 TX Port Information
     port           driver      packets       Mbytes       errors
        1     net_failsafe    265721410   31610.0295            0
        0         mlx5_pci    265721410   31610.0295            0
        2          net_tap            0       0.0000            0

Note that after this degradation, the lost packet count should be far above 1 million; see the graph over time below.
The different lines are just different streams in this product set. Occasional packet drops are expected, as it receives UDP from around the globe. (Y-axis: lost packets, log scale)

The same application, with the same input sources, runs fine for weeks on Amazon AWS and Alibaba Cloud.

The environment is a 16-core server with 2 accelerated network interfaces, running Ubuntu 18.04.

Network devices using kernel driver
51e0:00:02.0 'MT27710 Family [ConnectX-4 Lx Virtual Function] 1016' if=enP20960s2 drv=mlx5_core unused=vfio-pci
e4da:00:02.0 'MT27710 Family [ConnectX-4 Lx Virtual Function] 1016' if=enP58586s1 drv=mlx5_core unused=vfio-pci

Enabling all logs with --log-level='.*,8' does not give any insight at the time it degrades.
We tested DPDK 19.11.11, 21.11.0 and 22.03.0 on it, using the following parameters.

<prog> -a 51e0:00:02.0 --vdev net_vdev_netvsc0,iface=eth1

On top, the test was run with 1 to 6 rx/tx queues, but it didn't seem to make a huge difference.

Good to know
If DPDK, the Mellanox driver, etc. is in this broken state, the application can be stopped and started again.
But it seems to be immediately in the broken state again. Waiting for some minutes lets the application run fine again for 5 to 15 minutes, until it breaks again. If the machine is restarted, it usually works for 1 to 6 hours again.
It is as if the network driver/interface is stuck in some weird broken state.

From the description, it's hard to know what causes the issue. To isolate the root cause, you could try the following:

  1. Check whether the firmware/driver version is the same as in AWS/Alibaba; if not, test with the same version.
  2. Check whether the kernel/OS version is the same as in AWS/Alibaba; if not, test with the same version.
  3. If possible, run a demo application (e.g. testpmd or l3fwd) and check whether it shows the same issue.
  4. Check the application and the system for memory leaks.
  5. Test with the newest version of OFED (including driver & firmware).
    The newest version can be downloaded from: Linux InfiniBand Drivers
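For steps 1-3, a possible set of commands on the Azure VM (a sketch, not a definitive recipe: the interface and PCI address are taken from the report above, and the testpmd line is an assumption modeled on the reporter's own invocation — adjust names to your VM):

```shell
# 1) Driver and firmware version of the ConnectX-4 Lx VF, to compare with AWS/Alibaba:
ethtool -i enP20960s2            # prints driver, version, firmware-version
modinfo mlx5_core | grep -i version

# 2) Kernel and OS version:
uname -r
lsb_release -a

# 3) Reproduce with a bundled demo instead of the application, e.g.:
# dpdk-testpmd -a 51e0:00:02.0 --vdev net_vdev_netvsc0,iface=eth1 -- --stats-period 1
```

If testpmd shows the same degradation, the problem is below the application (PMD, firmware, or the Azure netvsc/failsafe path) rather than in the duplicating application itself.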