Azure + Mellanox (DPDK 19.11.11, 21.11.0, 22.03.0) - losing packets after minutes to hours of fine operation

I have a VM in Microsoft Azure. There we run a DPDK application which reads traffic and duplicates it to a bunch of hosts. If the server has been idle for a longer time and we start the application, it runs fine for 1 to 5 hours: it successfully receives an average of about 25'000 packets per second and sends out about 42'000 packets per second.

After that we see a sudden increase in the time spent in the function call rte_eth_rx_burst: it increases from an average of 300 ns to 150 us. At the same time we start losing/not receiving a lot of packets, while the interface shows no increase in imissed, ierrors or rx_nombuf.

 RX Port Information
     port           driver      packets       Mbytes       missed       errors   mbuf fails
        1     net_failsafe    398669949   47891.6982         7151            0            0
        0         mlx5_pci    398657879   47890.0013         7151            0            0
        2          net_tap        12209       1.7128            0            0            0

 TX Port Information
     port           driver      packets       Mbytes       errors
        1     net_failsafe    265721410   31610.0295            0
        0         mlx5_pci    265721410   31610.0295            0
        2          net_tap            0       0.0000            0

Note that after this degradation, the lost packet count should be far above 1 million; see the graph over time below.
The different lines are just different streams in this product set. Occasional packet drops are expected, as it receives UDP from around the globe. (Y-axis: lost packets, log scale)

The same application, with the same input sources, runs fine for weeks on Amazon AWS and Alibaba Cloud.

The environment is a 16-core server with 2 accelerated network interfaces, running Ubuntu 18.04.

Network devices using kernel driver
51e0:00:02.0 'MT27710 Family [ConnectX-4 Lx Virtual Function] 1016' if=enP20960s2 drv=mlx5_core unused=vfio-pci
e4da:00:02.0 'MT27710 Family [ConnectX-4 Lx Virtual Function] 1016' if=enP58586s1 drv=mlx5_core unused=vfio-pci

Enabling all logs with --log-level='.*,8' does not give any insight at the time it degrades.
We tested DPDK 19.11.11, 21.11.0 and 22.03.0 on it, using the following parameters.

<prog> -a 51e0:00:02.0 --vdev net_vdev_netvsc0,iface=eth1

On top, the test was run with 1 to 6 rx/tx queues, but it didn't seem to make a huge difference.

Good to know
If DPDK, the Mellanox driver, etc. is in this broken state, the application can be stopped and started again.
But it seems to be immediately in the broken state again. Waiting for some minutes lets the application run fine again for 5 to 15 minutes, until it breaks again. If the machine is restarted, it usually works for 1 to 6 hours again.
It is as if the network driver/interface is stuck in some weird broken state.

From the description, it's hard to know what causes the issue. To isolate the root cause, you could try the following:

  1. Check whether the firmware/driver version is the same as in AWS/Alibaba; if not, test with the same version.
  2. Check whether the kernel/OS version is the same as in AWS/Alibaba; if not, test with the same version.
  3. If possible, run a demo application (e.g. testpmd or l3fwd) and check whether it shows the same issue.
  4. Check the application and the system for memory leaks.
  5. Test with the newest version of OFED (including driver & firmware).
    The newest version can be downloaded from: Linux InfiniBand Drivers
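For steps 1-3, a possible set of commands on the Azure VM (a sketch, not a definitive recipe: the interface and PCI address are taken from the report above, and the testpmd line is an assumption modeled on the reporter's own invocation — adjust names to your VM):

```shell
# 1) Driver and firmware version of the ConnectX-4 Lx VF, to compare with AWS/Alibaba:
ethtool -i enP20960s2            # prints driver, version, firmware-version
modinfo mlx5_core | grep -i version

# 2) Kernel and OS version:
uname -r
lsb_release -a

# 3) Reproduce with a bundled demo instead of the application, e.g.:
# dpdk-testpmd -a 51e0:00:02.0 --vdev net_vdev_netvsc0,iface=eth1 -- --stats-period 1
```

If testpmd shows the same degradation, the problem is below the application (PMD, firmware, or the Azure netvsc/failsafe path) rather than in the duplicating application itself.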