Issue:
I have a VM in Microsoft Azure. On it we run a DPDK application that reads traffic and duplicates it to a number of hosts. If the server has been idle for a longer time and we then start the application, it runs fine for 1 to 5 hours: it receives an average of about 25'000 packets per second and sends out about 42'000 packets per second.
After that we see a sudden increase in the time spent in the rte_eth_rx_burst call: the average jumps from about 300 ns to about 150 us. At the same time we start losing/not receiving a lot of packets, while the interface shows no increase in imissed, ierrors or rx_nombuf.
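For context, the rx_burst timing can be sampled with the TSC roughly as in the minimal sketch below (burst size and output are illustrative, not our exact instrumentation):

#include <stdint.h>
#include <stdio.h>
#include <rte_cycles.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* Sample one rte_eth_rx_burst call and report its duration. */
static void
sample_rx_latency(uint16_t port_id, uint16_t queue_id)
{
    struct rte_mbuf *bufs[BURST_SIZE];
    uint64_t start = rte_rdtsc();
    uint16_t nb_rx = rte_eth_rx_burst(port_id, queue_id, bufs, BURST_SIZE);
    uint64_t cycles = rte_rdtsc() - start;

    /* Convert TSC cycles to nanoseconds; ~300 ns is the healthy value,
     * ~150 us the degraded one described above. */
    double ns = (double)cycles * 1e9 / (double)rte_get_tsc_hz();
    printf("rx_burst: %u pkts in %.0f ns\n", nb_rx, ns);

    for (uint16_t i = 0; i < nb_rx; i++)
        rte_pktmbuf_free(bufs[i]);
}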
RX Port Information
port  driver        packets    Mbytes      missed  errors  mbuf fails
1     net_failsafe  398669949  47891.6982  7151    0       0
0     mlx5_pci      398657879  47890.0013  7151    0       0
2     net_tap       12209      1.7128      0       0       0
TX Port Information
port  driver        packets    Mbytes      errors
1     net_failsafe  265721410  31610.0295  0
0     mlx5_pci      265721410  31610.0295  0
2     net_tap       0          0.0000      0
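The columns above map directly onto DPDK's standard per-port counters (packets -> ipackets/opackets, Mbytes -> ibytes/obytes, missed -> imissed, errors -> ierrors, mbuf fails -> rx_nombuf); a minimal sketch of reading them, with illustrative output formatting:

#include <inttypes.h>
#include <stdio.h>
#include <rte_ethdev.h>

/* Print the RX counters shown in the table above for one port. */
static void
print_rx_stats(uint16_t port_id)
{
    struct rte_eth_stats stats;

    if (rte_eth_stats_get(port_id, &stats) != 0)
        return;

    /* imissed, ierrors and rx_nombuf stay flat even while packets
     * are visibly being lost. */
    printf("port %u: %" PRIu64 " pkts, %.4f MB, missed %" PRIu64
           ", errors %" PRIu64 ", mbuf fails %" PRIu64 "\n",
           port_id, stats.ipackets, stats.ibytes / 1e6,
           stats.imissed, stats.ierrors, stats.rx_nombuf);
}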
Note that after this degradation, the lost packet count should be far above 1 million; see the time graph below. The different lines are just different streams in this product set. Occasional packet drops are expected, as it receives UDP from around the globe. (Y-axis: lost packets, log scale)
The same application, with the same input sources, runs fine for weeks on Amazon AWS and Alibaba Cloud.
Environment:
The environment is a 16-core server with two accelerated network interfaces, running Ubuntu 18.04.
Network devices using kernel driver
===================================
51e0:00:02.0 'MT27710 Family [ConnectX-4 Lx Virtual Function] 1016' if=enP20960s2 drv=mlx5_core unused=vfio-pci
e4da:00:02.0 'MT27710 Family [ConnectX-4 Lx Virtual Function] 1016' if=enP58586s1 drv=mlx5_core unused=vfio-pci
Also, activating all logs with --log-level='.*,8' gives no insight around the time it degrades.
We tested DPDK 19.11.11, 21.11.0 and 22.03.0 on it, using the following parameters:
<prog> -a 51e0:00:02.0 --vdev net_vdev_netvsc0,iface=eth1
On top of that, the test was run with 1 to 6 rx/tx queues, but it did not seem to make a huge difference; a sketch of the queue setup follows below.
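For reference, a minimal sketch of the kind of port/queue setup that was varied (descriptor counts, mempool and the default port config are assumptions, not our exact code):

#include <rte_ethdev.h>
#include <rte_mempool.h>

/* Configure a port with nb_queues rx and tx queues each, then start it. */
static int
setup_port(uint16_t port_id, uint16_t nb_queues, struct rte_mempool *pool)
{
    struct rte_eth_conf port_conf = { 0 };
    int ret;

    ret = rte_eth_dev_configure(port_id, nb_queues, nb_queues, &port_conf);
    if (ret != 0)
        return ret;

    for (uint16_t q = 0; q < nb_queues; q++) {
        ret = rte_eth_rx_queue_setup(port_id, q, 1024,
                                     rte_eth_dev_socket_id(port_id),
                                     NULL, pool);
        if (ret < 0)
            return ret;
        ret = rte_eth_tx_queue_setup(port_id, q, 1024,
                                     rte_eth_dev_socket_id(port_id),
                                     NULL);
        if (ret < 0)
            return ret;
    }
    return rte_eth_dev_start(port_id);
}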
Good to know
If dpdk/mellanox/… is in this broken state, the application can be stopped and started again, but it seems to end up in the broken state again immediately. Waiting for some minutes lets the application run fine again for 5 to 15 minutes until it breaks again. If the machine is restarted, it usually works for 1 to 6 hours again. It is as if the network driver/interface were stuck in some weird broken state.
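A hypothetical watchdog along the following lines could try a port stop/start cycle instead of a full application restart; the latency threshold is an assumption, and we have not verified that stopping/starting the port actually clears the state:

#include <stdint.h>
#include <rte_ethdev.h>

#define DEGRADED_NS 100000 /* ~100 us, far above the healthy ~300 ns */

/* If the sampled rx_burst latency indicates the broken state,
 * restart the port. Note: rte_eth_dev_stop() returns void before
 * DPDK 20.11 and int afterwards, so the result is ignored here. */
static void
maybe_restart_port(uint16_t port_id, uint64_t avg_rx_burst_ns)
{
    if (avg_rx_burst_ns < DEGRADED_NS)
        return;

    rte_eth_dev_stop(port_id);
    rte_eth_dev_start(port_id);
}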