I am looking for help on an issue integrating DPDK with a Mellanox ConnectX-6 NIC in our AMD EPYC server.
Everything was previously working fine, but seemingly out of nowhere the following error (logged under the mlx5_net prefix) started appearing during DPDK’s initialization in my application:
mlx5_net: port 0 failed to set defaults flows
This occurs when my application calls rte_eth_dev_start(); the application then exits with error code 22.
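For anyone reading along, here is a minimal sketch for turning that 22 into something readable. It assumes the 22 is an errno value, either returned negated by rte_eth_dev_start() (DPDK ethdev calls generally report failures as negative errno values) or passed through as the process exit status. The helper names errno_from_code and describe_code are hypothetical, not part of DPDK, and the code uses only the standard C library.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Fold a code into a positive errno value. Assumption: the 22 reported above
 * is an errno, either returned negated by rte_eth_dev_start() or used
 * directly as the process exit status. */
int errno_from_code(int code)
{
    return abs(code);
}

/* Human-readable description of the code, taken from the C library. */
const char *describe_code(int code)
{
    return strerror(errno_from_code(code));
}
```

On Linux, 22 is EINVAL, so describe_code(-22) yields "Invalid argument" with glibc. If that assumption holds, the driver is rejecting something it was asked to configure rather than reporting a missing device.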
Furthermore, when I run DPDK’s testpmd application, I also see a flow-related log. It does not appear to be an error, but since I did not write testpmd and the message mentions flow configuration, I thought it could be pertinent:
testpmd: Flow tunnel offload support might be limited or unavailable on port 0.
I have installed DPDK on two servers. Server 2 has minor configuration differences, but both machines exhibit exactly the same error behaviour.
Server 1:
CPU: AMD EPYC 9354
OS: Ubuntu 22.04.4 LTS
Kernel: 5.15.0-116-generic
NIC: MCX653105A-HDA_Ax
NIC FW: 20.43.2566
lspci: 01:00.0 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
DPDK: Arkville 24.11.1 (leverages several kmods to integrate the Arkville IP on FPGA cards with DPDK)
dpdk-devbind: 0000:01:00.0 'MT28908 Family [ConnectX-6] 101b' numa_node=0 if=enp1s0np0 drv=mlx5_core unused=vfio-pci
Server 2 differences:
OS: Ubuntu 22.04.5 LTS
Kernel: 5.15.0-142-generic
DPDK: Arkville 25.03
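As a side note on the dpdk-devbind line for Server 1: the mlx5 PMD is a bifurcated driver that works alongside the kernel driver, so drv=mlx5_core with vfio-pci listed as unused is the expected binding. Below is a rough sketch of how the driver fields could be pulled out of such a status line to compare the two servers programmatically. The function name devbind_field is hypothetical, and the code assumes the fields are space-separated key=value pairs exactly as in the line quoted above.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy the value of `key` (e.g. "drv") from a dpdk-devbind status line into
 * `out`. Assumes space-separated key=value fields. Returns 0 on success,
 * -1 if the key is empty or absent, or if the value does not fit in `out`.
 */
int devbind_field(const char *line, const char *key,
                  char *out, size_t out_len)
{
    size_t key_len = strlen(key);
    const char *p = line;

    if (key_len == 0)
        return -1;

    while ((p = strstr(p, key)) != NULL) {
        /* Accept only a whole key: at line start or after a space,
         * and immediately followed by '='. */
        int at_boundary = (p == line) || (p[-1] == ' ');
        if (at_boundary && p[key_len] == '=') {
            const char *val = p + key_len + 1;
            size_t val_len = strcspn(val, " \n");
            if (val_len >= out_len)
                return -1;
            memcpy(out, val, val_len);
            out[val_len] = '\0';
            return 0;
        }
        p += key_len; /* skip this partial match and keep searching */
    }
    return -1;
}
```

Calling devbind_field(line, "drv", buf, sizeof buf) on the Server 1 line fills buf with mlx5_core. A value of vfio-pci there would point at a binding problem instead.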
Things I have tried:
- Cold booting the system
- Unbinding/rebinding the mlx5_core driver
- Updating/reverting NIC FW:
○ 20.43.2566 is the newest LTS version. I also tried reverting to 20.35.2000, which is the oldest LTS version
○ Ran verification with flint -d … verify; success.
○ Updated NVIDIA’s MFT tools from v4.30.1-113 to v4.32.0-120
○ I realized today that I should have installed the DOCA-Networking package rather than just MFT, so I uninstalled MFT and installed the latest DOCA-Networking (v3.0.0-058000-25)
- Uninstalling/reinstalling DPDK
- Removing all parameters from the rte_eth_conf struct used in the rte_eth_dev_configure() call to simplify the configuration.
- Due to the kmods required by Arkville, I froze the kernel after installing DPDK using: sudo apt-mark hold $(uname -r)
○ I am fairly certain this Arkville DPDK release is not the culprit since I previously had the NIC working with it.
And a few notes:
- I was able to dump all the flow rules set on our NIC to a file (attached: FlowRules.txt). I am not sure how to interpret them, or whether and how they can be disabled, but this appears to be the current configuration.
- Server 1 does NOT connect to the internet (I installed everything manually). This is the server I initially had things working on; I have not yet seen my application run on Server 2. This should mean there is zero chance that the root cause is an update.
- The only mention of this error I could find online was from someone on Windows, but I am not familiar enough with operating systems to know whether their fix is something I could translate over to Ubuntu.
Really hoping someone has encountered this or can steer me in the right direction as I am fresh out of ideas. Thanks in advance!