DPDK: 'mlx5_net: failed to set defaults flows' on Ubuntu 22.04

I am looking for help on an issue integrating DPDK with a Mellanox ConnectX-6 NIC in our AMD EPYC server.

Things were previously working fine, but seemingly out of nowhere, the mlx5 driver started logging this error during DPDK initialization in my application:
mlx5_net: port 0 failed to set defaults flows
This occurs when my application calls rte_eth_dev_start(); the application then exits with error code 22.

When I run DPDK's testpmd application, I also see a flow-related log message. It does not appear to be an error, but since testpmd is not code that I wrote and the message mentions flow configuration, I thought it could be pertinent:
testpmd: Flow tunnel offload support might be limited or unavailable on port 0.

I have installed DPDK on two servers. Server 2 has minor configuration differences, but both machines exhibit exactly the same error behaviour.

Server 1:
CPU: AMD EPYC 9354
OS: Ubuntu 22.04.4 LTS
Kernel: 5.15.0-116-generic
NIC: MCX653105A-HDA_Ax
NIC FW: 20.43.2566
Lspci: 01:00.0 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
DPDK: Arkville 24.11.1 (leverages several kmods to integrate the Arkville IP on FPGA cards with DPDK)
Dpdk-devbind: 0000:01:00.0 'MT28908 Family [ConnectX-6] 101b' numa_node=0 if=enp1s0np0 drv=mlx5_core unused=vfio-pci

Server 2 differences:
OS: Ubuntu 22.04.5 LTS
Kernel: 5.15.0-142-generic
DPDK: Arkville 25.03

Things I have tried:

  • Cold booting the system
  • Unbinding/rebinding the mlx5_core driver
  • Updating/reverting NIC FW:
    ○ 20.43.2566 is the newest LTS version. I also tried reverting to 20.35.2000, which is the oldest LTS version
    ○ Ran verification with flint -d … verify; success.
    ○ Updated NVIDIA's MFT tools from v4.30.1-113 to v4.32.0-120
    ○ I realized today that I should have installed the DOCA-Networking package rather than just MFT, so I uninstalled MFT and installed the latest DOCA-Networking (v3.0.0-058000-25)
  • Uninstalling/reinstalling DPDK
  • Removing all parameters from the rte_eth_conf struct used in the rte_eth_dev_configure() call to simplify the configuration.
  • Due to the kmods required by Arkville, I froze the kernel after installing DPDK using: sudo apt-mark hold $(uname -r)
    ○ I am fairly certain this Arkville DPDK release is not the culprit since I previously had the NIC working with it.

And a few notes:

  • I was able to dump all the flow rules set on our NIC to a file. I am not sure how to interpret them, or whether and how they can be disabled, but this appears to be the current configuration. FlowRules.txt (1.7 KB)
  • Server 1 does NOT connect to the internet (I installed everything manually). This is the server I initially had things working on; I have not yet seen my application run on Server 2. An automatic update should therefore not be the root cause.
  • The only mention of this error I could find online was from someone on Windows, but I am not familiar enough with operating systems to know whether their fix is something I could translate to Ubuntu.

Really hoping someone has encountered this or can steer me in the right direction as I am fresh out of ideas. Thanks in advance!

Hi,

Thanks for your question.
To find the root cause, we will need to look at the setup and better understand all the components, the application configuration, and the testpmd command line, and then collect and analyze logs.
This will require a support case to be opened in the NVIDIA portal; the case will then be handled according to the entitlement.

Best Regards,
Anatoly

Thanks for the reply. Could you be more specific where I can find this portal please?

Hi, sure,
Please take a look at the link below:

Best Regards,
Anatoly