High ksoftirqd load on Mellanox ConnectX-6 DX on Linux

I’m currently trying to get a 100G Mellanox ConnectX-6 Dx working with Suricata on Debian Bookworm. When I send around 70 Gbit/s of traffic via Cisco T-Rex from another machine, 63 ksoftirqd threads spike to 100% and stay there as soon as I bring the link up via ip link set eth5 up.
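
For anyone who wants to reproduce the observation: the per-CPU soft-IRQ load can be watched with standard tools, e.g.:

# per-CPU utilisation, the %soft column is the interesting one (sysstat package)
mpstat -P ALL 1
# per-thread view of the ksoftirqd kernel threads themselves (sysstat package)
pidstat -C ksoftirqd 1
# raw soft-IRQ counters; NET_RX should be the row climbing under receive load
watch -n 1 'grep -E "CPU|NET_RX" /proc/softirqs'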

I also ran perf top on those threads. At first __nf_conntrack_alloc showed a big overhead, so I disabled iptables and conntrack (which I don’t need for this test). Now most of the overhead (65%) is in native_queued_spin_lock_slowpath, and I’m struggling to understand why the softirq load is so high.
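
For reference, taking conntrack out of the picture and profiling the affected cores went roughly along these lines (this is a dedicated test box with no rules I need, so flushing everything is fine here; exact commands may differ on other setups):

# drop all netfilter rules so nothing references conntrack any more (test box only!)
nft flush ruleset
# unload conntrack so __nf_conntrack_alloc cannot show up at all
# (dependent modules such as nf_nat have to be removed first if they are loaded)
modprobe -r nf_conntrack
# then profile only the cores the NIC queues are pinned to
perf top -g -C 128-189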

Some basics on the system for reference:

CPU (disabled HT)

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         52 bits physical, 57 bits virtual
  Byte Order:            Little Endian
CPU(s):                  256
  On-line CPU(s) list:   0-255
Vendor ID:               AuthenticAMD
  BIOS Vendor ID:        AMD
  Model name:            AMD EPYC 9754 128-Core Processor
    BIOS Model name:     AMD EPYC 9754 128-Core Processor                 CPU @ 2.2GHz
    BIOS CPU family:     107
    CPU family:          25
    Model:               160
    Thread(s) per core:  1
    Core(s) per socket:  128
    Socket(s):           2
NUMA:
  NUMA node(s):          2
  NUMA node0 CPU(s):     0-127
  NUMA node1 CPU(s):     128-255

NIC

e1:00.0 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
e1:00.1 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
ethtool -i eth5
driver: mlx5_core
version: 6.1.0-17-amd64
firmware-version: 22.36.1010 (DEL0000000027)
expansion-rom-version:
bus-info: 0000:e1:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes

Kernel and boot parameters

BOOT_IMAGE=/boot/vmlinuz-6.1.0-17-amd64 root=/dev/mapper/root-root ro iommu=pt processor.max_cstate=0 numa_balancing=disable intel_pstate=disable intel_idle.max_cstate=0 quiet splash

I tried disabling all the offloads (ethtool commands further down) and I also ran the affinity script:

set_irq_affinity_cpulist.sh 128-189 eth5

The cores match the 63 queues the NIC provides on that interface/port. (Side note: why the heck does it have 63 queues and not a power of two? :p)
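
For reference, checking the channel count and switching the offloads off went roughly like this (the offload list is probably not exhaustive):

# show how many combined channels (queues) the port exposes and currently uses
ethtool -l eth5
# (the count could also be pinned to a power of two with: ethtool -L eth5 combined 32)
# turn off the usual offloads for the IDS test
ethtool -K eth5 gro off lro off tso off gso off rxvlan off txvlan off
# double-check what is actually active afterwards
ethtool -k eth5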

I can see the traffic being received and all that, but I’m wondering why ksoftirqd peaks at 100% all the time. 70 Gbit/s spread over 63 queues is only about 1.1 Gbit/s per queue/core (roughly 90k packets/s per core if T-Rex sends ~1500-byte frames, far more with small packets). Is that already too much for that NIC? If so, how is it supposed to handle 100G with the 63 queues otherwise?

Thanks

Hi @norg_dev,

Thank you for posting your query on our community.

The ‘ksoftirqd’ process is a kernel thread allocated per CPU to handle heavy soft-interrupt load. It is not wasting your CPU; rather, it helps process your IRQ load more efficiently.
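
For example, you can see that there is exactly one such thread per CPU and which CPU each one is bound to:

# one ksoftirqd/<N> thread per CPU; the PSR column shows the CPU it runs on
ps -eo pid,psr,comm | grep ksoftirqd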

To further tune your server, we recommend following this article - ESPCommunity
Also, ensure that the RAM slots on the system board are fully populated; we have seen a performance improvement with AMD processors when all DIMMs are populated.

Lastly, I notice that your adapter is showing a Dell PSID. I recommend reaching out to Dell support for further assistance.

Thanks,
Bhargavi

Hi,

thanks for your reply!

I already read more about ksoftirqd and also saw the post you linked. What puzzles me is that the usage is this high with the Mellanox card, while the same traffic forwarded to the same box but received on an Intel E810 (100G) NIC doesn’t show this behaviour. I can also see a clear difference when I run perf top -g on the system, so I would narrow it down to some difference in how the traffic is handled by the Mellanox NIC (and/or its driver/firmware).
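
The comparison is essentially this, recorded on whichever cores the respective NIC’s IRQs are pinned to (the E810 core range below is just a placeholder):

# 10s profile of the cores handling the Mellanox queues ...
perf record -g -C 128-189 -o perf.mlx5.data -- sleep 10
# ... and the same for the cores handling the E810
perf record -g -C <e810-cores> -o perf.e810.data -- sleep 10
# then compare the two profiles
perf diff perf.e810.data perf.mlx5.data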

So I’m wondering whether the Mellanox driver or even the firmware needs some specific setting that differs from the Intel one.
The affinity script might also behave differently.
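
If it is one of the driver-exposed knobs, I assume it would show up under the private flags or the ring/coalescing settings, e.g.:

# mlx5-specific toggles exposed by the driver
ethtool --show-priv-flags eth5
# ring sizes and interrupt coalescing settings
ethtool -g eth5
ethtool -c eth5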

Yes, it’s in a Dell system; I already got the latest firmware from Dell to make sure I’m running the latest release.

I also installed the OFED drivers. Is there any way to tell Linux to use those instead of the built-in kernel one?
Or should another interface name appear instead of eth5?
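
In case it matters for an answer, this is how I would check which module is actually in use (my assumption being that an OFED build installs outside the in-tree kernel/ module path and reports its own version string):

# which mlx5_core module does the kernel load, and what version does it report?
modinfo mlx5_core | grep -E '^(filename|version)'
# the in-tree driver reports the kernel version here (as in the ethtool output above)
ethtool -i eth5
# prints the installed MLNX_OFED release, if the OFED user-space tools are present
ofed_info -s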