Hi!
I have a test server with a Mellanox ConnectX-5 (MT27800 Family, 100G, dual-port QSFP28) network card installed. The server runs the Proxmox 8.4 hypervisor. For testing purposes, I created two Ubuntu 22.04 virtual machines.
I assigned one Mellanox port to each virtual machine using PCI Passthrough.
The two Mellanox ports are connected via an AOC 100G QSFP28 3m cable.
I’m trying to test the 100G channel bandwidth.
TCP Test:
On VM #1, I launch 8 instances of iperf3 in server mode: iperf3 -s -p 3001 -D -4 ; ... ; iperf3 -s -p 3008 -D -4
On VM #2, I launch 8 instances of iperf3 in client mode: iperf3 -c <vm1-ip> -Z -p 3001 ; ... ; iperf3 -c <vm1-ip> -Z -p 3008
During the test, VM #1 shows fairly even CPU core utilization.
The total bandwidth across all 8 streams is around 91 Gb/s, which is acceptable for now.
UDP Test:
On VM #2, I launch 8 instances of iperf3 in client mode with UDP: iperf3 -c <vm1-ip> -Z -p 3001 -u -b 15G -l 3500 ; ... ; iperf3 -c <vm1-ip> -Z -p 3008 -u -b 15G -l 3500
The total bandwidth across all 8 streams is only 24 Gb/s.
During the test, VM #1 shows 100% load on a single CPU core, while the others remain idle. A single ksoftirqd process is causing this single-core bottleneck.
I understand that ksoftirqd processes the softirqs raised by the NIC's interrupts and that its per-core instances (one per CPU core) should distribute the load. In my case, however, the interrupts are not being processed in parallel.
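For reference, this is how I check the interrupt and softirq distribution inside VM #1 during the test (enp1s0 is just an example interface name; it differs per setup):
grep mlx5 /proc/interrupts
grep NET_RX /proc/softirqs
ethtool -l enp1s0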
Thank you for posting your query on NVIDIA Community!
In general, we recommend using iperf2 instead of iperf3, as iperf3 lacks multi-threading capability. This means that to validate optimal results, you should use "taskset" to pin the iperf processes to the CPUs closest to the NUMA node the card is associated with.
Example:
First, run mst status -v to map the interface used in the iperf test to the NUMA node the card is associated with.
Next, run #lscpu to list the CPUs that belong to the NUMA node found in the previous step.
Run iperf as follows:
On Server side: #taskset -c <list of CPUs from the lscpu output> iperf -s
On Client side: #taskset -c <list of CPUs from the lscpu output> iperf -c <vm1-ip> -P <number of threads, equal to the number of CPUs used in taskset, so each thread is handled by a single CPU>
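For illustration, suppose mst status -v reports NUMA node 0 for the port and lscpu shows "NUMA node0 CPU(s): 0-15"; the test could then be pinned like this (the CPU list, thread count, and <vm1-ip> are placeholders, adjust them to your output):
On Server side: #taskset -c 0-7 iperf -s
On Client side: #taskset -c 0-7 iperf -c <vm1-ip> -P 8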
With regard to the UDP testing you mentioned, we cannot assure 0 packet loss for UDP traffic, because UDP is a connectionless protocol that cannot guarantee you will receive whatever you send:
There is no mechanism for retransmission of dropped packets
There is no real ordering in the kernel socket buffer
There are implementation details in the OS that are outside of the driver's behavior
No offloads for kernel UDP traffic (packets are not segmented/reassembled the way TCP traffic is, so each datagram is processed individually)
There are general guidelines and tuning elements that can improve UDP behavior and bandwidth:
1. Ring buffers should be set to the highest possible value - ~8k (ethtool -G <interface> rx 8192 tx 8192)
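For example, assuming the interface inside the VM is named enp1s0 (adjust to your interface), you can check the supported maximums first and then raise the rings:
ethtool -g enp1s0
ethtool -G enp1s0 rx 8192 tx 8192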
2. OS socket buffers should be set higher than the default values:
sysctl net.core.rmem_max
sysctl -w net.core.rmem_max=33554432
sysctl net.core.rmem_default
sysctl -w net.core.rmem_default=33554432
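To make these values persist across reboots, they can for example be placed in a sysctl configuration file (the file name below is just an illustration):
echo "net.core.rmem_default=33554432" >> /etc/sysctl.d/99-net-tuning.conf
echo "net.core.rmem_max=33554432" >> /etc/sysctl.d/99-net-tuning.conf
sysctl --system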
3. Use the latest MLNX_OFED driver and firmware version if possible
4. Disable C-states in the BIOS and the kernel.
To disable it in the kernel:
Add intel_idle.max_cstate=0 processor.max_cstate=1 idle=poll into grub.conf.
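On Debian/Ubuntu-based systems such as the guests in this setup, the same parameters go on the kernel command line in /etc/default/grub, for example (keep whatever options are already on the line):
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_idle.max_cstate=0 processor.max_cstate=1 idle=poll"
update-grub
Reboot afterwards for the change to take effect.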
5. Run the application on the cores that belong to the NUMA node the adapter sits on
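One way to find that NUMA node from the OS, assuming the interface is enp1s0, is to read it from sysfs (a value of -1 means the topology does not expose a node, which is common inside VMs):
cat /sys/class/net/enp1s0/device/numa_node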
6. In case flow control is not used, disable it using ethtool to avoid unnecessary pause frames
7. Use the IRQ affinity script from the Mellanox driver, such as set_irq_affinity.sh (after disabling the irqbalance service)
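A sketch of points 6 and 7, assuming the interface is eth0 as in the later examples:
ethtool -A eth0 rx off tx off
systemctl stop irqbalance
set_irq_affinity.sh eth0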
8. If only 1 UDP port is used, it is better to decrease the number of queues to 1:
ethtool -L eth0 rx 1
Afterwards, check which core the queue is bound to by running "show_irq_affinity.sh eth0", then try binding it to different cores that belong to the relevant NUMA node and see where the performance is best.
The core is changed by writing a CPU mask, for example: "echo 1 > /proc/irq/109/smp_affinity"
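Note that smp_affinity takes a hexadecimal CPU mask, while smp_affinity_list takes plain core numbers. For example, to move the queue's IRQ (109 in the example above; yours will differ) to core 2:
echo 4 > /proc/irq/109/smp_affinity
or equivalently
echo 2 > /proc/irq/109/smp_affinity_list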
9. If more than 1 UDP port is used, it is better to change the RSS hash policy from Toeplitz to XOR for better distribution:
/opt/mellanox/ethtool/sbin/ethtool -X eth0 hfunc xor
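Recent ethtool versions print the active hash function together with the indirection table, so the change can be verified with:
/opt/mellanox/ethtool/sbin/ethtool -x eth0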
10. Use jumbo frames if it is an option
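For example, assuming eth0 on both VMs (the MTU must match on both ends of the link):
ip link set dev eth0 mtu 9000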
11. Disable Hyper-Threading
12. If an AMD CPU is in use, ensure GRUB is updated with "iommu=pt" and the firmware parameter "PCI_WR_ORDERING" is set to 1, followed by a reboot to take effect (mlxconfig -d <mst device> set PCI_WR_ORDERING=1)
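A sketch of those two steps, assuming the MST device is /dev/mst/mt4119_pciconf0 (check the real name with mst status after mst start):
mst start
mlxconfig -d /dev/mst/mt4119_pciconf0 query | grep PCI_WR_ORDERING
mlxconfig -d /dev/mst/mt4119_pciconf0 set PCI_WR_ORDERING=1
Then append iommu=pt to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, run update-grub, and reboot.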