Hi!
I have a test server with a Mellanox ConnectX-5 (MT27800 Family, 100G, dual-port QSFP28) network card installed. The server runs the Proxmox 8.4 hypervisor. For testing purposes, I created two Ubuntu 22.04 virtual machines.
I assigned one Mellanox port to each virtual machine using PCI Passthrough.
The two Mellanox ports are connected via an AOC 100G QSFP28 3m cable.
I’m trying to test the 100G channel bandwidth.
TCP Test:
On VM #1, I launch 8 instances of iperf3 in server mode: iperf3 -s -p 3001 -D -4 ; ... ; iperf3 -s -p 3008 -D -4
On VM #2, I launch 8 instances of iperf3 in client mode: iperf3 -c <vm1-ip> -Z -p 3001 ; ... ; iperf3 -c <vm1-ip> -Z -p 3008
During the test, VM #1 shows fairly even CPU core utilization.
The total bandwidth across all 8 streams is around 91 Gb/s, which is acceptable for now.
UDP Test:
On VM #2, I launch 8 instances of iperf3 in client mode with UDP: iperf3 -c <vm1-ip> -Z -p 3001 -u -b 15G -l 3500 ; ... ; iperf3 -c <vm1-ip> -Z -p 3008 -u -b 15G -l 3500
The total bandwidth across all 8 streams is only 24 Gb/s.
During the test, VM #1 shows 100% load on a single CPU core, while the others remain idle. A single ksoftirqd process is causing this single-core bottleneck.
I understand that ksoftirqd processes the softirqs raised by the NIC's interrupts and that its per-core instances (one per CPU core) should distribute the load. In my case, however, the interrupts are not being processed in parallel.
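For reference, this is how I check the interrupt and softirq distribution inside VM #1 during the test (enp1s0 is just an example interface name; it differs per setup):
grep mlx5 /proc/interrupts
grep NET_RX /proc/softirqs
ethtool -l enp1s0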
Thank you for posting your query on NVIDIA Community!
In general, we recommend using iperf2 instead of iperf3, as iperf3 lacks multi-threading capability. This means that to validate optimal results, you should use "taskset" to pin the iperf processes to the CPUs closest to the NUMA node the card is associated with.
Example:
First, run mst status -v to map the interface used in the iperf test to the NUMA node the card is associated with.
Next, run #lscpu to list the CPUs that belong to the NUMA node found in the previous step.
Run iperf as follows:
On Server side: #taskset -c <list of CPUs from the lscpu output> iperf -s
On Client side: #taskset -c <list of CPUs from the lscpu output> iperf -c <vm1-ip> -P <number of threads, equal to the number of CPUs used in taskset, so each thread is handled by a single CPU>
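For illustration, suppose mst status -v reports NUMA node 0 for the port and lscpu shows "NUMA node0 CPU(s): 0-15"; the test could then be pinned like this (the CPU list, thread count, and <vm1-ip> are placeholders, adjust them to your output):
On Server side: #taskset -c 0-7 iperf -s
On Client side: #taskset -c 0-7 iperf -c <vm1-ip> -P 8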
With regard to the UDP testing you mentioned, we cannot assure 0 packet loss for UDP traffic, because UDP is a connectionless protocol that cannot guarantee you will receive whatever you send:
There is no mechanism for retransmission of dropped packets
There is no real ordering in the kernel socket buffer
There are implementation details in the OS that are outside of the driver's behavior
No offloads for kernel UDP traffic (packets are not segmented/reassembled the way TCP traffic is, so each datagram is processed individually)
There are general guidelines and tuning elements that can improve UDP behavior and bandwidth:
1. Ring buffers should be set to the highest possible value - ~8k (ethtool -G <interface> rx 8192 tx 8192)
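For example, assuming the interface inside the VM is named enp1s0 (adjust to your interface), you can check the supported maximums first and then raise the rings:
ethtool -g enp1s0
ethtool -G enp1s0 rx 8192 tx 8192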
2. OS socket buffers should be set higher than the default values:
sysctl net.core.rmem_max
sysctl -w net.core.rmem_max=33554432
sysctl net.core.rmem_default
sysctl -w net.core.rmem_default=33554432
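To make these values persist across reboots, they can for example be placed in a sysctl configuration file (the file name below is just an illustration):
echo "net.core.rmem_default=33554432" >> /etc/sysctl.d/99-net-tuning.conf
echo "net.core.rmem_max=33554432" >> /etc/sysctl.d/99-net-tuning.conf
sysctl --system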
3. Use the latest MLNX_OFED driver and firmware version if possible
4. Disable C-states in the BIOS and the kernel.
To disable it in the kernel:
Add intel_idle.max_cstate=0 processor.max_cstate=1 idle=poll into grub.conf.
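On Debian/Ubuntu-based systems such as the guests in this setup, the same parameters go on the kernel command line in /etc/default/grub, for example (keep whatever options are already on the line):
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_idle.max_cstate=0 processor.max_cstate=1 idle=poll"
update-grub
Reboot afterwards for the change to take effect.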
5. Run the application on the cores that belong to the NUMA node the adapter sits on
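One way to find that NUMA node from the OS, assuming the interface is enp1s0, is to read it from sysfs (a value of -1 means the topology does not expose a node, which is common inside VMs):
cat /sys/class/net/enp1s0/device/numa_node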
6. In case flow control is not used, disable it using ethtool to avoid unnecessary pause frames
7. Use the IRQ affinity script from the Mellanox driver, such as set_irq_affinity.sh (after disabling the irqbalance service)
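A sketch of points 6 and 7, assuming the interface is eth0 as in the later examples:
ethtool -A eth0 rx off tx off
systemctl stop irqbalance
set_irq_affinity.sh eth0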
8. If only 1 UDP port is used, it is better to decrease the number of queues to 1:
ethtool -L eth0 rx 1
Afterwards, check which core the queue is bound to by running "show_irq_affinity.sh eth0", then try binding it to different cores that belong to the relevant NUMA node and see where the performance is best.
The core is changed by writing a CPU mask, for example: "echo 1 > /proc/irq/109/smp_affinity"
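Note that smp_affinity takes a hexadecimal CPU mask, while smp_affinity_list takes plain core numbers. For example, to move the queue's IRQ (109 in the example above; yours will differ) to core 2:
echo 4 > /proc/irq/109/smp_affinity
or equivalently
echo 2 > /proc/irq/109/smp_affinity_list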
9. If more than 1 UDP port is used, it is better to change the RSS hash policy from Toeplitz to XOR for better distribution:
/opt/mellanox/ethtool/sbin/ethtool -X eth0 hfunc xor
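Recent ethtool versions print the active hash function together with the indirection table, so the change can be verified with:
/opt/mellanox/ethtool/sbin/ethtool -x eth0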
10. Use jumbo frames if it is an option
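For example, assuming eth0 on both VMs (the MTU must match on both ends of the link):
ip link set dev eth0 mtu 9000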
11. Disable Hyper-Threading
12. If an AMD CPU is in use, ensure GRUB is updated with "iommu=pt" and the firmware parameter "PCI_WR_ORDERING" is set to 1, followed by a reboot to take effect (mlxconfig -d <mst device> set PCI_WR_ORDERING=1)
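A sketch of those two steps, assuming the MST device is /dev/mst/mt4119_pciconf0 (check the real name with mst status after mst start):
mst start
mlxconfig -d /dev/mst/mt4119_pciconf0 query | grep PCI_WR_ORDERING
mlxconfig -d /dev/mst/mt4119_pciconf0 set PCI_WR_ORDERING=1
Then append iommu=pt to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, run update-grub, and reboot.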