How to push a full 100 Gbps port?

Hello, we have a dual AMD EPYC 7742 / 512 GB DDR4 / 12x 8 TB NVMe (Kioxia CD6 7.68 TB, NVMe PCIe 4.0 x4) / ConnectX-6 (MCX653106A-HDA_Ax) 100 Gbps server.

Ubuntu 22.04 64-bit, nginx 1.22. Content (HLS video, many small .ts chunks) is stored on a software RAID 0 for best read speed. The NIC driver is MLNX_OFED_LINUX-5.6-1.0.3.3, Ethernet version, although the openibd service (InfiniBand?) is present in the system.
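A quick way to double-check that the ports are really running in Ethernet mode and not InfiniBand (standard OFED/MFT tools; the mst device path is the one from the firmware output below, adjust if yours differs):

# map mlx5 IB devices to netdev names and show their link state
ibdev2netdev
# query the configured link type per port: IB(1) vs ETH(2)
mlxconfig -d /dev/mst/mt4123_pciconf1 query LINK_TYPE_P1 LINK_TYPE_P2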

The NIC firmware is up to date as well:
Device #1:

Device Type: ConnectX6
Part Number: MCX653106A-HDA_Ax
Description: ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE; dual-port QSFP56; PCIe4.0 x16; tall bracket; ROHS R6
PSID: MT_0000000225
PCI Device Name: /dev/mst/mt4123_pciconf1
Base MAC: 043f72a12dc6
Versions:    Current       Available
  FW         20.33.1048    N/A
  PXE        3.6.0502      N/A
  UEFI       14.26.0017    N/A

Status: No matching image found

Hyper-Threading is off, NPS (Nodes Per Socket) = 1 (2 NUMA nodes total, L3 cache as NUMA is disabled). We also tried NPS0 and NPS4 with L3 cache as NUMA enabled, but that did not improve performance.

We have many sysctl tuning options applied (BBR, fair queueing, …), txqueuelen 20000, ring buffers rx/tx = 8192.

rx-usecs: 1024 / rx-frames: 2048
tx-usecs: 3072 / tx-frames: 6144
Higher coalescing values give a better softirq% CPU load (almost 2x lower IRQ% CPU load).
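For completeness, the ring and coalescing settings above are applied with ethtool roughly like this (interface name as used later in this post; adaptive moderation has to be off for the fixed values to take effect):

# fixed interrupt coalescing
ethtool -C enp161s0f0np0 adaptive-rx off adaptive-tx off rx-usecs 1024 rx-frames 2048 tx-usecs 3072 tx-frames 6144
# ring buffers
ethtool -G enp161s0f0np0 rx 8192 tx 8192
# transmit queue length
ip link set dev enp161s0f0np0 txqueuelen 20000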

Because /sys/class/net/enp161s0f0np0/device/numa_node = 1,
we first started nginx with 64 workers pinned to the cores of NUMA node 1 (cores 64-123). With that, native_queued_spin_lock_slowpath.part.0 stays below 5% in perf top, but we can push at most ~75 Gbps (~35k connections).
Next we increased the workers to 128, which gives ~80-82 Gbps; native_queued_spin_lock_slowpath.part.0 looks good under load, but at low traffic (< 50-60 Gbps) it sits at 10-40%.
Now we run 256 workers with worker_cpu_affinity auto, reuseport and sendfile on: ~85 Gbps (we once hit 92 Gbps, but it is not stable). With more traffic the server effectively crashes: the ksoftirqd and nginx processes grow non-linearly to 100% of the CPUs, and after traffic drops they stay at 20-30% or more.
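For context, the relevant nginx directives look roughly like this (simplified sketch; worker_connections, the port and the root path are placeholders, not our exact config):

# simplified sketch of the setup described above
worker_processes 256;
worker_cpu_affinity auto;

events {
    worker_connections 65536;   # placeholder value
}

http {
    sendfile on;
    server {
        listen 80 reuseport;
        root /srv/hls;          # placeholder for the RAID 0 mount with the .ts chunks
    }
}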

Question: how do we push the 100 Gbps port at a stable full rate? For now we limit bandwidth to 80 Gbps and the server works stably. Better server hardware is hard to find, and our task is the simplest one possible: distributing files that are already on the file system. How else can performance be optimized? We thought about adding a NIC on socket 0, but in our case that is not possible: three PCIe slots hang off CPU2 (NUMA node 1), and the single CPU1 slot is half-length, so the NIC cannot be installed there…

At 80 Gbps the CPU load is roughly 1100% system, 550% user, 2200% softirq, but beyond that the IRQ% grows non-linearly. We also tested kernel TLS; it did not help us: it only decreases user%, but increases system% a lot.

Hi z3rom1nd3

I wonder if you have gone through the article “Performance Tuning for Mellanox Adapters”.
There are still several factors to consider beyond what you have already done.

Please refer to the article below.

*reference
Performance Tuning for Mellanox Adapters (nvidia.com)

Tunings (a rough shell sketch of the mlxconfig/ethtool items follows after these lists):

BIOS/iLO:

  1. HPC profile
  2. IOMMU disable
  3. SMT disable
  4. Determinism Control Manual → Performance Deterministic
  5. C state disable
  6. Preferred IO set to PCIe bus address
  7. NPS=1

Grub:

  1. iommu=pt
  2. numa_balancing=disable
  3. processor.max_cstate=0
  4. intel_pstate=disable
  5. intel_idle.max_cstate=0

mlxconfig tuning:

  1. ADVANCED_PCI_SETTINGS=1
  2. MAX_ACC_OUT_READ=32 and 46
  3. PCI_WR_ORDERING=1

Other tuning

  1. PCI MaxReadRequest 4096
  2. PCIe Max Payload 512
  3. MTU 9000
  4. Flow control disable
  5. Ring buffer 8192
  6. mlnx_tune -p HIGH_THROUGHPUT
  7. tuned-adm profile network-throughput
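Not an official procedure, but roughly how the mlxconfig and ethtool items above would be applied from the OS (device path and interface name taken from your post; the mlxconfig changes need a reboot to take effect):

# firmware configuration
mst start
mlxconfig -d /dev/mst/mt4123_pciconf1 set ADVANCED_PCI_SETTINGS=1
mlxconfig -d /dev/mst/mt4123_pciconf1 set PCI_WR_ORDERING=1
# ring buffers and flow control on the Ethernet interface
ethtool -G enp161s0f0np0 rx 8192 tx 8192
ethtool -A enp161s0f0np0 rx off tx off
# profile-based tuning
mlnx_tune -p HIGH_THROUGHPUT
tuned-adm profile network-throughput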

Hello, thank you. I had read this article before…

About the BIOS:
IOMMU is disabled, SMT disabled, C-states disabled, NPS=1.

GRUB:
iommu=pt is set; numa_balancing we handle via the kernel sysctl (kernel.numa_balancing) rather than GRUB. The other options I have applied now (sketch below).
Can intel_pstate=disable and intel_idle.max_cstate=0 be used if I have AMD EPYC?
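What ends up in /etc/default/grub on our side looks roughly like this (only the relevant part; the Intel-specific options are left out pending the question above):

# /etc/default/grub (relevant part only), then update-grub and reboot
GRUB_CMDLINE_LINUX_DEFAULT="iommu=pt processor.max_cstate=0"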

mlxconfig:
ADVANCED_PCI_SETTINGS was 0, changed it to 1.
MAX_ACC_OUT_READ is not found in the list of options.
PCI_WR_ORDERING is already 1 by default (force_relax(1)).

Other tuning:
PCI MaxReadRequest 4096, PCIe Max Payload 512, flow control disabled, ring buffers 8k. Coalescing is set high:

rx-usecs: 1024
rx-frames: 2048
tx-usecs: 2048
tx-frames: 4096

MTU is the default 1500. Can I set 9000 on a web server running nginx? Most web clients use 1500, or do I misunderstand this? The router can be changed to 9000, that is not a problem, but I am afraid it will be a problem for end clients (they do not support a 9k MTU).
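(Things that can be checked before touching the MTU; the client address below is only a documentation placeholder:)

# current interface MTU
ip link show enp161s0f0np0 | grep -o 'mtu [0-9]*'
# path MTU toward a client, reported as "pmtu" by tracepath; 203.0.113.10 is a placeholder
tracepath 203.0.113.10
# packetization-layer PMTU discovery as a fallback when ICMP is filtered
sysctl net.ipv4.tcp_mtu_probing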

For MAX_ACC_OUT_READ you will need to upgrade the MFT package.

We have now set these options except numa_balancing; it stays enabled because we use all 128 cores, and on one NUMA node we can push only 70-75 Gbps. If we disable that option, softirq% grows instantly; maybe nginx needs to be bound only to the cores of NUMA node 1.
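One more thought on the affinity side: the NIC interrupts themselves can be kept on NUMA node 1 with the helper scripts shipped with MLNX_OFED (script names as in recent OFED releases; they may differ):

# irqbalance would undo manual affinity, so stop it first
systemctl stop irqbalance
# show the current IRQ-to-CPU mapping for the interface
show_irq_affinity.sh enp161s0f0np0
# pin all of the interface's IRQs to the cores of NUMA node 1 (where the NIC sits)
set_irq_affinity_bynode.sh 1 enp161s0f0np0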

One or two days after a restart I found this in the statistics:
rx_discards_phy: 189877
tx_discards_phy: 0
rx_corrected_bits_phy: 0
rx_err_lane_0_phy: 0
rx_err_lane_1_phy: 0
rx_err_lane_2_phy: 0
rx_err_lane_3_phy: 0

Is rx_discards_phy a problem with the NIC buffers? The ring buffers are already at the maximum, rx 8192 / tx 8192.
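(A quick way to see whether this counter keeps growing and whether pause frames play a role:)

# watch the phy discard and pause counters over time
watch -n 5 "ethtool -S enp161s0f0np0 | grep -E 'discards_phy|pause'"
# current flow-control settings
ethtool -a enp161s0f0np0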
The TCP/IP buffers are also set high:
net.core.rmem_default = 2147483647
net.core.wmem_default = 2147483647
net.core.rmem_max = 2147483647
net.core.wmem_max = 2147483647
net.ipv4.tcp_rmem = 4096 87380 2147483647
net.ipv4.tcp_wmem = 4096 65536 2147483647
net.core.optmem_max = 25165824

I also noticed that often, almost every day, incoming traffic grows non-linearly once we pass 80 Gbps. Not always, but during rush hour and above 80 Gbps. Where the normal inbound at 80 Gbps is ~3M packets and ~1.5-1.6 Gbps, we sometimes see up to 4.5M+ packets and up to 2.5-3 Gbps.

I do not know how much to trust tcpdump on a 100 Gbps port, but it reports many DUP ACK packets.

UPDATE:
rx_discards_phy: 497999
This counter only increases when we have the problem… Today we pushed 89 Gbps OUT with 1.6 Gbps IN (~3M input packets); CPU was ~1000% sys, 5000% user, 2200% irq, ksoftirqd threads at 5-12%. Then we fell down to ~60 Gbps with sys ~1500-2000% and sometimes 40000%… softirq ~40000%, ksoftirqd threads ~40%+.
I cannot understand why this happens; the CPUs still show at least 85000% idle across both sockets, and NUMA node 1 is 50-60% busy. Could it be a bottleneck on the Infinity Fabric between the CPUs? How can I check for this? I found that if nginx is bound only to NUMA node 1 for a while (about 5 minutes) and CPU affinity is then switched back to both CPUs, this sometimes helps resolve the freeze.
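(Nothing here measures the Infinity Fabric directly, but the cross-node placement of nginx memory and of the NIC interrupts can at least be checked from userspace:)

# NUMA node the NIC hangs off (should print 1, as noted earlier)
cat /sys/class/net/enp161s0f0np0/device/numa_node
# per-NUMA-node memory of the nginx processes
numastat -p nginx
# how the mlx5 IRQs are spread across the CPUs
grep mlx5 /proc/interrupts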

(attached daily graphs: cpu, fw_packets, if_enp161s0f0np0, interrupts, load, nginx_status)

(attached: atop output at 85 Gbps)

The problem is on the RX side; TX shows no errors.

I looked at the ethtool private flags and found these options:

rx_cqe_moder: on
rx_striding_rq: on
rx_no_csum_complete: off

Maybe I should try toggling them?
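(Toggling them is straightforward with ethtool; whether it helps with the RX discards is the open question:)

# list all mlx5 private flags and their current state
ethtool --show-priv-flags enp161s0f0np0
# example: turn one flag off (revert with "on" if RX gets worse)
ethtool --set-priv-flags enp161s0f0np0 rx_striding_rq off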

UPDATE:
I tried this:
# Fix RX performance on mixed traffic flows
/usr/bin/mcra $pci_addr 0x815e0.0 0
/usr/bin/mcra $pci_addr 0x81640.0 0
but I cannot see any changes.

I think you should open a support case to get the optimal tuning for your environment.