How to push a full 100Gbps port?

Hello, we have Dual AMD EPYC 7742 / 512GB DDR4 / 12x 8TB NVMe (Kioxia CD6 7.68TB, NVMe PCIe4 x4) / ConnectX-6 (MCX653106A-HDA_Ax) 100Gbps.

Ubuntu 22.04 64-bit, nginx 1.22. Content (HLS video, many .ts chunks) is stored on a software RAID0 for best read speed. The NIC driver is MLNX_OFED_LINUX-5.6-1.0.3.3, Ethernet flavor, but the system also has openibd (InfiniBand?).
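(In case it helps to reproduce the setup: the port mode and driver mapping can be checked as below. This is just a sketch, assuming the MST device path from the firmware listing further down and the standard OFED tools.)

  mst start                                                               # load the MST access modules
  mlxconfig -d /dev/mst/mt4123_pciconf1 query LINK_TYPE_P1 LINK_TYPE_P2   # ETH(2) vs IB(1)
  ibdev2netdev -v                                                         # map mlx5_* devices to net interfaces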

The NIC firmware is up to date as well:
Device #1:

  Device Type:      ConnectX6
  Part Number:      MCX653106A-HDA_Ax
  Description:      ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE; dual-port QSFP56; PCIe4.0 x16; tall bracket; ROHS R6
  PSID:             MT_0000000225
  PCI Device Name:  /dev/mst/mt4123_pciconf1
  Base MAC:         043f72a12dc6
  Versions:         Current        Available
     FW             20.33.1048     N/A
     PXE            3.6.0502       N/A
     UEFI           14.26.0017     N/A

  Status:           No matching image found

Hyper-Threading is off, Nodes Per Socket = 1 (2 NUMA nodes in total, L3-cache-as-NUMA disabled). We also tried NPS0 and NPS4 with L3-cache-as-NUMA enabled, but it did not improve performance.
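(For completeness, the resulting topology can be double-checked like this; nothing box-specific here.)

  numactl --hardware            # should report 2 NUMA nodes with NPS1
  lscpu | grep -i numa          # node count and the CPU list of each node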

We have many tuning options set in sysctl (BBR, fair queueing, …), txqueuelen 20000, RX/TX ring buffers = 8192.
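Roughly, that part looks like this (a sketch, not the full sysctl set; the interface name is the one from this box):

  # congestion control + pacing qdisc
  sysctl -w net.core.default_qdisc=fq
  sysctl -w net.ipv4.tcp_congestion_control=bbr

  # transmit queue length and NIC ring buffers
  ip link set dev enp161s0f0np0 txqueuelen 20000
  ethtool -G enp161s0f0np0 rx 8192 tx 8192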

rx-usecs: 1024 / rx-frames: 2048
tx-usecs: 3072 / tx-frames: 6144
Higher coalescing values give a better softirq% CPU load (almost 2x lower IRQ% CPU load).
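Applied with ethtool roughly like this (disabling adaptive moderation first is an assumption about our exact invocation; otherwise the static values may be overridden):

  ethtool -C enp161s0f0np0 adaptive-rx off adaptive-tx off \
          rx-usecs 1024 rx-frames 2048 tx-usecs 3072 tx-frames 6144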

Because /sys/class/net/enp161s0f0np0/device/numa_node = 1, we first started nginx with 64 workers pinned to the cores of NUMA node 1 (cores 64-123). With that, native_queued_spin_lock_slowpath.part.0 stays below 5% in perf top, but we can push at most ~75Gbps (~35k connections).
Next, increasing the worker count to 128 gives ~80-82Gbps; native_queued_spin_lock_slowpath.part.0 looks fine under load, but at low traffic (< 50-60Gbps) it sits at 10-40%.
Now we run 256 workers with worker_cpu_affinity auto, reuseport and sendfile on, and get 85Gbps (we once saw a 92Gbps peak, but it is not stable). When traffic grows further, the server falls over: ksoftirqd/nginx CPU usage grows non-linearly up to 100% of the CPUs, and even after traffic drops it stays at 20-30% or more.
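For reference, the relevant part of the current nginx config is roughly the following (a sketch: the worker_connections value, listen port and root path are illustrative, not our exact settings):

  worker_processes     256;
  worker_cpu_affinity  auto;

  events {
      worker_connections  16384;       # illustrative value
  }

  http {
      sendfile    on;
      tcp_nopush  on;                  # assumption: commonly paired with sendfile

      server {
          listen  80 reuseport;        # one accept socket per worker
          root    /srv/hls;            # hypothetical path to the .ts chunks
      }
  }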

Question: how can we push the full 100Gbps port stably? For now we have limited bandwidth to 80Gbps and the server runs stably. Better server hardware is not really to be had, and our task is the simplest one possible: serving files that already sit on the file system. What else can be done to optimize performance? We thought about adding a NIC on socket 0, but in our case that is not possible: all 3 PCIe slots are attached to CPU2 (NUMA node 1), and the single PCIe slot on CPU1 is half-length, so the NIC cannot be installed there…
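(A minimal sketch of what we mean by keeping things on node 1: pinning the NIC interrupts to the node-1 cores via standard sysfs/procfs paths, run as root; OFED also ships helper scripts for the same purpose, and some kernels manage these IRQs themselves and will reject the write.)

  IFACE=enp161s0f0np0
  CORES=$(cat /sys/devices/system/node/node1/cpulist)   # cores local to the NIC
  for irq in /sys/class/net/$IFACE/device/msi_irqs/*; do
      echo "$CORES" > "/proc/irq/$(basename "$irq")/smp_affinity_list"
  done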

At 80Gbps the CPU load is about 1100% system, 550% user, 2200% softirq, but beyond that the IRQ% grows non-linearly. We tested kernel TLS; it did not help us: it only decreases user% while adding a lot of system%.
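For reproducibility, the kernel-TLS test was along these lines (a sketch: certificate paths are placeholders; nginx must be built against OpenSSL 3.x and the kernel tls module has to be loaded):

  modprobe tls                                   # kernel TLS module
  # nginx server block (inside http { }):
  #     listen               443 ssl reuseport;
  #     ssl_certificate      /etc/nginx/cert.pem;    # placeholder
  #     ssl_certificate_key  /etc/nginx/key.pem;     # placeholder
  #     ssl_conf_command     Options KTLS;           # lets sendfile go through SSL_sendfile
  cat /proc/net/tls_stat                         # TlsTxSw counter should grow when kTLS is active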