ConnectX6 (mlx5 kernel driver) strange behavior?

Hi

I’ve been testing the speed of my 100g setup with iperf3 and I have an unexplained ‘issue’.

What I’m doing:

  • mellanox drivers (mlnx-en-5.6-2.0.9.0-ubuntu22.04-x86_64)
  • multiple parallel client/server processes with iperf3
  • numa pinning
  • increased various (tcp) memory buffers
  • cpu governor set to performance

With this setup, the aggregate throughput out of the box is ~45gbit.

Once I change any hardware-related setting, such as the rx/tx ring buffers (default 1024), to any higher or lower value and then revert it to the default, the throughput jumps to ~90gbit.
E.g.:
default (boot) (rx and tx are both set to 1024) → ~45gbit
ethtool -G enp129s0f0np0 rx 100 tx 100 → ~90gbit
ethtool -G enp129s0f0np0 rx 1024 tx 1024 → ~90gbit

Similar behavior is observed when just changing the port MTU from 1500 (default) to 9000 and back to 1500.

I believe the “problem” is somewhere on the RX data path.
Can someone please help me make some sense of this?
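
One way to narrow this down is to diff the driver-visible state between the slow case (fresh boot) and the fast case (after the toggle). Below is a minimal sketch, assuming a POSIX shell and root privileges. The function names `snapshot` and `ring_toggle_diff` and the output directory are made up for illustration; only standard `ethtool` options are used. The underlying guess, which is an assumption and not something confirmed here, is that `ethtool -G` and an MTU change both make the driver rebuild its queues, so some setting may come back different from its boot-time value.

```shell
# Sketch only: dump the settings ethtool can see, so two states can be diffed.
# snapshot IFACE OUTFILE
snapshot() {
    {
        echo "## rings";      ethtool -g "$1"
        echo "## channels";   ethtool -l "$1"
        echo "## coalesce";   ethtool -c "$1"
        echo "## features";   ethtool -k "$1"
        echo "## priv-flags"; ethtool --show-priv-flags "$1"
    } > "$2" 2>&1
}

# ring_toggle_diff IFACE [DIR]
# Repeats the toggle from the post (rx/tx 100, then back to 1024) and
# diffs the snapshots taken before and after. DIR defaults to /tmp.
ring_toggle_diff() {
    rt_iface="$1"
    rt_dir="${2:-/tmp}"
    snapshot "$rt_iface" "$rt_dir/before.txt"
    ethtool -G "$rt_iface" rx 100 tx 100
    ethtool -G "$rt_iface" rx 1024 tx 1024
    snapshot "$rt_iface" "$rt_dir/after.txt"
    # diff exits 1 when the files differ; that is the interesting case,
    # not an error, so the status is ignored.
    diff -u "$rt_dir/before.txt" "$rt_dir/after.txt" || true
}
```

After sourcing this on a freshly booted node, `ring_toggle_diff enp129s0f0np0` prints whatever changed. An empty diff would suggest the difference sits in state that ethtool does not expose.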

Linux node113 5.15.0-40-generic #43-Ubuntu SMP Wed Jun 15 12:54:21 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

[ 0.805346] pci 0000:81:00.0: [15b3:101b] type 00 class 0x020000
[ 0.805475] pci 0000:81:00.0: reg 0x10: [mem 0x5816e000000-0x5816fffffff 64bit pref]
[ 0.805761] pci 0000:81:00.0: reg 0x30: [mem 0xb2300000-0xb23fffff pref]
[ 0.806357] pci 0000:81:00.0: PME# supported from D3cold
[ 0.844530] pci 0000:81:00.0: Adding to iommu group 123
[ 1.896377] mlx5_core 0000:81:00.0: firmware version: 20.33.1048
[ 1.899670] mlx5_core 0000:81:00.0: 252.048 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x16 link)
[ 2.254623] mlx5_core 0000:81:00.0: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps
[ 2.256238] mlx5_core 0000:81:00.0: E-Switch: Total vports 2, per vport: max uc(128) max mc(2048)
[ 2.260273] mlx5_core 0000:81:00.0: Port module event: module 0, Cable plugged
[ 2.261886] mlx5_core 0000:81:00.0: mlx5_pcie_event:295:(pid 10): PCIe slot power capability was not advertised.
[ 2.287577] mlx5_core 0000:81:00.0: mlx5_fw_tracer_start:821:(pid 588): FWTracer: Ownership granted and active
[ 2.730661] mlx5_core 0000:81:00.0: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0)
[ 2.924007] mlx5_core 0000:81:00.0: Supported tc offload range - chains: 4294967294, prios: 4294967295
[ 3.155111] mlx5_core 0000:81:00.0 enp129s0f0np0: renamed from eth0
[ 6.507168] mlx5_core 0000:81:00.0 enp129s0f0np0: Link up

root@node113:~/ofed# mstlink -d 81:00.0 --cable --ddm

Operational Info

State : Active
Physical state : ETH_AN_FSM_ENABLE
Speed : 100G
Width : 4x
FEC : Standard RS-FEC - RS(528,514)
Loopback Mode : No Loopback
Auto Negotiation : ON

Supported Info

Enabled Link Speed (Ext.) : 0x000007f2 (100G_2X,100G_4X,50G_1X,50G_2X,40G,25G,10G,1G)
Supported Cable Speed (Ext.) : 0x00000200 (100G_4X)

Troubleshooting Info

Status Opcode : 0
Group Opcode : N/A
Recommendation : No issue was observed.

Cable DDM Information

Temperature : 48C
Voltage : 0.3292V
Channels : Channel 1 ,Channel 2 ,Channel 3 ,Channel 4
RX Power : 2.000dBm ,2.000dBm ,2.000dBm ,2.000dBm
TX Power : 2.000dBm ,2.000dBm ,2.000dBm ,2.000dBm
TX Bias : 40.632mA ,40.496mA ,39.678mA ,40.906mA

Device #1:

Device type: ConnectX6
Name: MCX653106A-ECA_Ax
Description: ConnectX-6 VPI adapter card; H100Gb/s (HDR100; EDR IB and 100GbE); dual-port QSFP56; PCIe3.0 x16; tall bracket; ROHS R6
Device: 81:00.0

Configurations: Next Boot
MEMIC_BAR_SIZE 0
MEMIC_SIZE_LIMIT _256KB(1)
HOST_CHAINING_MODE DISABLED(0)
HOST_CHAINING_CACHE_DISABLE False(0)
HOST_CHAINING_DESCRIPTORS Array[0…7]
HOST_CHAINING_TOTAL_BUFFER_SIZE Array[0…7]
FLEX_PARSER_PROFILE_ENABLE 0
FLEX_IPV4_OVER_VXLAN_PORT 0
ROCE_NEXT_PROTOCOL 254
ESWITCH_HAIRPIN_DESCRIPTORS Array[0…7]
ESWITCH_HAIRPIN_TOT_BUFFER_SIZE Array[0…7]
PF_BAR2_SIZE 0
NON_PREFETCHABLE_PF_BAR False(0)
VF_VPD_ENABLE False(0)
PF_NUM_PF_MSIX_VALID False(0)
PER_PF_NUM_SF False(0)
STRICT_VF_MSIX_NUM False(0)
VF_NODNIC_ENABLE False(0)
NUM_PF_MSIX_VALID True(1)
NUM_OF_VFS 0
NUM_OF_PF 2
PF_BAR2_ENABLE False(0)
SRIOV_EN False(0)
PF_LOG_BAR_SIZE 5
VF_LOG_BAR_SIZE 1
NUM_PF_MSIX 63
NUM_VF_MSIX 11
INT_LOG_MAX_PAYLOAD_SIZE AUTOMATIC(0)
PCIE_CREDIT_TOKEN_TIMEOUT 0
ACCURATE_TX_SCHEDULER False(0)
PARTIAL_RESET_EN False(0)
RESET_WITH_HOST_ON_ERRORS False(0)
DISABLE_SLOT_POWER_LIMITER True(1)
ADVANCED_POWER_SETTINGS True(1)
CQE_COMPRESSION BALANCED(0)
IP_OVER_VXLAN_EN False(0)
MKEY_BY_NAME False(0)
PRIO_TAG_REQUIRED_EN False(0)
UCTX_EN True(1)
PCI_ATOMIC_MODE PCI_ATOMIC_DISABLED_EXT_ATOMIC_ENABLED(0)
TUNNEL_ECN_COPY_DISABLE False(0)
LRO_LOG_TIMEOUT0 6
LRO_LOG_TIMEOUT1 7
LRO_LOG_TIMEOUT2 8
LRO_LOG_TIMEOUT3 13
LOG_TX_PSN_WINDOW 7
LOG_MAX_OUTSTANDING_WQE 7
TUNNEL_IP_PROTO_ENTROPY_DISABLE False(0)
ICM_CACHE_MODE DEVICE_DEFAULT(0)
TX_SCHEDULER_BURST 0
LOG_DCR_HASH_TABLE_SIZE 11
DCR_LIFO_SIZE 16384
LINK_TYPE_P1 ETH(2)
LINK_TYPE_P2 ETH(2)
ROCE_CC_PRIO_MASK_P1 255
ROCE_CC_PRIO_MASK_P2 255
CLAMP_TGT_RATE_AFTER_TIME_INC_P1 True(1)
CLAMP_TGT_RATE_P1 False(0)
RPG_TIME_RESET_P1 300
RPG_BYTE_RESET_P1 32767
RPG_THRESHOLD_P1 1
RPG_MAX_RATE_P1 0
RPG_AI_RATE_P1 5
RPG_HAI_RATE_P1 50
RPG_GD_P1 11
RPG_MIN_DEC_FAC_P1 50
RPG_MIN_RATE_P1 1
RATE_TO_SET_ON_FIRST_CNP_P1 0
DCE_TCP_G_P1 1019
DCE_TCP_RTT_P1 1
RATE_REDUCE_MONITOR_PERIOD_P1 4
INITIAL_ALPHA_VALUE_P1 1023
MIN_TIME_BETWEEN_CNPS_P1 4
CNP_802P_PRIO_P1 6
CNP_DSCP_P1 48
CLAMP_TGT_RATE_AFTER_TIME_INC_P2 True(1)
CLAMP_TGT_RATE_P2 False(0)
RPG_TIME_RESET_P2 300
RPG_BYTE_RESET_P2 32767
RPG_THRESHOLD_P2 1
RPG_MAX_RATE_P2 0
RPG_AI_RATE_P2 5
RPG_HAI_RATE_P2 50
RPG_GD_P2 11
RPG_MIN_DEC_FAC_P2 50
RPG_MIN_RATE_P2 1
RATE_TO_SET_ON_FIRST_CNP_P2 0
DCE_TCP_G_P2 1019
DCE_TCP_RTT_P2 1
RATE_REDUCE_MONITOR_PERIOD_P2 4
INITIAL_ALPHA_VALUE_P2 1023
MIN_TIME_BETWEEN_CNPS_P2 4
CNP_802P_PRIO_P2 6
CNP_DSCP_P2 48
LLDP_NB_DCBX_P1 False(0)
LLDP_NB_RX_MODE_P1 OFF(0)
LLDP_NB_TX_MODE_P1 OFF(0)
LLDP_NB_DCBX_P2 False(0)
LLDP_NB_RX_MODE_P2 OFF(0)
LLDP_NB_TX_MODE_P2 OFF(0)
DCBX_IEEE_P1 True(1)
DCBX_CEE_P1 True(1)
DCBX_WILLING_P1 True(1)
DCBX_IEEE_P2 True(1)
DCBX_CEE_P2 True(1)
DCBX_WILLING_P2 True(1)
KEEP_ETH_LINK_UP_P1 True(1)
KEEP_IB_LINK_UP_P1 False(0)
KEEP_LINK_UP_ON_BOOT_P1 False(0)
KEEP_LINK_UP_ON_STANDBY_P1 False(0)
DO_NOT_CLEAR_PORT_STATS_P1 False(0)
AUTO_POWER_SAVE_LINK_DOWN_P1 False(0)
KEEP_ETH_LINK_UP_P2 True(1)
KEEP_IB_LINK_UP_P2 False(0)
KEEP_LINK_UP_ON_BOOT_P2 False(0)
KEEP_LINK_UP_ON_STANDBY_P2 False(0)
DO_NOT_CLEAR_PORT_STATS_P2 False(0)
AUTO_POWER_SAVE_LINK_DOWN_P2 False(0)
NUM_OF_VL_P1 _4_VLs(3)
NUM_OF_TC_P1 _8_TCs(0)
NUM_OF_PFC_P1 8
VL15_BUFFER_SIZE_P1 0
NUM_OF_VL_P2 _4_VLs(3)
NUM_OF_TC_P2 _8_TCs(0)
NUM_OF_PFC_P2 8
VL15_BUFFER_SIZE_P2 0
DUP_MAC_ACTION_P1 LAST_CFG(0)
UNKNOWN_UPLINK_MAC_FLOOD_P1 False(0)
SRIOV_IB_ROUTING_MODE_P1 LID(1)
IB_ROUTING_MODE_P1 LID(1)
DUP_MAC_ACTION_P2 LAST_CFG(0)
UNKNOWN_UPLINK_MAC_FLOOD_P2 False(0)
SRIOV_IB_ROUTING_MODE_P2 LID(1)
IB_ROUTING_MODE_P2 LID(1)
PF_TOTAL_SF 0
PF_SF_BAR_SIZE 0
PF_NUM_PF_MSIX 63
ROCE_CONTROL ROCE_ENABLE(2)
PCI_WR_ORDERING per_mkey(0)
MULTI_PORT_VHCA_EN False(0)
PORT_OWNER True(1)
ALLOW_RD_COUNTERS True(1)
RENEG_ON_CHANGE True(1)
TRACER_ENABLE True(1)
IP_VER IPv4(0)
BOOT_UNDI_NETWORK_WAIT 0
UEFI_HII_EN True(1)
BOOT_DBG_LOG False(0)
UEFI_LOGS DISABLED(0)
BOOT_VLAN 1
LEGACY_BOOT_PROTOCOL PXE(1)
BOOT_RETRY_CNT NONE(0)
BOOT_INTERRUPT_DIS False(0)
BOOT_LACP_DIS True(1)
BOOT_VLAN_EN False(0)
BOOT_PKEY 0
P2P_ORDERING_MODE DEVICE_DEFAULT(0)
ATS_ENABLED False(0)
DYNAMIC_VF_MSIX_TABLE False(0)
EXP_ROM_UEFI_ARM_ENABLE True(1)
EXP_ROM_UEFI_x86_ENABLE True(1)
EXP_ROM_PXE_ENABLE True(1)
ADVANCED_PCI_SETTINGS False(0)
SAFE_MODE_THRESHOLD 10
SAFE_MODE_ENABLE True(1)

Module Info

Identifier : QSFP28
Compliance : 100GBASE-LR4 or 25GBASE-LR
Cable Technology : 1310 nm DFB
Cable Type : Optical Module (separated)
OUI : Other
Vendor Name : FINISARCORP.
Vendor Part Number : FTLC1154RDPL-A5
Vendor Serial Number : U6EAGPE
Rev : A0
Wavelength [nm] : 1302
Transfer Distance [m] : 0
Attenuation (5g,7g,12g) [dB] : N/A
FW Version : N/A
Digital Diagnostic Monitoring : Yes
Power Class : 3.5 W max
CDR RX : ON,ON,ON,ON
CDR TX : ON,ON,ON,ON
LOS Alarm : N/A
Temperature [C] : 51 [-5…75]
Voltage [mV] : 3288.9 [2970…3630]
Bias Current [mA] : 40.596,40.122,39.750,40.724 [25…55]
Rx Power Current [dBm] : 2,2,2,2 [-14…6]
Tx Power Current [dBm] : 2,2,4,2 [-8…8]

It is very strange. With the default ring buffers (1024) we see drops at ~45gbit; after increasing them to 8k we can push ~80-85gbit. We have not changed the MTU; I think it could be raised if the router supports it.
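
To check whether the drops mentioned above show up in the NIC statistics, the `ethtool -S` output can be filtered for non-zero loss counters. This is a rough sketch, assuming a POSIX shell with `awk`. The function name `drop_counters` is made up for illustration, and the regex is only a name filter based on counter names I would expect from mlx5 (such as `rx_out_of_buffer`), not an authoritative list.

```shell
# Sketch only: print non-zero counters whose name suggests packet loss.
# drop_counters IFACE
drop_counters() {
    ethtool -S "$1" | awk -F: '
        /drop|discard|out_of_buffer|buff_alloc_err/ {
            gsub(/[ \t]/, "", $1)      # strip indentation from the name
            gsub(/[ \t]/, "", $2)      # strip padding from the value
            if ($2 + 0 > 0) print $1, $2
        }'
}
```

Running `drop_counters enp129s0f0np0` before and after an iperf3 run, once with the default rings and once after `ethtool -G enp129s0f0np0 rx 8192 tx 8192`, should show which counters grow in the slow case.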