Bluefield-2 network performance degradation

We have two servers, each with a BlueField-2 installed, connected through a 100G switch via the BF2’s p1 interface. Each BlueField-2 is set to work in Embedded CPU Function Ownership (ECPF) mode. We tested the bandwidth of the link and at first everything went well: we measured about 93Gbps.

Then we tried the VF QoS function described in this tutorial on one of the two BlueField-2 cards (I’ll call it BF2-A below) and limited the maximum egress rate of p1; the other BlueField-2 was left untouched. QoS worked as expected: the egress rate of p1 on BF2-A was limited to the value we set. Then we removed all QoS-related settings on BF2-A, and the problem appeared: when we tested bandwidth with iperf2 again, the maximum egress rate of p1 on BF2-A had dropped to only about 50Gbps, while the ingress rate was still about 93Gbps.

We repeated the iperf test many times and the results were the same. We suspected that some QoS settings had not been removed completely, so we reinstalled BF2-A’s OS using the BFB-install tool, but it didn’t help.
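To make repeated runs easy to compare, the aggregate rate can be pulled out of each iperf2 report. The following is only a minimal sketch: `last_rate` is a hypothetical helper name, `SERVER` is a placeholder, and it assumes iperf2's plain text report, where the rate and its unit are the last two fields of each result line and the `[SUM]` line is printed last when `-P` is used.

```shell
#!/bin/sh
# last_rate (hypothetical helper): read an iperf2 text report on stdin and
# print the rate from the last result line, e.g. "50.0 Gbits/sec".
# Assumes the rate value and its unit are the last two fields of the line.
last_rate() {
    awk '/bits\/sec/ { rate = $(NF - 1); unit = $NF }
         END { if (rate != "") print rate, unit }'
}

# Example (commented out, needs a live iperf2 server):
# five 30-second runs with 8 parallel streams, reported in Gbits/sec.
# SERVER is a placeholder for the receiving host's address.
# for i in 1 2 3 4 5; do
#     iperf -c "$SERVER" -P 8 -t 30 -f g | last_rate
# done
```

Running the loop once in each direction gives one egress and one ingress figure per run, which makes an asymmetry like 50Gbps versus 93Gbps easy to spot.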

Then we wanted to switch BF2-A to separated host mode to see whether that would help, and another strange thing happened: we couldn’t actually switch the mode. We followed the instructions in NVIDIA’s tutorial ‘Modes of Operation’, and after rebooting both the host server and BF2-A, the output of mlxconfig q showed BF2-A in separated mode. But BF2-A still behaves as if it is working in ECPF mode. For example, the statistics (n_packets and n_bytes) in the output of ‘ovs-ofctl dump-flows’ keep growing whenever there is network traffic between the two host servers.

To summarize:

  1. The egress bandwidth has degraded to about 50Gbps, and reinstalling the DPU OS doesn’t help;
  2. We can’t actually switch the BlueField-2 to separated host mode, even though mlxconfig reports that it is already in that mode.

Are there any suggestions for this situation?

Here is the output of ‘mlxconfig -d /dev/mst/mt41686_pciconf0.1 q’ (before switching to separated host mode):

Device #1:

Device type: BlueField2
Name: MBF2M516A-CEEO_Ax_Bx
Description: BlueField-2 E-Series DPU 100GbE Dual-Port QSFP56; PCIe Gen4 x16; Crypto Enabled; 16GB on-board DDR; 1GbE OOB management; FHHL
Device: /dev/mst/mt41686_pciconf0.1

Configurations: Next Boot
MEMIC_BAR_SIZE 0
MEMIC_SIZE_LIMIT _256KB(1)
HOST_CHAINING_MODE DISABLED(0)
HOST_CHAINING_CACHE_DISABLE False(0)
HOST_CHAINING_DESCRIPTORS Array[0…7]
HOST_CHAINING_TOTAL_BUFFER_SIZE Array[0…7]
INTERNAL_CPU_MODEL EMBEDDED_CPU(1)
FLEX_PARSER_PROFILE_ENABLE 0
PROG_PARSE_GRAPH False(0)
FLEX_IPV4_OVER_VXLAN_PORT 0
ROCE_NEXT_PROTOCOL 254
ESWITCH_HAIRPIN_DESCRIPTORS Array[0…7]
ESWITCH_HAIRPIN_TOT_BUFFER_SIZE Array[0…7]
PF_BAR2_SIZE 3
PF_NUM_OF_VF_VALID False(0)
NON_PREFETCHABLE_PF_BAR False(0)
VF_VPD_ENABLE False(0)
PF_NUM_PF_MSIX_VALID False(0)
PER_PF_NUM_SF True(1)
STRICT_VF_MSIX_NUM False(0)
VF_NODNIC_ENABLE False(0)
NUM_PF_MSIX_VALID True(1)
NUM_OF_VFS 125
NUM_OF_PF 2
PF_BAR2_ENABLE False(0)
HIDE_PORT2_PF False(0)
SRIOV_EN True(1)
PF_LOG_BAR_SIZE 5
VF_LOG_BAR_SIZE 1
NUM_PF_MSIX 63
NUM_VF_MSIX 11
INT_LOG_MAX_PAYLOAD_SIZE AUTOMATIC(0)
PCIE_CREDIT_TOKEN_TIMEOUT 0
LAG_RESOURCE_ALLOCATION DEVICE_DEFAULT(0)
PHY_COUNT_LINK_UP_DELAY DELAY_NONE(0)
ACCURATE_TX_SCHEDULER False(0)
PARTIAL_RESET_EN False(0)
RESET_WITH_HOST_ON_ERRORS False(0)
NVME_EMULATION_ENABLE False(0)
NVME_EMULATION_NUM_VF 0
NVME_EMULATION_NUM_PF 1
NVME_EMULATION_VENDOR_ID 5555
NVME_EMULATION_DEVICE_ID 24577
NVME_EMULATION_CLASS_CODE 67586
NVME_EMULATION_REVISION_ID 0
NVME_EMULATION_SUBSYSTEM_VENDOR_ID 0
NVME_EMULATION_SUBSYSTEM_ID 0
NVME_EMULATION_NUM_MSIX 0
PCI_SWITCH_EMULATION_NUM_PORT 0
PCI_SWITCH_EMULATION_ENABLE False(0)
VIRTIO_NET_EMULATION_ENABLE False(0)
VIRTIO_NET_EMULATION_NUM_VF 0
VIRTIO_NET_EMULATION_NUM_PF 0
VIRTIO_NET_EMU_SUBSYSTEM_VENDOR_ID 6900
VIRTIO_NET_EMULATION_SUBSYSTEM_ID 1
VIRTIO_NET_EMULATION_NUM_MSIX 2
VIRTIO_BLK_EMULATION_ENABLE False(0)
VIRTIO_BLK_EMULATION_NUM_VF 0
VIRTIO_BLK_EMULATION_NUM_PF 0
VIRTIO_BLK_EMU_SUBSYSTEM_VENDOR_ID 6900
VIRTIO_BLK_EMULATION_SUBSYSTEM_ID 2
VIRTIO_BLK_EMULATION_NUM_MSIX 2
PCI_DOWNSTREAM_PORT_OWNER Array[0…15]
CQE_COMPRESSION BALANCED(0)
IP_OVER_VXLAN_EN False(0)
MKEY_BY_NAME False(0)
PRIO_TAG_REQUIRED_EN False(0)
UCTX_EN True(1)
REAL_TIME_CLOCK_ENABLE False(0)
RDMA_SELECTIVE_REPEAT_EN False(0)
PCI_ATOMIC_MODE PCI_ATOMIC_DISABLED_EXT_ATOMIC_ENABLED(0)
TUNNEL_ECN_COPY_DISABLE False(0)
LRO_LOG_TIMEOUT0 6
LRO_LOG_TIMEOUT1 7
LRO_LOG_TIMEOUT2 8
LRO_LOG_TIMEOUT3 13
LOG_TX_PSN_WINDOW 7
LOG_MAX_OUTSTANDING_WQE 7
TUNNEL_IP_PROTO_ENTROPY_DISABLE False(0)
ICM_CACHE_MODE DEVICE_DEFAULT(0)
TLS_OPTIMIZE False(0)
TX_SCHEDULER_BURST 0
ZERO_TOUCH_TUNING_ENABLE False(0)
ROCE_CC_LEGACY_DCQCN True(1)
LOG_MAX_QUEUE 17
LOG_DCR_HASH_TABLE_SIZE 11
DCR_LIFO_SIZE 16384
ROCE_CC_PRIO_MASK_P1 255
ROCE_CC_PRIO_MASK_P2 255
CLAMP_TGT_RATE_AFTER_TIME_INC_P1 True(1)
CLAMP_TGT_RATE_P1 False(0)
RPG_TIME_RESET_P1 300
RPG_BYTE_RESET_P1 32767
RPG_THRESHOLD_P1 1
RPG_MAX_RATE_P1 0
RPG_AI_RATE_P1 5
RPG_HAI_RATE_P1 50
RPG_GD_P1 11
RPG_MIN_DEC_FAC_P1 50
RPG_MIN_RATE_P1 1
RATE_TO_SET_ON_FIRST_CNP_P1 0
DCE_TCP_G_P1 1019
DCE_TCP_RTT_P1 1
RATE_REDUCE_MONITOR_PERIOD_P1 4
INITIAL_ALPHA_VALUE_P1 1023
MIN_TIME_BETWEEN_CNPS_P1 4
CNP_802P_PRIO_P1 6
CNP_DSCP_P1 48
CLAMP_TGT_RATE_AFTER_TIME_INC_P2 True(1)
CLAMP_TGT_RATE_P2 False(0)
RPG_TIME_RESET_P2 300
RPG_BYTE_RESET_P2 32767
RPG_THRESHOLD_P2 1
RPG_MAX_RATE_P2 0
RPG_AI_RATE_P2 5
RPG_HAI_RATE_P2 50
RPG_GD_P2 11
RPG_MIN_DEC_FAC_P2 50
RPG_MIN_RATE_P2 1
RATE_TO_SET_ON_FIRST_CNP_P2 0
DCE_TCP_G_P2 1019
DCE_TCP_RTT_P2 1
RATE_REDUCE_MONITOR_PERIOD_P2 4
INITIAL_ALPHA_VALUE_P2 1023
MIN_TIME_BETWEEN_CNPS_P2 4
CNP_802P_PRIO_P2 6
CNP_DSCP_P2 48
LLDP_NB_DCBX_P1 False(0)
LLDP_NB_RX_MODE_P1 OFF(0)
LLDP_NB_TX_MODE_P1 OFF(0)
LLDP_NB_DCBX_P2 False(0)
LLDP_NB_RX_MODE_P2 OFF(0)
LLDP_NB_TX_MODE_P2 OFF(0)
DCBX_IEEE_P1 True(1)
DCBX_CEE_P1 True(1)
DCBX_WILLING_P1 True(1)
DCBX_IEEE_P2 True(1)
DCBX_CEE_P2 True(1)
DCBX_WILLING_P2 True(1)
KEEP_ETH_LINK_UP_P1 True(1)
KEEP_IB_LINK_UP_P1 False(0)
KEEP_LINK_UP_ON_BOOT_P1 False(0)
KEEP_LINK_UP_ON_STANDBY_P1 False(0)
DO_NOT_CLEAR_PORT_STATS_P1 False(0)
AUTO_POWER_SAVE_LINK_DOWN_P1 False(0)
KEEP_ETH_LINK_UP_P2 True(1)
KEEP_IB_LINK_UP_P2 False(0)
KEEP_LINK_UP_ON_BOOT_P2 False(0)
KEEP_LINK_UP_ON_STANDBY_P2 False(0)
DO_NOT_CLEAR_PORT_STATS_P2 False(0)
AUTO_POWER_SAVE_LINK_DOWN_P2 False(0)
NUM_OF_VL_P1 _4_VLs(3)
NUM_OF_TC_P1 _8_TCs(0)
NUM_OF_PFC_P1 8
VL15_BUFFER_SIZE_P1 0
NUM_OF_VL_P2 _4_VLs(3)
NUM_OF_TC_P2 _8_TCs(0)
NUM_OF_PFC_P2 8
VL15_BUFFER_SIZE_P2 0
DUP_MAC_ACTION_P1 LAST_CFG(0)
MPFS_MC_LOOPBACK_DISABLE_P1 False(0)
MPFS_UC_LOOPBACK_DISABLE_P1 False(0)
UNKNOWN_UPLINK_MAC_FLOOD_P1 False(0)
SRIOV_IB_ROUTING_MODE_P1 LID(1)
IB_ROUTING_MODE_P1 LID(1)
DUP_MAC_ACTION_P2 LAST_CFG(0)
MPFS_MC_LOOPBACK_DISABLE_P2 False(0)
MPFS_UC_LOOPBACK_DISABLE_P2 False(0)
UNKNOWN_UPLINK_MAC_FLOOD_P2 False(0)
SRIOV_IB_ROUTING_MODE_P2 LID(1)
IB_ROUTING_MODE_P2 LID(1)
PF_TOTAL_SF 200
PF_SF_BAR_SIZE 10
PF_NUM_PF_MSIX 63
ROCE_CONTROL ROCE_ENABLE(2)
PCI_WR_ORDERING per_mkey(0)
MULTI_PORT_VHCA_EN False(0)
ECPF_ESWITCH_MANAGER ECPF(1)
ECPF_PAGE_SUPPLIER ECPF(1)
PORT_OWNER True(1)
ALLOW_RD_COUNTERS True(1)
RENEG_ON_CHANGE True(1)
TRACER_ENABLE True(1)
IP_VER IPv4(0)
BOOT_UNDI_NETWORK_WAIT 0
UEFI_HII_EN True(1)
BOOT_DBG_LOG False(0)
UEFI_LOGS DISABLED(0)
BOOT_VLAN 1
LEGACY_BOOT_PROTOCOL PXE(1)
BOOT_RETRY_CNT NONE(0)
BOOT_INTERRUPT_DIS False(0)
BOOT_LACP_DIS True(1)
BOOT_VLAN_EN False(0)
BOOT_PKEY 0
P2P_ORDERING_MODE DEVICE_DEFAULT(0)
EXP_ROM_VIRTIO_NET_PXE_ENABLE False(0)
EXP_ROM_VIRTIO_NET_UEFI_x86_ENABLE False(0)
EXP_ROM_VIRTIO_BLK_UEFI_x86_ENABLE False(0)
EXP_ROM_NVME_UEFI_x86_ENABLE True(1)
ATS_ENABLED False(0)
DYNAMIC_VF_MSIX_TABLE False(0)
EXP_ROM_UEFI_ARM_ENABLE True(1)
EXP_ROM_UEFI_x86_ENABLE True(1)
EXP_ROM_PXE_ENABLE True(1)
ADVANCED_PCI_SETTINGS False(0)
SAFE_MODE_THRESHOLD 10
SAFE_MODE_ENABLE True(1)

Hi,

Regarding the QoS settings: this requires a more detailed investigation of exactly what was configured and how you reverted the configuration. In this situation we usually ask you to open a support case in the NVIDIA portal.
In any case, all the sysfs configurations are runtime-only and don’t survive a reboot, so to make sure that all of them have reverted to the defaults, you can simply reboot the DPU.
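If you want to confirm after the reboot that no rate limit is left behind, a check along these lines may help. It is only a sketch: `nonzero_limits` is a hypothetical helper, and both the sysfs paths in the usage note and the idea that a value of 0 means "no limit" are assumptions that should be checked against the QoS tutorial you followed.

```shell
#!/bin/sh
# nonzero_limits (hypothetical helper): for each readable file given as an
# argument, print "path=value" when the file's content is not "0".
# Assumption: a rate-limit file containing 0 means "no limit configured".
nonzero_limits() {
    for f in "$@"; do
        [ -r "$f" ] || continue          # skip missing or unreadable files
        v=$(cat "$f")
        [ "$v" = "0" ] || printf '%s=%s\n' "$f" "$v"
    done
}
```

For example, `nonzero_limits /sys/class/net/p0/smart_nic/*/max_tx_rate` on the DPU (path assumed, adjust to the files you actually wrote to) prints nothing when every limit is back at 0.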

Regarding changing from EMBEDDED to SEPARATED mode: after such a change you may need to perform a cold boot of the server for it to take effect.

For example, this is a server where SEPARATED mode has been configured, but it will take effect only after a cold boot of the server:

[root@l-csi-rivermax-04 ~]# mlxconfig -d 33:00.0 -e q | grep -e ^Configurations -e INTERNAL_CPU_MODEL
Configurations: Default Current Next Boot
INTERNAL_CPU_MODEL EMBEDDED_CPU(1) EMBEDDED_CPU(1) EMBEDDED_CPU(1)
[root@l-csi-rivermax-04 ~]# mlxconfig -d 33:00.0 s INTERNAL_CPU_MODEL=0

Device #1:

Device type: BlueField2
Name: MBF2H516A-CENO_Ax_Bx
Description: BlueField-2 DPU 100GbE Dual-Port QSFP56; PCIe Gen4 x16; Crypto Disabled; 16GB on-board DDR; 1GbE OOB management; FHHL
Device: 33:00.0

Configurations: Next Boot New
INTERNAL_CPU_MODEL EMBEDDED_CPU(1) SEPARATED_HOST(0)

Apply new Configuration? (y/n) [n] : y
Applying… Done!
-I- Please reboot machine to load new configurations.
[root@l-csi-rivermax-04 ~]# mlxconfig -d 33:00.0 -e q | grep -e ^Configurations -e INTERNAL_CPU_MODEL
Configurations: Default Current Next Boot
*        INTERNAL_CPU_MODEL EMBEDDED_CPU(1) EMBEDDED_CPU(1) SEPARATED_HOST(0)
[root@l-csi-rivermax-04 ~]#
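The same comparison can be scripted. The sketch below assumes the `mlxconfig -e q` layout shown in the transcript above (parameter name followed by Default, Current and Next Boot values, optionally preceded by `*`); `pending_params` is a hypothetical helper name, not part of the MFT tools.

```shell
#!/bin/sh
# pending_params (hypothetical helper): read `mlxconfig -d <dev> -e q` output
# on stdin and print every parameter whose Current value differs from its
# Next Boot value, i.e. a change that has not taken effect yet.
# Assumes rows of the form: [*] NAME DEFAULT CURRENT NEXT_BOOT
pending_params() {
    awk '{ sub(/^[ \t]*\*[ \t]*/, "") }      # drop the leading "*" marker
         NF == 4 && $3 != $4 {
             printf "%s current=%s next_boot=%s\n", $1, $3, $4
         }'
}

# Example (commented out, needs the MFT tools and a device):
# mlxconfig -d 33:00.0 -e q | pending_params
```

If INTERNAL_CPU_MODEL is still listed after a reboot, the new mode has not been loaded yet, which matches the case where a cold boot is still needed.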

Best Regards,
Anatoly
