Low Bandwidth with ConnectX-6 Adapters

Hello,

I have an HPC cluster with a BeeGFS storage mount. When I test the write speed with dd on the BeeGFS mount from nodes that have a ConnectX-5 card, I can reach up to 6 GB/s.

When I do the same test on nodes with ConnectX-6 cards, I only reach about 2.7 GB/s.
I have already read the node tuning documentation, but none of the recommended tuning changes helped.
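
For reference, the dd test is of this form (the mount point, block size, and count here are placeholders, not necessarily the exact values used):

# sequential write test against the BeeGFS mount; direct I/O to bypass the page cache
dd if=/dev/zero of=/mnt/beegfs/ddtest bs=1M count=32768 oflag=direct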

Setup:

Firmware: all up to date
OS: Rocky Linux 8.10 on all nodes

InfiniBand switch: Mellanox QM8700

Fast nodes (6 GB/s) with ConnectX-5 cards:

Node gpu0

Vendor ID: GenuineIntel
BIOS Vendor ID: Intel(R) Corporation
CPU family: 6
Model: 143
Model name: Intel(R) Xeon(R) Gold 6438M
BIOS Model name: Intel(R) Xeon(R) Gold 6438M
Stepping: 8
CPU MHz: 3730.295

Node login (virtualized)

Vendor ID: AuthenticAMD
BIOS Vendor ID: QEMU
CPU family: 25
Model: 1
Model name: AMD EPYC 7453 28-Core Processor
BIOS Model name: pc-i440fx-8.1
Stepping: 1
CPU MHz: 2749.998
BogoMIPS: 5499.99
Virtualization: AMD-V

Slow nodes (~2.7 GB/s) with ConnectX-6 cards:

Node gpu1
Model name: AMD EPYC 7662 64-Core Processor
BIOS Model name: AMD EPYC 7662 64-Core Processor
Stepping: 0
CPU MHz: 2000.000
CPU max MHz: 2154.2959
CPU min MHz: 1500.0000
BogoMIPS: 3999.74

Node gpu2

Vendor ID: AuthenticAMD
BIOS Vendor ID: Advanced Micro Devices, Inc.
CPU family: 23
Model: 49
Model name: AMD EPYC 7662 64-Core Processor
BIOS Model name: AMD EPYC 7662 64-Core Processor
Stepping: 0
CPU MHz: 2000.000
CPU max MHz: 2154.2959
CPU min MHz: 1500.0000

Is there anything specific known about this AMD EPYC series in combination with ConnectX-6 adapters?

However, ib_write_bw reports roughly the same speed on both node types.

[root@gpu2 ~]# ib_write_bw 192.172.1.13 -p 1815
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : mlx5_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 TX depth        : 128
 CQ Moderation   : 1
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x0d QPN 0x0445 PSN 0xe330cc RKey 0x200c00 VAddr 0x007fbe59a6b000
 remote address: LID 0x06 QPN 0x096e PSN 0x2b944a RKey 0x004d90 VAddr 0x007f0b27954000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MiB/sec]    BW average[MiB/sec]   MsgRate[Mpps]
Conflicting CPU frequency values detected: 2000.000000 != 3293.977000. CPU Frequency is not max.
 65536      5000             11220.83            11220.21                    0.179523
---------------------------------------------------------------------------------------
[root@gpu2 ~]# ssh gpu0
Last login: Wed Nov  6 10:45:02 2024 from 192.168.1.21
[root@gpu0 ~]# ib_write_bw 192.172.1.13 -p 1815
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : mlx5_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 TX depth        : 128
 CQ Moderation   : 1
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x0a QPN 0x01a6 PSN 0xbcd64b RKey 0x21db00 VAddr 0x007f71c5bba000
 remote address: LID 0x06 QPN 0x096f PSN 0xb70ee4 RKey 0x004d00 VAddr 0x007f5240565000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MiB/sec]    BW average[MiB/sec]   MsgRate[Mpps]
 65536      5000             11508.22            11507.83                    0.184125
---------------------------------------------------------------------------------------
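
One thing that stands out is the "Conflicting CPU frequency values ... CPU Frequency is not max" warning in the gpu2 run. The scaling governor and tuned profile on the EPYC nodes can be checked with standard tools; this is only a sketch of the kind of check meant, not output from my nodes:

# show the current scaling governor and frequency limits (cpupower is in kernel-tools)
cpupower frequency-info
# show the active tuned profile on Rocky Linux
tuned-adm active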

lspci shows the card running at the full PCIe Gen4 x16 link width:

[root@gpu1 ~]# lspci -vv -s a1:00.0
a1:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
        Subsystem: Mellanox Technologies Device 0009
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 159
        NUMA node: 1
        IOMMU group: 112
        Region 0: Memory at 6213e000000 (64-bit, prefetchable) [size=32M]
        Expansion ROM at b6400000 [disabled] [size=1M]
        Capabilities: [60] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 75.000W
                DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 512 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
                LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM not supported
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 16GT/s (ok), Width x16 (ok)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABC, TimeoutDis+ NROPrPrP- LTR-
                         10BitTagComp+ 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- TPHComp- ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
                         AtomicOpsCtl: ReqEn+
                LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
                LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
                         EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [48] Vital Product Data
                Product Name: ConnectX-6 VPI adapter card, HDR IB (200Gb/s) and 200GbE, dual-port QSFP56                                                                                        
                Read-only fields:
                        [PN] Part number: MCX653106A-HDAT
                        [EC] Engineering changes: AH
                        [V2] Vendor specific: MCX653106A-HDAT
                        [SN] Serial number: MT2244T00FCU
                        [V3] Vendor specific: 1a200186c255ed118000b83fd2a6c50c
                        [VA] Vendor specific: MLX:MN=MLNX:CSKU=V2:UUID=V3:PCI=V0:MODL=CX653106A 
                        [V0] Vendor specific: PCIeGen4 x16
                        [VU] Vendor specific: MT2244T00FCUMLNXS0D0F0
                        [RV] Reserved: checksum good, 1 byte(s) reserved
                End
        Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
                Vector table: BAR=0 offset=00002000
                PBA: BAR=0 offset=00003000
        Capabilities: [c0] Vendor Specific Information: Len=18 <?>
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot-,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                AERCap: First Error Pointer: 08, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC- ACS-, Next Function: 1
                ARICtl: MFVC- ACS-, Function Group: 0
        Capabilities: [1c0 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn- PerformEqu-
                LaneErrStat: 0
        Capabilities: [230 v1] Access Control Services
                ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        Capabilities: [320 v1] Lane Margining at the Receiver <?>
        Capabilities: [370 v1] Physical Layer 16.0 GT/s <?>
        Capabilities: [420 v1] Data Link Feature <?>
        Kernel driver in use: mlx5_core
        Kernel modules: mlx5_core
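
For completeness, the IB-side link state and negotiated rate on the slow nodes can be checked as well; a minimal sketch with standard InfiniBand tools (the device name mlx5_0 is taken from the ib_write_bw runs above, output omitted):

# port state, LID and rate as seen by the HCA (infiniband-diags)
ibstat mlx5_0
# verbs-level device and port attributes (rdma-core)
ibv_devinfo -d mlx5_0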

There is another post that mentions performance issues on ConnectX-6 adapters:

But I don't know whether that case is related to the old OS used there …