Connecting an 8790 HDR and a 9790 NDR Mellanox switch to a node

Hi,
I have an HDR network with multiple nodes connected. Now I am trying to connect a 9790 NDR switch to this HDR network with the help of a copper OSFP (finned-top) to 2x QSFP cable. The 9790 switch is in turn connected to a node with a ConnectX-7 card, and I want that node to show up in the entire network. This is where the issues arise.

The switch and the NDR node show up in my network when I run ibswitches; however, on the node I cannot get the InfiniBand interface to come up.
ibstatus shows:

CA 'ibp26s0'
        CA type: MT4129
        Number of ports: 1
        Firmware version: 28.38.1002
        Hardware version: 0
        Node GUID: 0xa088c203006043b6
        System image GUID: 0xa088c203006043b6
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 400
                Base lid: 109
                LMC: 0
                SM lid: 221
                Capability mask: 0xa751e848
                Port GUID: 0xa088c203006043b6
                Link layer: InfiniBand
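
For reference, the same port can also be cross-checked from the SM node with the standard infiniband-diags tools (LID 109 and the GUID suffix below are taken from the output above):

ibhosts | grep -i 43b6       # the CA should be listed by its node GUID
iblinkinfo | grep -i 43b6    # shows the width/speed of the link to the 9790 port
smpquery portinfo 109 1      # PortInfo as the SM sees it, addressed by LID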

whereas dmesg gives me the error:

ibp26s0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -22

I have tried changing the MTU from 4092 to 2044, but that also does not work.
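
The status -22 on the join is the SA rejecting the request, so it seems worth comparing the port against the broadcast group the SM created. A sketch of what I have been running (interface name taken from above; saquery is part of infiniband-diags):

saquery -g                    # list multicast groups; check the MTU and Rate of ff12:401b:...
# retry the join after the MTU change
ip link set ibp26s0 mtu 2044
ip link set ibp26s0 down && ip link set ibp26s0 up
dmesg | tail                  # see whether the multicast join error comes back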

ibportstate gives:

ibportstate 109 1 
CA/RT PortInfo:
# Port info: Lid 109 port 1
LinkState:.......................Active
PhysLinkState:...................LinkUp
Lid:.............................109
SMLid:...........................221
LMC:.............................0
LinkWidthSupported:..............1X or 4X or 2X
LinkWidthEnabled:................1X or 4X or 2X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps
LinkSpeedEnabled:................2.5 Gbps
LinkSpeedActive:.................Extended speed
LinkSpeedExtSupported:...........14.0625 Gbps or 25.78125 Gbps or 53.125 Gbps or 106.25 Gbps
LinkSpeedExtEnabled:.............14.0625 Gbps or 25.78125 Gbps or 53.125 Gbps or 106.25 Gbps
LinkSpeedExtActive:..............106.25 Gbps
Mkey:............................<not displayed>
MkeyLeasePeriod:.................0
ProtectBits:.....................0
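
For completeness, the switch side of the same link can be queried the same way, which makes it easy to compare the enabled and active speeds on both ends (the switch LID and port number are placeholders):

ibswitches                                      # get the 9790's LID
ibportstate <switch_lid> <switch_port> query    # placeholders: the port this HCA plugs into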

This is all on Debian 12, with the native Debian drivers and an unconfigured, out-of-the-box opensm. What am I missing? Is it a misconfiguration of opensm where NDR speeds are not allowed on HDR networks? Currently my opensm is running on an HDR node.
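
For reference, I have not touched the partition configuration; as far as I understand it, the stock default is roughly equivalent to the line below (mtu=4 is 2048 bytes and rate=3 is 10 Gbps in the IB encoding; these are the documented defaults, not copied from my system):

# /etc/opensm/partitions.conf
Default=0x7fff, ipoib, mtu=4, rate=3 : ALL=full;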

I have another node (ConnectX-7 NDR) that is connected directly to the HDR 8790 switch via a flat-top OSFP to 2x QSFP cable, and it works fine: I can use IPoIB and mount things over RDMA.

Do you know the FW version of both switches?
Thanks,
Suo

Hi Suo,
For the NDR switch, it is 31.2012.4036.
As for the HDR switch, it is:

flint -d lid-95 q
Image type:            FS4
FW Version:            27.2012.4036
FW Release Date:       30.4.2024
Product Version:       27.2012.4036
Description:           UID                GuidsNumber
Base GUID:             1070fd0300652ba2        32
Base MAC:              1070fd652ba2            32
Image VSD:             N/A
Device VSD:            N/A
PSID:                  HPE0000000064
Security Attributes:   N/A

It is an HP Mellanox switch, which is on the latest firmware provided by the vendor.
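
For completeness, the ConnectX-7 firmware on the node can be queried with the same tool via MFT; the MST device name below is just an example of what mst status might report:

mst start
mst status -v                         # lists the MST device for the ConnectX-7
flint -d /dev/mst/mt4129_pciconf0 q   # example device name, use the one mst reports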

So where is your OpenSM running? What is the OpenSM version?

OpenSM is version 3.3.23.
I also see the following error in the OpenSM logs:

mcmr_rcv_join_mgrp: ERR 1B12: validate_more_comp_fields, validate_port_caps, or JoinState = 0 failed for MGID: ff12:401b:ffff::ffff:ffff port 0xa088c20300649f5c (gpu268 ibp26s0), sending IB_SA_MAD_STATUS_REQ_INVALID
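
In case it helps, the error is easy to reproduce on demand by bouncing the IPoIB interface on the node while watching the SM log (stock log path assumed):

# On the SM node
tail -f /var/log/opensm.log | grep --line-buffered "ERR 1B12"

# On the NDR node, force a fresh multicast join attempt
ip link set ibp26s0 down && ip link set ibp26s0 up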

Please try to use MLNX_SM 5.19

Thanks,
Suo

We did try that; the issue there is that we want to mount NFS storage over RDMA, which does not work with the MLNX_SM as far as I know. If I am wrong, please feel free to correct me. Secondly, we are using Debian and not Red Hat, but I believe I saw ibutils2 for Debian as well, so that shouldn't be an issue.

The SM will not impact the NFSoRDMA feature.
At the same time, NFSoRDMA is supported in MLNX_OFED versions newer than 5.0.
It was removed in MLNX_OFED 4.x, but added back in 5.x.
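
For reference, a minimal client-side NFSoRDMA mount looks roughly like this (server name and export path are placeholders; the server must export over RDMA as well):

modprobe rpcrdma                                      # RDMA transport for NFS on the client
mount -t nfs -o rdma,port=20049 server-ib:/export /mnt/rdma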
Thanks,
Suo


Ah, thank you. Yes, that was my confusion. May I ask what the support matrix for MLNX_SM is? Is it only supported on Red Hat?

I am asking because I do not see the repository to add to my apt configuration. The key is there. Is it the same repository as the Linux InfiniBand Drivers?

If so, that is the 4.x version, from what I could see.

Also, for testing, I updated opensm to 3.3.24 by building it from the Git repo. The NDR IB interface does come up, but we can't SSH into any other node over IB.
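
To separate an IPoIB problem from an SSH problem, I have been testing roughly like this (the peer address is a placeholder on our IPoIB subnet):

# Basic reachability over the IPoIB interface
ping -c 3 -I ibp26s0 10.0.0.12           # placeholder peer address
ping -c 3 -s 2000 -I ibp26s0 10.0.0.12   # larger payload, to catch MTU/path problems

# Verbs-level check, independent of IP (run 'ibping -S' on the peer first)
ibping -c 3 221                          # ping the SM node by LID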

Hi,
I have never used OpenSM 3.x, so I can't answer your question.
In my lab I use MLNX_OFED, and I haven't met this issue.

Thanks,
Suo

Dear Suo,
That is fine. I would love to try it as well, but where do I get MLNX_OFED as a .deb package? The link only has .rpms.

Ah, never mind, I just saw that the MLNX OpenSM is in the DEBS directory of the MLNX_OFED bundle, converted via alien! Thank you.
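
For anyone else on Debian, the install itself then boiled down to roughly this (the bundle directory name is a placeholder for whatever you downloaded):

cd MLNX_OFED_LINUX-*/DEBS        # placeholder bundle directory
dpkg -i opensm_*.deb
apt-get -f install               # pull in any missing dependencies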

For anyone who also struggles with this: opensm v3.3.24 only partially works. It only allows packets of 316 bytes to be sent to NDR cards, and as a result you can't SSH into the nodes. I kept everything stock but installed the MLNX_SM, and that worked as expected.
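
A quick way to confirm the new SM has actually taken over before re-testing (the peer hostname is a placeholder):

sminfo                          # shows the master SM's LID and state
ibstat | grep -E "State|Rate"   # the port should be Active at rate 400
ssh <peer-ipoib-hostname>       # placeholder: any node reachable over IPoIB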

