How to stop and start rdma on CentOS 7?

My testing requires pausing RDMA connectivity and then resuming it. ifdown ib0 doesn’t stop communication once it is established.

[root@cn41 ~]# ibportstate --Ca 0xf452140300f83e50 --Port 17 disable

This is the GUID and port of the switch that cn41 is connected to, but it does not seem to work. Do we need to add any arguments?

Hi there -

Have you tried rmmod/insmod the rdma modules?

Instead I tried this. ibportstate disables the port, but I couldn’t enable it again:

[hmarne@cn37 ~]$ ibstat

CA ‘mlx4_0’

CA type: MT4099

Number of ports: 1

Firmware version: 2.30.8000

Hardware version: 1

Node GUID: 0x002590fffff76c70

System image GUID: 0x002590fffff76c73

Port 1:

State: Active

Physical state: LinkUp

Rate: 56

Base lid: 94

LMC: 0

SM lid: 4

Capability mask: 0x02514868

Port GUID: 0x002590fffff76c71

Link layer: InfiniBand

[hmarne@cn37 ~]$ sudo /usr/sbin/ibportstate 94 1 disable

Initial CA PortInfo:

Port info: Lid 94 port 1

LinkState:…Active

PhysLinkState:…LinkUp

Lid:…94

SMLid:…4

LMC:…0

LinkWidthSupported:…1X or 4X

LinkWidthEnabled:…1X or 4X

LinkWidthActive:…4X

LinkSpeedSupported:…2.5 Gbps or 5.0 Gbps or 10.0 Gbps

LinkSpeedEnabled:…2.5 Gbps or 5.0 Gbps or 10.0 Gbps

LinkSpeedActive:…10.0 Gbps

LinkSpeedExtSupported:…14.0625 Gbps

LinkSpeedExtEnabled:…14.0625 Gbps

LinkSpeedExtActive:…14.0625 Gbps

Mkey:…

MkeyLeasePeriod:…0

ProtectBits:…0

MLNX ext Port info: Lid 94 port 1

StateChangeEnable:…0x00

LinkSpeedSupported:…0x01

LinkSpeedEnabled:…0x01

LinkSpeedActive:…0x00

Disable may be irreversible

After PortInfo set:

Port info: Lid 94 port 1

LinkState:…Active

PhysLinkState:…LinkUp

Lid:…94

SMLid:…4

LMC:…0

LinkWidthSupported:…1X or 4X

LinkWidthEnabled:…1X or 4X

LinkWidthActive:…4X

LinkSpeedSupported:…2.5 Gbps or 5.0 Gbps or 10.0 Gbps

LinkSpeedEnabled:…2.5 Gbps or 5.0 Gbps or 10.0 Gbps

LinkSpeedActive:…Extended speed

LinkSpeedExtSupported:…14.0625 Gbps

LinkSpeedExtEnabled:…14.0625 Gbps

LinkSpeedExtActive:…14.0625 Gbps

Mkey:…

MkeyLeasePeriod:…0

ProtectBits:…0

[hmarne@cn37 ~]$ ibstat

CA ‘mlx4_0’

CA type: MT4099

Number of ports: 1

Firmware version: 2.30.8000

Hardware version: 1

Node GUID: 0x002590fffff76c70

System image GUID: 0x002590fffff76c73

Port 1:

State: Down

Physical state: Disabled

Rate: 10

Base lid: 94

LMC: 0

SM lid: 4

Capability mask: 0x02514868

Port GUID: 0x002590fffff76c71

Link layer: InfiniBand

[hmarne@cn37 ~]$ sudo /usr/sbin/ibportstate 94 1 query | grep -i state

ibwarn: [11055] mad_rpc_open_port: can’t open UMAD port ((null):0)

/usr/sbin/ibportstate: iberror: failed: Failed to open ‘(null)’ port ‘0’

[hmarne@cn37 ~]$

Yes - but did you try my suggestion?

disable -

rmmod rdma_ucm

rmmod rdma_cm

enable -

modprobe rdma_cm

modprobe rdma_ucm

*You may not have to remove or add rdma_cm depending on your use case.

thanks - steve
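The unload/reload sequence above could be wrapped in two small helpers. This is only a sketch: the function names `rdma_pause` and `rdma_resume` and the `DRY_RUN` variable are made up for illustration, and it assumes nothing else is holding the modules open (rmmod will refuse if they are in use).

```shell
#!/usr/bin/env bash
# Sketch: pause and resume the RDMA connection manager by unloading and
# reloading the modules named in the post above.
# DRY_RUN (hypothetical variable) defaults to 1, so commands are only printed.
# Set DRY_RUN=0 and run as root to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

rdma_pause() {
    # rdma_ucm uses rdma_cm, so it is removed first.
    run rmmod rdma_ucm
    run rmmod rdma_cm
}

rdma_resume() {
    run modprobe rdma_cm
    run modprobe rdma_ucm
}
```

With the default DRY_RUN=1, calling rdma_pause just prints the two rmmod commands, which is a cheap way to check the order before running it for real.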

If you want to recoverably disable/enable a remote CA port, you need to do that on the switch’s peer port. If the CAs are connected back to back, then the only way to re-enable the remote CA port is via some out-of-band mechanism.

– Hal

I think that the openibd script exists on CentOS 7. Is it /etc/init.d/openibd ? If it does exist, you can do restart or stop and then start.

/etc/init.d/openibd restart

or

service openibd restart

This should do everything needed (including module reloading) for restarting.

– Hal

Stopping openibd requires removing the modules. Since Lustre is mounted, we can’t do that. We want Lustre and FUSE to stay mounted during the operation; they just shouldn’t be able to do any I/O.

Do you mean that disabling the corresponding switch port would let us enable it again later?

Yes, as long as you do this from a CA that is not being disabled, since the switch will still be accessible through its other ports. The only thing this does is disable the egress switch port that is the peer of the remote CA. You should then be able to re-enable it when desired.

I don’t know if this will accomplish what you need as I’m not sure of all the lustre interactions.

Can you try it and see what happens ?

Hi Hal,

May I know how to do this? I need to bring down the peer switch port, i.e. port 17, of the remote CA (cn41).

Switch: 0xf452140300f83e50 MF0;ime-mlx216-ib-sw-01:SX6512/L12/U1:

41 1 ==( 4X 14.0625 Gbps Active/ LinkUp)==> 97 1 “sn31 HCA-2” ( )

41 2 ==( 4X 14.0625 Gbps Active/ LinkUp)==> 168 1 “sn52 HCA-1” ( )

41 3 ==( Down/ Polling)==> “” ( )

41 4 ==( 4X 14.0625 Gbps Active/ LinkUp)==> 100 1 “sn31 HCA-1” ( )

41 5 ==( Down/ Polling)==> “” ( )

41 6 ==( 4X 14.0625 Gbps Active/ LinkUp)==> 167 1 “sn53 HCA-2” ( )

41 7 ==( 4X 14.0625 Gbps Active/ LinkUp)==> 170 1 “cn43 HCA-1” ( )

41 8 ==( Down/ Polling)==> “” ( )

41 9 ==( Down/ Polling)==> “” ( )

41 10 ==( 4X 14.0625 Gbps Active/ LinkUp)==> 166 1 “sn52 HCA-2” ( )

41 11 ==( 4X 14.0625 Gbps Active/ LinkUp)==> 117 1 “cn42 HCA-1” ( )

41 12 ==( Down/ Polling)==> “” ( )

41 13 ==( Down/ Polling)==> “” ( )

41 14 ==( 4X 14.0625 Gbps Active/ LinkUp)==> 151 1 “sn08 HCA-2” ( )

41 15 ==( 4X 14.0625 Gbps Active/ LinkUp)==> 141 1 “sn22 HCA-4” ( )

41 16 ==( Down/ Polling)==> “” ( )

41 17 ==( 4X 14.0625 Gbps Active/ LinkUp)==> 120 1 “cn41 HCA-1” ( )

[root@cn41 ~]# ibstat

CA ‘mlx4_0’

CA type: MT4099

Number of ports: 1

Firmware version: 2.30.8000

Hardware version: 1

Node GUID: 0x002590fffff76da4

System image GUID: 0x002590fffff76da7

Port 1:

State: Active

Physical state: LinkUp

Rate: 56

Base lid: 120

LMC: 0

SM lid: 1

Capability mask: 0x02514868

Port GUID: 0x002590fffff76da5

Link layer: InfiniBand

[root@cn41 ~]#

I’m not familiar with --Ca option to ibportstate. Try:

ibportstate 41 17 disable

from some machine other than cn41

Yes, you can identify switch peer ports via ibnetdiscover. In the example you’ve shown, the switch is GUID 0xf452140300f83e50, LID 41, so you can do this using either the switch LID or the switch GUID (-G option).
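If you would rather not read the peer port off the listing by eye, something like the following could pull it out. It is a sketch that assumes the link listing has the layout pasted earlier in this thread (switch LID, switch port, then the link description and the remote node name in quotes); `find_peer_port` is a made-up helper name, not part of infiniband-diags.

```shell
#!/usr/bin/env bash
# find_peer_port HOST
# Reads a link listing on stdin and prints "<switch LID> <switch port>"
# for every link whose remote side (the text after "==>") names HOST.
# Assumes the listing layout shown in this thread; other layouts may differ.
find_peer_port() {
    awk -v host="$1" '
    {
        n = index($0, "==>")
        # Match "HOST " so that cn4 does not also match cn41, cn42, ...
        if (n && index(substr($0, n), host " ")) {
            port = $2
            gsub(/[^0-9]/, "", port)   # tolerate forms such as "17["
            print $1, port
        }
    }'
}
```

For example, piping the listing for this switch through `find_peer_port cn41` should give the LID and port to pass to ibportstate, here 41 and 17.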

I was able to disable and re-enable the port from a different node:

ibportstate 41 17 enable

Thanks Hal, now let me check how my application behaves.
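For repeated test runs, the disable/enable pair that worked here could be scripted roughly as follows. This is a sketch under a few assumptions: it must run on a node other than the one being cut off (not cn41 in this case), `pause_switch_port` and `DRY_RUN` are made-up names, and the LID, port and wait time are just the values from this thread.

```shell
#!/usr/bin/env bash
# Sketch: disable a switch port, wait, re-enable it, then query its state.
# DRY_RUN (hypothetical variable) defaults to 1, so commands are only printed.
# Set DRY_RUN=0 and run as root to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# pause_switch_port SWITCH_LID SWITCH_PORT SECONDS
pause_switch_port() {
    local lid="$1" port="$2" seconds="$3"
    run ibportstate "$lid" "$port" disable
    run sleep "$seconds"
    run ibportstate "$lid" "$port" enable
    run ibportstate "$lid" "$port" query
}
```

Calling `pause_switch_port 41 17 30` with the default DRY_RUN=1 prints the four commands without touching the fabric, so the sequence can be reviewed first.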