How do I stop and start RDMA on CentOS 7?
My testing requires pausing RDMA connectivity and then starting it again. ifdown ib0 doesn't stop the communication once it is established.
[root@cn41 ~]# ibportstate --Ca 0xf452140300f83e50 --Port 17 disable
This is the corresponding switch and port to which cn41 is connected, but it does not seem to work. Do we need to add an argument?
Hi there -
Have you tried unloading and reloading the RDMA modules (rmmod/insmod)?
Instead I tried this: ibportstate disables the port, but I couldn't enable it again.
[hmarne@cn37 ~]$ ibstat
CA ‘mlx4_0’
CA type: MT4099
Number of ports: 1
Firmware version: 2.30.8000
Hardware version: 1
Node GUID: 0x002590fffff76c70
System image GUID: 0x002590fffff76c73
Port 1:
State: Active
Physical state: LinkUp
Rate: 56
Base lid: 94
LMC: 0
SM lid: 4
Capability mask: 0x02514868
Port GUID: 0x002590fffff76c71
Link layer: InfiniBand
[hmarne@cn37 ~]$ sudo /usr/sbin/ibportstate 94 1 disable
Initial CA PortInfo:
LinkState:…Active
PhysLinkState:…LinkUp
Lid:…94
SMLid:…4
LMC:…0
LinkWidthSupported:…1X or 4X
LinkWidthEnabled:…1X or 4X
LinkWidthActive:…4X
LinkSpeedSupported:…2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedEnabled:…2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedActive:…10.0 Gbps
LinkSpeedExtSupported:…14.0625 Gbps
LinkSpeedExtEnabled:…14.0625 Gbps
LinkSpeedExtActive:…14.0625 Gbps
Mkey:…
MkeyLeasePeriod:…0
ProtectBits:…0
StateChangeEnable:…0x00
LinkSpeedSupported:…0x01
LinkSpeedEnabled:…0x01
LinkSpeedActive:…0x00
Disable may be irreversible
After PortInfo set:
LinkState:…Active
PhysLinkState:…LinkUp
Lid:…94
SMLid:…4
LMC:…0
LinkWidthSupported:…1X or 4X
LinkWidthEnabled:…1X or 4X
LinkWidthActive:…4X
LinkSpeedSupported:…2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedEnabled:…2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedActive:…Extended speed
LinkSpeedExtSupported:…14.0625 Gbps
LinkSpeedExtEnabled:…14.0625 Gbps
LinkSpeedExtActive:…14.0625 Gbps
Mkey:…
MkeyLeasePeriod:…0
ProtectBits:…0
[hmarne@cn37 ~]$ ibstat
CA ‘mlx4_0’
CA type: MT4099
Number of ports: 1
Firmware version: 2.30.8000
Hardware version: 1
Node GUID: 0x002590fffff76c70
System image GUID: 0x002590fffff76c73
Port 1:
State: Down
Physical state: Disabled
Rate: 10
Base lid: 94
LMC: 0
SM lid: 4
Capability mask: 0x02514868
Port GUID: 0x002590fffff76c71
Link layer: InfiniBand
[hmarne@cn37 ~]$ sudo /usr/sbin/ibportstate 94 1 query | grep -i state
ibwarn: [11055] mad_rpc_open_port: can’t open UMAD port ((null):0)
/usr/sbin/ibportstate: iberror: failed: Failed to open ‘(null)’ port ‘0’
[hmarne@cn37 ~]$
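A sketch for checking the result of each step: rather than reading the whole ibstat dump, the filter below pulls out only the port State and Base lid. It assumes a single-port CA such as mlx4_0 above, and the function name ib_port_summary is made up for this example.

```shell
# Hypothetical helper: summarize port State and Base lid from ibstat output.
# Reads ibstat text on stdin, so it works on live output or a saved copy.
# Assumes one port per CA; with several ports, the last port's values win.
ib_port_summary() {
  awk -F': *' '
    # "State:" must start the line (after optional indentation),
    # so "Physical state:" is not picked up here.
    /^[[:space:]]*State:/    { state = $2 }
    /^[[:space:]]*Base lid:/ { lid = $2 }
    END { printf "state=%s lid=%s\n", state, lid }
  '
}
```

For the two ibstat listings above, piping ibstat into ib_port_summary would print state=Active lid=94 before the disable and state=Down lid=94 after it.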
Yes - but did you try my suggestion?
disable -
rmmod rdma_ucm
rmmod rdma_cm
enable -
modprobe rdma_cm
modprobe rdma_ucm
*Depending on your use case, you may not need to remove or add rdma_cm.
Thanks - Steve
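The module sequence above can be wrapped in a small helper. This is only a sketch: the function name rdma_modules and the DRY_RUN switch are assumptions added for illustration, and rmmod will refuse to unload a module that is still in use.

```shell
# Hypothetical wrapper around the disable/enable module sequence above.
# DRY_RUN=1 prints the commands instead of executing them (an assumption
# for safe testing). rdma_ucm sits on top of rdma_cm, so it is removed
# first and loaded last.
rdma_modules() {
  local action=$1
  local run=()
  [ "${DRY_RUN:-0}" = 1 ] && run=(echo)
  case $action in
    disable)
      "${run[@]}" rmmod rdma_ucm || return 1
      "${run[@]}" rmmod rdma_cm  || return 1
      ;;
    enable)
      "${run[@]}" modprobe rdma_cm  || return 1
      "${run[@]}" modprobe rdma_ucm || return 1
      ;;
    *)
      echo "usage: rdma_modules disable|enable" >&2
      return 2
      ;;
  esac
}
```

Running DRY_RUN=1 rdma_modules disable prints the two rmmod commands without executing them; drop DRY_RUN and run as root to apply them.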
If you want to disable and enable a remote CA port recoverably, you need to do it on the switch port that is its peer. If the CAs are connected back to back, the only way to re-enable the remote CA port is through some out-of-band mechanism.
– Hal
I think the openibd script exists on CentOS 7. Is it /etc/init.d/openibd? If it does, you can run restart, or stop and then start:
/etc/init.d/openibd restart
or
service openibd restart
This should do everything needed (including module reloading) for restarting.
– Hal
Stopping openibd requires removing the modules, which we can't do while Lustre is mounted. We want Lustre and FUSE to stay mounted during the operation; they just shouldn't be able to do any I/O.
Do you mean that disabling the corresponding switch port would let us enable it again later?
Yes, as long as you do this from a CA that is not being disabled, since the switch will still be reachable through its other ports. The only thing this does is disable the switch egress port that is the peer of the remote CA. You should then be able to re-enable it when desired.
I don't know whether this will accomplish what you need, as I'm not sure of all the Lustre interactions.
Can you try it and see what happens?
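The suggestion above can be sketched as a disable, pause, enable sequence run from some node other than the one being cut off, so the switch stays reachable. The function name toggle_peer_port, the default 30-second pause, and the DRY_RUN switch are assumptions for illustration; the LID and port in the usage line (switch LID 41, port 17) are the values identified for cn41 in this thread.

```shell
# Hypothetical helper: pause a host's IB link by toggling its peer switch port.
# Must run on a node OTHER than the host being cut off; otherwise the
# switch is no longer reachable for the "enable" step.
# Arguments: switch LID, switch port, pause in seconds (default 30, arbitrary).
# DRY_RUN=1 prints the commands instead of running them.
toggle_peer_port() {
  local lid=$1 port=$2 pause=${3:-30}
  local run=()
  [ "${DRY_RUN:-0}" = 1 ] && run=(echo)
  # Stop here if the disable fails, rather than sleeping for nothing.
  "${run[@]}" ibportstate "$lid" "$port" disable || return 1
  "${run[@]}" sleep "$pause"
  "${run[@]}" ibportstate "$lid" "$port" enable
}
```

For example, toggle_peer_port 41 17 60 would cut cn41 off for about a minute; with DRY_RUN=1 it only prints the three commands.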
Hi Hal,
Could you tell me how to do this? I need to bring down the peer switch port (port 17) of the remote CA (cn41):
Switch: 0xf452140300f83e50 MF0;ime-mlx216-ib-sw-01:SX6512/L12/U1:
41 1 ==( 4X 14.0625 Gbps Active/ LinkUp)==> 97 1 “sn31 HCA-2” ( )
41 2 ==( 4X 14.0625 Gbps Active/ LinkUp)==> 168 1 “sn52 HCA-1” ( )
41 3 ==( Down/ Polling)==> “” ( )
41 4 ==( 4X 14.0625 Gbps Active/ LinkUp)==> 100 1 “sn31 HCA-1” ( )
41 5 ==( Down/ Polling)==> “” ( )
41 6 ==( 4X 14.0625 Gbps Active/ LinkUp)==> 167 1 “sn53 HCA-2” ( )
41 7 ==( 4X 14.0625 Gbps Active/ LinkUp)==> 170 1 “cn43 HCA-1” ( )
41 8 ==( Down/ Polling)==> “” ( )
41 9 ==( Down/ Polling)==> “” ( )
41 10 ==( 4X 14.0625 Gbps Active/ LinkUp)==> 166 1 “sn52 HCA-2” ( )
41 11 ==( 4X 14.0625 Gbps Active/ LinkUp)==> 117 1 “cn42 HCA-1” ( )
41 12 ==( Down/ Polling)==> “” ( )
41 13 ==( Down/ Polling)==> “” ( )
41 14 ==( 4X 14.0625 Gbps Active/ LinkUp)==> 151 1 “sn08 HCA-2” ( )
41 15 ==( 4X 14.0625 Gbps Active/ LinkUp)==> 141 1 “sn22 HCA-4” ( )
41 16 ==( Down/ Polling)==> “” ( )
41 17 ==( 4X 14.0625 Gbps Active/ LinkUp)==> 120 1 “cn41 HCA-1” ( )
[root@cn41 ~]# ibstat
CA ‘mlx4_0’
CA type: MT4099
Number of ports: 1
Firmware version: 2.30.8000
Hardware version: 1
Node GUID: 0x002590fffff76da4
System image GUID: 0x002590fffff76da7
Port 1:
State: Active
Physical state: LinkUp
Rate: 56
Base lid: 120
LMC: 0
SM lid: 1
Capability mask: 0x02514868
Port GUID: 0x002590fffff76da5
Link layer: InfiniBand
[root@cn41 ~]#
I'm not familiar with the --Ca option to ibportstate. Try:
ibportstate 41 17 disable
from some machine other than cn41
Yes, you can identify switch peer ports with ibnetdiscover. In the example you've shown, the switch has GUID 0xf452140300f83e50 and LID 41, so you can address it by either switch LID or switch GUID (-G option).
I was able to disable and enable the port from a different node:
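The peer port lookup can also be scripted against a listing in the format shown earlier (switch LID, switch port, link status, peer LID, peer port, quoted node description), which resembles iblinkinfo output. This sketch assumes that layout with straight double quotes around the node description, rather than the curly quotes shown in this thread; the function name find_peer_port is hypothetical.

```shell
# Hypothetical helper: print "switch-LID port" for links whose peer node
# description starts with the given host name (e.g. "cn41 HCA-1").
# Reads the link listing on stdin; lines without "==>" are ignored.
find_peer_port() {
  local host=$1
  awk -v host="$host" '
    /==>/ {
      # The peer node description is the text between the first pair of double quotes.
      n = split($0, q, "\"")
      if (n < 3) next
      d = q[2]
      # Exact host match, or host followed by a space ("cn41 HCA-1"),
      # so that "cn4" does not match "cn41".
      if (d == host || index(d, host " ") == 1) {
        p = $2
        sub(/\[.*/, "", p)   # tolerate a "[...]" suffix glued to the port number
        print $1, p
      }
    }
  '
}
```

For the fabric shown above, piping that listing into find_peer_port cn41 would be expected to print 41 17, which are the two arguments ibportstate needs.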
ibportstate 41 17 enable
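After re-enabling, the link needs a moment to train and for the subnet manager to bring the port back to Active. A sketch of waiting for that, assuming the query output contains a LinkState line like the PortInfo dump earlier in the thread; enable_and_wait and its 60-second default timeout are made up for this example.

```shell
# Hypothetical helper: re-enable a switch port, then poll once per second
# until its LinkState reads Active or the timeout (seconds) runs out.
enable_and_wait() {
  local lid=$1 port=$2 timeout=${3:-60} waited=0 state
  ibportstate "$lid" "$port" enable >/dev/null || return 1
  while [ "$waited" -lt "$timeout" ]; do
    # Only the first LinkState line is used: it belongs to the queried port.
    state=$(ibportstate "$lid" "$port" query 2>/dev/null | awk '/LinkState:/ { print; exit }')
    case $state in
      *Active*)
        echo "port $port on LID $lid is Active after ${waited}s"
        return 0
        ;;
    esac
    sleep 1
    waited=$((waited + 1))
  done
  echo "port $port on LID $lid not Active after ${timeout}s" >&2
  return 1
}
```

Run from a node other than cn41, enable_and_wait 41 17 would re-enable cn41's switch port and report roughly how many seconds it took to become Active.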
Thanks Hal, now let me check how my applications behave.