Infiniband connection failure

I have an hp cluster using InfiniBand, running CentOS 7. Previously there was no issue with the connection. A cable failed and I replaced it; I now get a link light again (good). My issue is that the ib0 connection I once had is no longer there. I tried to recreate it but I cannot start the connection; there seems to be some sort of naming issue.

[root@server ~]# lspci -Qvv | grep Mellanox
Product Name: Mellanox ConnectX-6 Single Port VPI HDR100 QSFP Adapter
[VE] Vendor specific: NMVMellanox Technologies, Inc.
[root@server ~]# lspci -Qvv | grep d8:00.0
d8:00.0 Class 0207: Device 15b3:101b
[root@server ~]# msflint -d d8:00.0 q
bash: msflint: command not found…
[root@server ~]# mstflint -d d8:00.0 q
Image type: FS4
FW Version: 20.35.1012
FW Release Date: 28.10.2022
Product Version: 20.35.1012
Rom Info: type=UEFI version=14.28.15 cpu=AMD64
type=PXE version=3.6.804 cpu=AMD64
Description: UID GuidsNumber
Base GUID: 1c34da0300734862 4
Base MAC: 1c34da734862 4
Image VSD: N/A
Device VSD: N/A
PSID: DEL0000000013
Security Attributes: secure-fw

ib0 is the IPoIB interface name.

What is the output of ‘ifconfig -a’ or ‘ip link show’?

More on this:
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/networking_guide/sec-configuring_ipoib
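
As a rough cross-check alongside ‘ip link show’, the sketch below lists interfaces whose link-layer type is InfiniBand (ARPHRD_INFINIBAND, value 32 in /sys/class/net/<interface>/type). The function name list_ib_netdevs and its optional directory argument are made up for illustration and are not part of any tool.

```shell
# List network interfaces whose link-layer type is InfiniBand (ARPHRD type 32).
# list_ib_netdevs is a hypothetical helper; the optional argument overrides
# the sysfs directory, which is only useful for testing.
list_ib_netdevs() {
    netdir="${1:-/sys/class/net}"
    for dev in "$netdir"/*; do
        # Skip entries that do not expose a readable type file.
        [ -r "$dev/type" ] || continue
        if [ "$(cat "$dev/type")" = "32" ]; then
            basename "$dev"
        fi
    done
}

# Usage: prints nothing when no IPoIB interface exists.
# list_ib_netdevs
```

An empty result would be consistent with ib0 being absent from the interface list, whatever its name.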

Hello,

Thanks for the response. It does not show up there. I can see the card in the iDRAC, and the link lights come on when it is plugged in. Results are below:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 34:48:ed:f4:12:7c brd ff:ff:ff:ff:ff:ff
inet 10.141.250.10/16 brd 10.141.255.255 scope global noprefixroute em1
valid_lft forever preferred_lft forever
inet6 fe80::5bcd:a3a8:ebde:90a/64 scope link noprefixroute
valid_lft forever preferred_lft forever
3: em2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 34:48:ed:f4:12:7d brd ff:ff:ff:ff:ff:ff
inet 172.16.16.15/20 brd 172.16.31.255 scope global noprefixroute em2
valid_lft forever preferred_lft forever
inet6 fe80::1540:fe0c:6114:b0c0/64 scope link noprefixroute
valid_lft forever preferred_lft forever
4: em3: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
link/ether 34:48:ed:f4:12:7e brd ff:ff:ff:ff:ff:ff
5: em4: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
link/ether 34:48:ed:f4:12:7f brd ff:ff:ff:ff:ff:ff
6: virbr0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
link/ether 52:54:00:47:5b:59 brd ff:ff:ff:ff:ff:ff
inet 192.168.122.1/24 brd 192.168.122.255 scope global virbr0
valid_lft forever preferred_lft forever
7: virbr0-nic: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast master virbr0 state DOWN group default qlen 1000
link/ether 52:54:00:47:5b:59 brd ff:ff:ff:ff:ff:ff

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 34:48:ed:f4:12:7c brd ff:ff:ff:ff:ff:ff
3: em2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 34:48:ed:f4:12:7d brd ff:ff:ff:ff:ff:ff
4: em3: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
link/ether 34:48:ed:f4:12:7e brd ff:ff:ff:ff:ff:ff
5: em4: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
link/ether 34:48:ed:f4:12:7f brd ff:ff:ff:ff:ff:ff
6: virbr0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
link/ether 52:54:00:47:5b:59 brd ff:ff:ff:ff:ff:ff
7: virbr0-nic: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast master virbr0 state DOWN mode DEFAULT group default qlen 1000
link/ether 52:54:00:47:5b:59 brd ff:ff:ff:ff:ff:ff

Can you please run

‘mst status -v’

If the mst command is not found on the server, you will need to install the MFT tools.

mst status -v
MST modules:

MST PCI module is not loaded
MST PCI configuration module loaded

PCI devices:

DEVICE_TYPE MST PCI RDMA NET NUMA
ConnectX6(rev:0) /dev/mst/mt4123_pciconf0 d8:00.0

OK, so where do I go from here? I swapped the card and it still doesn’t show in the network devices. Should I try to reload the drivers? I am not sure why this would have happened or where to go next.

You are not seeing those because you haven’t run ‘mst start’.

I had tried this previously. It did not bring up the device; please see below. Do you have any other suggestions?
Bright Cluster Manager License expiration date: 07 Jul 2023
[root@master01 ~]# ssh ss01
root@ss01’s password:
Last login: Sat May 6 10:51:31 2023 from 172.16.16.11
[root@server ~]# mst start
Starting MST (Mellanox Software Tools) driver set
Loading MST PCI module - Success
[warn] mst_pciconf is already loaded, skipping
Create devices
Unloading MST PCI module (unused) - Success
[root@ss01 ~]# mst status -v
MST modules:

MST PCI module is not loaded
MST PCI configuration module loaded

PCI devices:

DEVICE_TYPE MST PCI RDMA NET NUMA
ConnectX6(rev:0) /dev/mst/mt4123_pciconf0 d8:00.0 1

[root@server1 ~]# nmcli d
DEVICE TYPE STATE CONNECTION
em2 ethernet connected em2
em1 ethernet connected em1
virbr0 bridge connected virbr0
em3 ethernet unavailable –
em4 ethernet unavailable –
lo loopback unmanaged –
virbr0-nic tun unmanaged –
[root@server ~]# nmcli c
NAME UUID TYPE DEVICE
em2 8f4a9a9e-91ae-4abc-9249-08d5c919e9e9 ethernet em2
em1 259a73b9-2364-4bb0-a151-3e0b57538298 ethernet em1
virbr0 b079822c-5f8f-4466-a05c-ce887467b12f bridge virbr0
em3 8d79a661-5786-4590-8051-c38bee32a3ca ethernet –
em4 1a232114-a4fa-44a4-8513-0706fbef1f80 ethernet –
MLX1_ib0 cad27371-bdf3-4dd6-938c-0232550a9f5f infiniband –
[root@server ~]#
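
The RDMA and NET columns in the mst status output above are empty, which may indicate that no kernel driver has created an RDMA device or network interface for the card. A minimal sketch for checking which driver, if any, is bound to the PCI device follows. pci_driver_of is a hypothetical helper, and the 0000 PCI domain prefix is an assumption about this host; for a ConnectX-6 the expected driver is mlx5_core.

```shell
# Print the kernel driver bound to a PCI device, or "none" if no driver is bound.
# pci_driver_of is a hypothetical helper; the second argument overrides the
# sysfs directory, which is only useful for testing.
pci_driver_of() {
    addr="$1"                            # full address, e.g. 0000:d8:00.0
    root="${2:-/sys/bus/pci/devices}"
    link="$root/$addr/driver"
    # The driver entry is a symlink whose target ends in the driver name.
    if [ -L "$link" ]; then
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}

# Usage (the 0000 domain prefix is an assumption):
# pci_driver_of 0000:d8:00.0
```

‘lspci -k -s d8:00.0’ reports the same information on its "Kernel driver in use" line.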

Hi nathan.backing,

Please try removing the IPoIB module and adding it back again:
rmmod ib_ipoib
modprobe ib_ipoib

If that doesn’t work, please use the link below to register on the Enterprise Support Portal with a valid entitlement:

https://enterpriseproductregistration.nvidia.com/?LicType=COMMERCIAL&ProductFamily=Networking-HWSupport

After you complete the registration process by email, you will be able to log in and access the portal.

We are looking forward to hearing from you.

Thanks,
Yuying
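
To confirm that the module reload suggested above took effect, one option is to check /proc/modules for the relevant modules. The sketch below is illustrative only: module_loaded is a hypothetical helper, and the module names in the usage comment (mlx5_core, mlx5_ib, ib_ipoib) are the ones normally involved with a ConnectX-6 in InfiniBand mode.

```shell
# Return success if the named kernel module is listed in /proc/modules.
# module_loaded is a hypothetical helper; the second argument overrides the
# file to read, which is only useful for testing.
module_loaded() {
    mod="$1"
    file="${2:-/proc/modules}"
    # The first whitespace-separated field of each line is the module name.
    awk -v m="$mod" '$1 == m { found = 1 } END { exit !found }' "$file"
}

# Usage:
# for m in mlx5_core mlx5_ib ib_ipoib; do
#     if module_loaded "$m"; then echo "$m loaded"; else echo "$m missing"; fi
# done
```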

I removed and re-added the IPoIB module. That did not work, but I have one other observation: when I change the device to Ethernet mode, it shows up in the device list without any issue. I don’t know if that sheds any more light on things. I am registered and tried to open a case, but I was told that the card was sold through Dell, so I should contact Dell. Dell just wanted to send me a new card, and that did not work, so I am looking at the forums for any additional ideas that may help.
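
Since the card appears when switched to Ethernet mode, it may be worth confirming which port type the firmware is currently configured for. The sketch below reads the output of an mlxconfig query and prints the LINK_TYPE_P1 setting. port1_link_type is a hypothetical helper, and it assumes the query output lists each setting as a name followed by its value on one line (for example "LINK_TYPE_P1 IB(1)"); the device path is the one shown by mst status earlier in this thread.

```shell
# Read `mlxconfig ... query` output on stdin and print the LINK_TYPE_P1 value.
# Assumption: each setting appears as "NAME VALUE" on its own line.
# Exits non-zero if the setting is not found.
port1_link_type() {
    awk '$1 == "LINK_TYPE_P1" { print $2; found = 1 } END { exit !found }'
}

# Usage:
# mlxconfig -d /dev/mst/mt4123_pciconf0 query | port1_link_type
# To request InfiniBand for port 1 (1 = IB, 2 = ETH):
# mlxconfig -d /dev/mst/mt4123_pciconf0 set LINK_TYPE_P1=1
```

A changed link type normally takes effect only after a reboot or firmware reset.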

If you lost your IB interface, the first thing to check is the output of ibstat, which should show whether your card is online at all.

Once you see the port with State: Active and Physical state: LinkUp, you can proceed with the IPoIB steps: loading the module and probably restarting the network service.
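
The ibstat check described above can be scripted roughly as follows. ib_port_ready is a hypothetical helper that reads ibstat output on stdin; it assumes each port section starts with a "Port N:" line and lists "State:" before "Physical state:".

```shell
# Read `ibstat` output on stdin; succeed only if at least one port reports
# "State: Active" followed by "Physical state: LinkUp".
# Assumption: every port section begins with a "Port N:" line.
ib_port_ready() {
    awk '
        /^[ \t]*Port [0-9]+:/                { active = 0 }   # new port section
        /^[ \t]*State:[ \t]*Active/          { active = 1 }
        /^[ \t]*Physical state:[ \t]*LinkUp/ { if (active) ready = 1 }
        END { exit !ready }
    '
}

# Usage:
# if ibstat | ib_port_ready; then
#     echo "port is Active/LinkUp; continue with IPoIB setup"
# else
#     echo "no Active/LinkUp port; check cabling, port type, and subnet manager"
# fi
```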

Hi nathan.backing,

Thanks for the update.
We recommend upgrading the MLNX_OFED driver to the latest release and testing again.
If that does not work, please replace the card as Dell advised.

Good luck.

Thanks,
Yuying
