mlx5_0/mlx5_1 down

Hello,

I am new to InfiniBand. I have a card connected to an InfiniBand switch (the cabling was done by a professional, so there is a reasonable chance the cables and positions are right).
I re-checked that the cable is connected, but the LED stays off for that particular port. Other machines (with different IB cards) connected to the same switch work fine.

My system is Rocky Linux 9.5 with a ConnectX-5 card. I tried to follow Red Hat's InfiniBand manual to set it up [1], but I can't find a way to bring the device up from the down state.

Below are various outputs that might help diagnose the problem. Any hints on what to check? Do I need changes in the BIOS, different firmware, a different cable, or a missing package?

Thanks!

$ mstconfig -d 42:00.0 q
Device #1:
----------
Device type:        ConnectX5
Name:               MCX556A-EDA_Ax_Bx
Description:        ConnectX-5 Ex VPI adapter card; EDR IB (100Gb/s) and 100GbE; dual-port QSFP28; PCIe4.0 x16; tall bracket; ROHS R6
....

ibstat shows:

ibstat
CA 'mlx5_0'
        CA type: MT4121
        Number of ports: 1
        Firmware version: 16.35.4030
        Hardware version: 0
        Node GUID: 0x6cb3110300880eda
        System image GUID: 0x6cb3110300880eda
        Port 1: 
                State: Down
                Physical state: Disabled
                Rate: 40
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00010000
                Port GUID: 0x6eb311fffe880eda
                Link layer: Ethernet
CA 'mlx5_1'
        CA type: MT4121
        Number of ports: 1
        Firmware version: 16.35.4030
        Hardware version: 0
        Node GUID: 0x6cb3110300880edb
        System image GUID: 0x6cb3110300880eda
        Port 1: 
                State: Down
                Physical state: Disabled
                Rate: 40
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00010000
                Port GUID: 0x6eb311fffe880edb
                Link layer: Ethernet

ibv_devinfo gives:

ibv_devinfo 
hca_id: mlx5_0
        transport:                      InfiniBand (0)
        fw_ver:                         16.35.4030
        node_guid:                      6cb3:1103:0088:0eda
        sys_image_guid:                 6cb3:1103:0088:0eda
        vendor_id:                      0x02c9
        vendor_part_id:                 4121
        hw_ver:                         0x0
        board_id:                       MT_0000000009
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_DOWN (1)
                        max_mtu:                4096 (5)
                        active_mtu:             1024 (3)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: mlx5_1
        transport:                      InfiniBand (0)
        fw_ver:                         16.35.4030
        node_guid:                      6cb3:1103:0088:0edb
        sys_image_guid:                 6cb3:1103:0088:0eda
        vendor_id:                      0x02c9
        vendor_part_id:                 4121
        hw_ver:                         0x0
        board_id:                       MT_0000000009
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_DOWN (1)
                        max_mtu:                4096 (5)
                        active_mtu:             1024 (3)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

dmesg shows:

$ dmesg | grep mlx 
[    1.989682] mlx5_core 0000:42:00.0: enabling device (0000 -> 0002)
[    1.989898] mlx5_core 0000:42:00.0: firmware version: 16.35.4030
[    1.989928] mlx5_core 0000:42:00.0: 126.024 Gb/s available PCIe bandwidth, limited by 16.0 GT/s PCIe x8 link at 0000:40:01.2 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)
[    2.345552] mlx5_core 0000:42:00.0: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps
[    2.345617] mlx5_core 0000:42:00.0: E-Switch: Total vports 2, per vport: max uc(128) max mc(2048)
[    2.349751] mlx5_core 0000:42:00.0: Port module event: module 0, Cable plugged
[    2.350222] mlx5_core 0000:42:00.0: mlx5_pcie_event:301:(pid 12): PCIe slot advertised sufficient power (75W).
[    2.577032] mlx5_core 0000:42:00.0: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0 basic)
[    2.577061] mlx5_core 0000:42:00.0: is_dpll_supported:213:(pid 601): Missing SyncE capability
[    2.580061] mlx5_core 0000:42:00.1: enabling device (0000 -> 0002)
[    2.580283] mlx5_core 0000:42:00.1: firmware version: 16.35.4030
[    2.580315] mlx5_core 0000:42:00.1: 126.024 Gb/s available PCIe bandwidth, limited by 16.0 GT/s PCIe x8 link at 0000:40:01.2 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)
[    2.953572] mlx5_core 0000:42:00.1: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps
[    2.953636] mlx5_core 0000:42:00.1: E-Switch: Total vports 2, per vport: max uc(128) max mc(2048)
[    2.958096] mlx5_core 0000:42:00.1: Port module event: module 1, Cable unplugged
[    2.958368] mlx5_core 0000:42:00.1: mlx5_pcie_event:301:(pid 1277): PCIe slot advertised sufficient power (75W).
[    3.172748] mlx5_core 0000:42:00.1: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0 basic)
[    3.172777] mlx5_core 0000:42:00.1: is_dpll_supported:213:(pid 601): Missing SyncE capability
[   11.340398] mlx5_core 0000:42:00.0 enp66s0f0np0: renamed from eth3
[   11.371323] mlx5_core 0000:42:00.1 enp66s0f1np1: renamed from eth5
[   17.612114] mlx5_core 0000:42:00.0 enp66s0f0np0: Link down
[   18.214302] mlx5_core 0000:42:00.1 enp66s0f1np1: Link down

mstflint gives:

$ mstflint -d 00:42:00.0 q
Image type:            FS4
FW Version:            16.35.4030
FW Release Date:       27.6.2024
Product Version:       16.35.4030
Rom Info:              type=UEFI version=14.29.15 cpu=AMD64
                       type=PXE version=3.6.902 cpu=AMD64
Description:           UID                GuidsNumber
Base GUID:             6cb3110300880eda        8
Orig Base GUID:        N/A                     8
Base MAC:              6cb311880eda            8
Orig Base MAC:         N/A                     8
Image VSD:             N/A
Device VSD:            N/A
PSID:                  MT_0000000009
Security Attributes:   N/A

ibaddr output:

$ ibaddr -C mlx5_0 -P 1
ibwarn: [969888] mad_rpc_open_port: client_register for mgmt 1 failed
ibaddr: iberror: failed: Failed to open 'mlx5_0' port '1'
[root@umaster rdma]# ibaddr -C mlx5_0 -P 0
ibwarn: [969889] mad_rpc_open_port: can't open UMAD port (mlx5_0:0)
ibaddr: iberror: failed: Failed to open 'mlx5_0' port '0'

[1] Configuring InfiniBand and RDMA networks | Red Hat Product Documentation

The MCX556A-EDA is set to Ethernet mode, so you have to change it to InfiniBand mode with the command

mlxconfig -d /dev/mst/mt41686_pciconf0 set LINK_TYPE_P1=1

With the command

mlxconfig -d /dev/mst/mt4123_pciconf0 q

you can see the list of parameters.
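
For example, to look at just the two port link-type parameters (1 = InfiniBand, 2 = Ethernet; the /dev/mst path here is only an example, use whatever mst status reports for your card), something like this should work:

mlxconfig -d /dev/mst/mt4123_pciconf0 query LINK_TYPE_P1 LINK_TYPE_P2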

I do not have mlxconfig installed here, but

 mstconfig -d 41:00.0 set LINK_TYPE_P2=1

works indeed. After a reboot I now see the Active state:

CA 'mlx5_0'
        CA type: MT4121
        Number of ports: 1
        Firmware version: 16.35.4030
        Hardware version: 0
        Node GUID: 0x6cb3110300880fa8
        System image GUID: 0x6cb3110300880fa8
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 14
                LMC: 0
                SM lid: 1
                Capability mask: 0xa659e848
                Port GUID: 0x6cb3110300880fa8
                Link layer: InfiniBand
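
Just to double-check from the firmware side as well (a rough check, assuming the mstconfig query output lists the LINK_TYPE_P* parameters for this card):

$ mstconfig -d 41:00.0 q | grep -i LINK_TYPE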

I expected that I would start to see an ib0 interface on my system, but it does not show up.

Looking at dmesg, I see some errors related to the workqueue and ib0:

$ dmesg | grep -E 'mlx|ipoib|ib0'
[    1.949268] mlx5_core 0000:41:00.0: enabling device (0000 -> 0002)
[    1.949469] mlx5_core 0000:41:00.0: firmware version: 16.35.4030
[    1.949498] mlx5_core 0000:41:00.0: 252.048 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x16 link)
[    2.201158] mlx5_core 0000:41:00.0: Port module event: module 0, Cable plugged
[    2.201431] mlx5_core 0000:41:00.0: mlx5_pcie_event:301:(pid 12): PCIe slot advertised sufficient power (75W).
[    2.209530] mlx5_core 0000:41:00.0: is_dpll_supported:213:(pid 703): Missing SyncE capability
[    2.212524] mlx5_core 0000:41:00.1: enabling device (0000 -> 0002)
[    2.212745] mlx5_core 0000:41:00.1: firmware version: 16.35.4030
[    2.212776] mlx5_core 0000:41:00.1: 252.048 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x16 link)
[    2.463871] mlx5_core 0000:41:00.1: Port module event: module 1, Cable unplugged
[    2.464136] mlx5_core 0000:41:00.1: mlx5_pcie_event:301:(pid 1232): PCIe slot advertised sufficient power (75W).
[    2.468469] mlx5_core 0000:41:00.1: is_dpll_supported:213:(pid 703): Missing SyncE capability
[   11.829453] workqueue: Failed to create a rescuer kthread for wq "ipoib_wq": -EINTR
[   11.829461] ib0: failed to allocate device WQ
[   11.829463] mlx5_0: failed to initialize device: ib0 port 1 (ret = -12)
[   11.829466] mlx5_0: couldn't register ipoib port 1; error -12
[   12.118740] workqueue: Failed to create a rescuer kthread for wq "mlx5e": -EINTR
[   12.131672] mlx5_1, 1: ipoib_intf_alloc failed -12

But this problem is really about Rocky Linux. A similar problem was reported at [Interface fails to initialize properly when driver is included in initramfs - Red Hat Customer Portal]. The bottom line is that the modprobe queue is killed by systemd when it switches from the initramfs to the booted system. Reloading the module after boot creates the IB interface:

modprobe -r mlx5_ib
modprobe mlx5_ib
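
If reloading the module after every boot gets annoying, a more permanent workaround (just a sketch based on the dracut omit_drivers option, not something I have verified, and only safe if you do not need IB during early boot, e.g. no root filesystem on IB) would be to keep mlx5_ib out of the initramfs and rebuild it:

# drop-in file name is hypothetical; omit_drivers is a standard dracut.conf option
echo 'omit_drivers+=" mlx5_ib "' > /etc/dracut.conf.d/no-mlx5-ib.conf
dracut -f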
