Hello,
I am new to InfiniBand. I have a card connected to an InfiniBand switch (the cabling was done by a professional, so the cables and port positions are most likely correct).
I re-checked that the cable is connected, but the LED stays off for that particular port. Other machines (with different IB cards) connected to the same switch work fine.
My system is Rocky Linux 9.5 with a ConnectX-5 card. I tried to follow Red Hat's InfiniBand manual to set it up [1], but I can't find a way to bring the device out of its down state.
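The only enable-like step I could find was bringing the network interface itself up (the name comes from the dmesg output further below, and per dmesg the link stays down regardless):
$ ip link set dev enp66s0f0np0 up    # netdev name from dmesg below; kernel still reports Link down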
Below are various outputs that might help diagnose the problem. Any hint on what to check? Do I need BIOS changes, different firmware, a different cable, or a missing package?
Thanks!
$ mstconfig -d 42:00.0 q
Device #1:
----------
Device type: ConnectX5
Name: MCX556A-EDA_Ax_Bx
Description: ConnectX-5 Ex VPI adapter card; EDR IB (100Gb/s) and 100GbE; dual-port QSFP28; PCIe4.0 x16; tall bracket; ROHS R6
....
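Since this is a VPI card (each port can run either InfiniBand or Ethernet), I suspect the port protocol setting might matter here. As far as I understand from the mstconfig/mlxconfig documentation, it can be queried like this (the LINK_TYPE parameter names are my reading of the docs, not something I have verified on this card):
$ mstconfig -d 42:00.0 query LINK_TYPE_P1 LINK_TYPE_P2    # 1 = IB, 2 = ETH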
ibstat shows:
$ ibstat
CA 'mlx5_0'
    CA type: MT4121
    Number of ports: 1
    Firmware version: 16.35.4030
    Hardware version: 0
    Node GUID: 0x6cb3110300880eda
    System image GUID: 0x6cb3110300880eda
    Port 1:
        State: Down
        Physical state: Disabled
        Rate: 40
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x00010000
        Port GUID: 0x6eb311fffe880eda
        Link layer: Ethernet
CA 'mlx5_1'
    CA type: MT4121
    Number of ports: 1
    Firmware version: 16.35.4030
    Hardware version: 0
    Node GUID: 0x6cb3110300880edb
    System image GUID: 0x6cb3110300880eda
    Port 1:
        State: Down
        Physical state: Disabled
        Rate: 40
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x00010000
        Port GUID: 0x6eb311fffe880edb
        Link layer: Ethernet
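What strikes me in this output is "Link layer: Ethernet" on both ports, together with "Physical state: Disabled". If I read the VPI documentation right, the ports are currently configured for Ethernet, which would explain why an EDR InfiniBand switch never brings the link up (and why the LED stays off). Is switching the port type to InfiniBand the right fix? My untested understanding is:
$ mstconfig -d 42:00.0 set LINK_TYPE_P1=1 LINK_TYPE_P2=1    # 1 = IB, 2 = ETH; asks for confirmation
# then reboot (or reload the mlx5 driver) so the new port type takes effect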
ibv_devinfo gives:
$ ibv_devinfo
hca_id: mlx5_0
    transport:          InfiniBand (0)
    fw_ver:             16.35.4030
    node_guid:          6cb3:1103:0088:0eda
    sys_image_guid:     6cb3:1103:0088:0eda
    vendor_id:          0x02c9
    vendor_part_id:     4121
    hw_ver:             0x0
    board_id:           MT_0000000009
    phys_port_cnt:      1
        port:   1
            state:          PORT_DOWN (1)
            max_mtu:        4096 (5)
            active_mtu:     1024 (3)
            sm_lid:         0
            port_lid:       0
            port_lmc:       0x00
            link_layer:     Ethernet

hca_id: mlx5_1
    transport:          InfiniBand (0)
    fw_ver:             16.35.4030
    node_guid:          6cb3:1103:0088:0edb
    sys_image_guid:     6cb3:1103:0088:0eda
    vendor_id:          0x02c9
    vendor_part_id:     4121
    hw_ver:             0x0
    board_id:           MT_0000000009
    phys_port_cnt:      1
        port:   1
            state:          PORT_DOWN (1)
            max_mtu:        4096 (5)
            active_mtu:     1024 (3)
            sm_lid:         0
            port_lid:       0
            port_lmc:       0x00
            link_layer:     Ethernet
dmesg shows:
$ dmesg | grep mlx
[ 1.989682] mlx5_core 0000:42:00.0: enabling device (0000 -> 0002)
[ 1.989898] mlx5_core 0000:42:00.0: firmware version: 16.35.4030
[ 1.989928] mlx5_core 0000:42:00.0: 126.024 Gb/s available PCIe bandwidth, limited by 16.0 GT/s PCIe x8 link at 0000:40:01.2 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)
[ 2.345552] mlx5_core 0000:42:00.0: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps
[ 2.345617] mlx5_core 0000:42:00.0: E-Switch: Total vports 2, per vport: max uc(128) max mc(2048)
[ 2.349751] mlx5_core 0000:42:00.0: Port module event: module 0, Cable plugged
[ 2.350222] mlx5_core 0000:42:00.0: mlx5_pcie_event:301:(pid 12): PCIe slot advertised sufficient power (75W).
[ 2.577032] mlx5_core 0000:42:00.0: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0 basic)
[ 2.577061] mlx5_core 0000:42:00.0: is_dpll_supported:213:(pid 601): Missing SyncE capability
[ 2.580061] mlx5_core 0000:42:00.1: enabling device (0000 -> 0002)
[ 2.580283] mlx5_core 0000:42:00.1: firmware version: 16.35.4030
[ 2.580315] mlx5_core 0000:42:00.1: 126.024 Gb/s available PCIe bandwidth, limited by 16.0 GT/s PCIe x8 link at 0000:40:01.2 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)
[ 2.953572] mlx5_core 0000:42:00.1: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps
[ 2.953636] mlx5_core 0000:42:00.1: E-Switch: Total vports 2, per vport: max uc(128) max mc(2048)
[ 2.958096] mlx5_core 0000:42:00.1: Port module event: module 1, Cable unplugged
[ 2.958368] mlx5_core 0000:42:00.1: mlx5_pcie_event:301:(pid 1277): PCIe slot advertised sufficient power (75W).
[ 3.172748] mlx5_core 0000:42:00.1: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0 basic)
[ 3.172777] mlx5_core 0000:42:00.1: is_dpll_supported:213:(pid 601): Missing SyncE capability
[ 11.340398] mlx5_core 0000:42:00.0 enp66s0f0np0: renamed from eth3
[ 11.371323] mlx5_core 0000:42:00.1 enp66s0f1np1: renamed from eth5
[ 17.612114] mlx5_core 0000:42:00.0 enp66s0f0np0: Link down
[ 18.214302] mlx5_core 0000:42:00.1 enp66s0f1np1: Link down
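So the driver sees a cable in module 0 (the connected port) and none in module 1, which matches the physical setup, yet the link never comes up and the interfaces get Ethernet-style names. Would it also be worth reading the transceiver EEPROM to rule out the module itself? Something like:
$ ethtool -m enp66s0f0np0    # dump the QSFP28 module EEPROM via the Ethernet netdev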
mstflint gives:
$ mstflint -d 00:42:00.0 q
Image type: FS4
FW Version: 16.35.4030
FW Release Date: 27.6.2024
Product Version: 16.35.4030
Rom Info:            type=UEFI version=14.29.15 cpu=AMD64
                     type=PXE version=3.6.902 cpu=AMD64
Description:         UID                  GuidsNumber
Base GUID:           6cb3110300880eda     8
Orig Base GUID:      N/A                  8
Base MAC:            6cb311880eda         8
Orig Base MAC:       N/A                  8
Image VSD: N/A
Device VSD: N/A
PSID: MT_0000000009
Security Attributes: N/A
ibaddr output:
$ ibaddr -C mlx5_0 -P 1
ibwarn: [969888] mad_rpc_open_port: client_register for mgmt 1 failed
ibaddr: iberror: failed: Failed to open 'mlx5_0' port '1'
[root@umaster rdma]# ibaddr -C mlx5_0 -P 0
ibwarn: [969889] mad_rpc_open_port: can't open UMAD port (mlx5_0:0)
ibaddr: iberror: failed: Failed to open 'mlx5_0' port '0'
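If I understand the tooling correctly, ibaddr and the other UMAD-based diagnostics only work on ports whose link layer is InfiniBand, so these failures would just be another symptom of the Ethernet port mode seen above. The link layer is also visible directly in sysfs:
$ cat /sys/class/infiniband/mlx5_0/ports/1/link_layer
Ethernet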
[1] Configuring InfiniBand and RDMA networks | Red Hat Product Documentation