ConnectX-6 card LED not turning on and ports are always down

Hello,

  1. I purchased this card from FS.com to run RDMA: NVIDIA Mellanox MCX653106A-ECAT ConnectX®-6 InfiniBand/VPI Adapter Card, 100GbE/HDR100/EDR, Dual-Port QSFP56, PCIe 4.0 x16, Tall Bracket.
    I am using the MCPM200 cable to connect the server to another system that has SFP on that end.

  2. I installed the drivers from the Mellanox OFED (MLNX_OFED) software download page on NVIDIA Developer.

ofed_info -s output is MLNX_OFED_LINUX-23.10-1.1.9.0

test@dev-server:~$ ibdev2netdev 
mlx5_0 port 1 ==> ibp66s0f0 (Down)
mlx5_1 port 1 ==> ibp66s0f1 (Down)
test@dev-server:~$ lspci | grep "Mellanox"
42:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
42:00.1 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
test@dev-server:~$ ibv_devinfo
hca_id:	mlx5_0
	transport:			InfiniBand (0)
	fw_ver:				20.39.2048
	node_guid:			58a2:e103:00b9:3414
	sys_image_guid:			58a2:e103:00b9:3414
	vendor_id:			0x02c9
	vendor_part_id:			4123
	hw_ver:				0x0
	board_id:			MT_0000000224
	phys_port_cnt:			1
		port:	1
			state:			PORT_DOWN (1)
			max_mtu:		4096 (5)
			active_mtu:		4096 (5)
			sm_lid:			0
			port_lid:		65535
			port_lmc:		0x00
			link_layer:		InfiniBand

hca_id:	mlx5_1
	transport:			InfiniBand (0)
	fw_ver:				20.39.2048
	node_guid:			58a2:e103:00b9:3415
	sys_image_guid:			58a2:e103:00b9:3414
	vendor_id:			0x02c9
	vendor_part_id:			4123
	hw_ver:				0x0
	board_id:			MT_0000000224
	phys_port_cnt:			1
		port:	1
			state:			PORT_DOWN (1)
			max_mtu:		4096 (5)
			active_mtu:		4096 (5)
			sm_lid:			0
			port_lid:		65535
			port_lmc:		0x00
			link_layer:		InfiniBand
  3. I tried setting the port to Ethernet, but it did not work:
test@dev-server:~$ mlxconfig -d /dev/mst/mt4123_pciconf0 set LINK_TYPE_P1=2
-E- Failed to open the device
test@dev-server:~$ mlxconfig -d /dev/mst/mt4123_pciconf0 reset

 Reset configuration for device /dev/mst/mt4123_pciconf0? (y/n) [n] : y
Applying... Failed!
-E- Failed to open the device
  4. mst commands:
test@dev-server:~$ sudo mst status
MST modules:
------------
    MST PCI module is not loaded
    MST PCI configuration module is not loaded

PCI Devices:
------------

42:00.0

test@dev-server:~$ mlxfwmanager
-E- No devices found or specified, mst might be stopped, run 'mst start' to load MST modules
test@dev-server:~$ sudo mst start
Starting MST (Mellanox Software Tools) driver set
Loading MST PCI module - Success
Loading MST PCI configuration module - Success
Create devices
Unloading MST PCI module (unused) - Success
test@dev-server:~$ mlxfwmanager
-E- No devices found or specified, mst might be stopped, run 'mst start' to load MST modules
test@dev-server:~$ sudo mst status
MST modules:
------------
    MST PCI module is not loaded
    MST PCI configuration module loaded

MST devices:
------------
/dev/mst/mt4123_pciconf0         - PCI configuration cycles access.
                                   domain:bus:dev.fn=0000:42:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
                                   Chip revision is: 00

test@dev-server:~$ mlxconfig -d /dev/mst/mt4123_pciconf0 set LINK_TYPE_P1=2
-E- Failed to open the device

Motherboard used:
ASUS Pro WS WRX80E-SAGE SE WIFI II
AMD WRX80

Hi chris.hoffman,

Thanks for posting your inquiry to the NVIDIA Developer Forums.

Please note that this motherboard and adapter combination has not been tested. We do not test with consumer-grade system boards. Furthermore, this device is intended for use in server-grade systems with adequate airflow. Installing this adapter in a system without adequate airflow (see specifications) can cause the device to overheat and shut down, and can eventually damage it.

You may see entries in dmesg related to temperature if this is the case.
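
As a rough way to look for such entries, the kernel log can be filtered for temperature-related wording. This is a minimal sketch, not an official procedure: `filter_thermal` is a hypothetical helper name, and the keyword list is an assumption, since the exact message text depends on the driver and firmware version.

```shell
#!/bin/sh
# filter_thermal: hypothetical helper; reads log text on stdin and prints
# only the lines that mention temperature-related keywords (case-insensitive).
# The keyword list is an assumption, not the driver's documented wording.
filter_thermal() {
    grep -i -E 'temperature|thermal|overheat'
}

# Example (may need root on systems that restrict dmesg):
#   sudo dmesg | filter_thermal
```

No output from the filter does not prove the card is cool enough. It only means none of the assumed keywords appeared in the log.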

To triage this issue, we recommend the following:

  1. Restart the OFED driver - /etc/init.d/openibd restart
  2. Restart MFT - mst restart
  3. Cold boot/power cycle the system.
  4. Ensure that a subnet manager is running. A subnet manager is required for InfiniBand connections to be established.
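
After working through the steps above, the port state can be rechecked from the `ibdev2netdev` output. This is a minimal sketch, assuming the line format shown in the original post (`mlx5_0 port 1 ==> ibp66s0f0 (Down)`). `list_down_ports` is a hypothetical helper name, not part of MLNX_OFED.

```shell
#!/bin/sh
# list_down_ports: hypothetical helper; reads `ibdev2netdev` output on stdin
# and prints "<device> <netdev>" for every port reported as (Down).
# Assumes the state is the last field and the netdev name the one before it.
list_down_ports() {
    awk '$NF == "(Down)" { print $1, $(NF-1) }'
}

# Example (requires MLNX_OFED on the host):
#   ibdev2netdev | list_down_ports
```

Empty output would mean that no port is reported as down.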

See if the problem persists. If so:

  1. Reseat the adapter, or try using the adapter in a different PCIe slot.
  2. Try a known-good adapter in this system.
  3. Contact your vendor (FS.com) for support, unless you have purchased support entitlement with NVIDIA for this device - in which case, please open a support ticket with NVIDIA Enterprise Experience (https://enterprise-support.nvidia.com/s/create-case).
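
After reseating the adapter or moving it to another slot, it is worth confirming that the card still enumerates on the PCIe bus. This is a minimal sketch, assuming the `lspci` output format shown in the original post, where this dual-port card appears as two functions (42:00.0 and 42:00.1). `count_mellanox_functions` is a hypothetical helper name.

```shell
#!/bin/sh
# count_mellanox_functions: hypothetical helper; reads `lspci` output on
# stdin and prints the number of lines that mention Mellanox.
count_mellanox_functions() {
    grep -c 'Mellanox'
}

# Example:
#   lspci | count_mellanox_functions
```

For the card in this thread, the expected count is 2. A lower number after a slot change would suggest a seating or slot problem rather than a driver issue.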

Again, this product is not tested with consumer-grade motherboards, so this device may or may not work with your system (even if airflow requirements are satisfied). We highly recommend installing this device in a server-grade system.

Best regards,
NVIDIA Enterprise Experience
