Rocky 8 with MLNX_OFED, ib0 shows NO-CARRIER

I have two systems with similar hardware and almost exactly the same issue; both are running Rocky 8.
The first, which I’ll call node1, is frozen at Rocky 8.8 for use with Bright Cluster Manager and has an older Connect-IB card. It has MLNX_OFED version 4.9- installed.
The second, which I’ll call node2, runs Rocky 8.9 with a ConnectX-5. It has MLNX_OFED version 23.10- installed.
Both show a good connection via ibstatus, as shown below, and the link lights on both ends of the cables are lit.

Infiniband device 'mlx5_0' port 1 status:
	default gid:	 fe80:0000:0000:0000:f452:1403:0071:d800
	base lid:	 0x1
	sm lid:		 0x2
	state:		 4: ACTIVE
	phys state:	 5: LinkUp
	rate:		 56 Gb/sec (4X FDR)
	link_layer:	 InfiniBand

Infiniband device 'mlx5_0' port 1 status:
	default gid:	 fe80:0000:0000:0000:0c42:a103:00c0:af08
	base lid:	 0x13
	sm lid:		 0x2
	state:		 4: ACTIVE
	phys state:	 5: LinkUp
	rate:		 56 Gb/sec (4X FDR)
	link_layer:	 InfiniBand

Both have been configured via nmtui with the proper address and network info, with Datagram as the selected transport mode. However, when checking the link status via ip link, the following is shown.

ib0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 4092 qdisc fq_codel state DOWN mode DEFAULT group default qlen 256
ib0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 4092 qdisc mq state DOWN mode DEFAULT group default qlen 256
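For reference, the same datagram-mode setup done in nmtui can also be expressed with nmcli. This is just a sketch: the connection name, address, and subnet below are placeholders, not values from my actual network.

```shell
# Sketch: IPoIB connection in datagram mode via nmcli (equivalent to the
# nmtui setup described above). Connection name and address are examples.
nmcli connection add type infiniband ifname ib0 con-name ib0-dgram \
    infiniband.transport-mode datagram \
    ipv4.method manual ipv4.addresses 10.0.0.10/24
nmcli connection up ib0-dgram
```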

Any assistance is welcome.

Here is some info on the work I have done so far. Note that this is still not solved.

Via some searching I found this older post of mine. While the symptoms are very similar, there is no warning via dmesg about connected mode. That said, I did take a look at this as a possibility.
To do so I first found the documentation for the version of OFED I was using. (If anyone from NVIDIA is reading this, please fix your docs site: version 4.9 of the OFED drivers is not listed on the main docs page, and I had to find it via Google.) The IPoIB section mentions a number of things. Enhanced IPoIB is interesting but not usable due to the age of the card, so I ignored it. I then checked the mode setting in /etc/infiniband/openib.conf, since the first machine has a Connect-IB. The SET_IPOIB_CM option was set to auto, so I changed it to no and rebooted. This changed nothing.
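For anyone following along, the connected-mode check amounted to something like the sketch below. The edit is demonstrated on a sample line so it can be run anywhere; on the real system you would grep the file and use sed -i, then reboot (or restart openibd).

```shell
# Sketch: check and flip SET_IPOIB_CM in /etc/infiniband/openib.conf.
# Demonstrated on a sample line; the live-system commands are commented.
conf='SET_IPOIB_CM=auto'
new=$(printf '%s\n' "$conf" | sed 's/^SET_IPOIB_CM=.*/SET_IPOIB_CM=no/')
echo "$new"    # SET_IPOIB_CM=no
# On the live system:
#   grep SET_IPOIB_CM /etc/infiniband/openib.conf
#   sed -i 's/^SET_IPOIB_CM=.*/SET_IPOIB_CM=no/' /etc/infiniband/openib.conf
#   cat /sys/class/net/ib0/mode    # prints "datagram" or "connected"
```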

Still looking for ideas here so, again, any assistance is appreciated.

Hi Chris,
Thank you for posting your inquiry in the NVIDIA Developer Forum, Infrastructure and Networking section.

Let’s run ‘ip addr’ to make sure an IP address is assigned on the interface, and then try the following command to bring the NIC up.

ip link set ib0 up

Another good reference is the official documentation from Red Hat, which covers configuring IPoIB.

Best Regards,

As stated in the first post, the IP address was set via nmtui. I also checked the ifcfg scripts to verify that an IP address should be set, and have used ip addr to double-check that the system knows it has one.

As for the ip link set ib0 up command, it is unnecessary. The output of ip link/addr already includes this portion: ib0: <NO-CARRIER,BROADCAST,MULTICAST,UP>. The UP flag shows the interface is already administratively up, leaving that command with nothing to do.
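The distinction here is between the administrative UP flag and the link-layer carrier: a healthy link shows LOWER_UP, while this one shows NO-CARRIER. A quick way to see this, sketched against the line pasted earlier:

```shell
# NO-CARRIER together with UP means the interface is administratively up
# but has no link-layer carrier; a healthy link shows LOWER_UP instead.
line='ib0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 4092 qdisc mq state DOWN'
case "$line" in
  *NO-CARRIER*) echo "admin up, no carrier" ;;
  *LOWER_UP*)   echo "carrier present" ;;
esac
# On a live system the kernel exposes the same information directly:
#   cat /sys/class/net/ib0/operstate   # reads "down" here despite UP
```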

While the RHEL IPoIB page is helpful, all the basic steps had already been completed for a DATAGRAM mode connection, and trying CONNECTED mode did not change anything. The only other thing I had yet to do was set a PKEY. However, I have a default 0x0000 PKEY set up on the switch’s subnet manager for situations like these, so it should not be needed.
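As a side note, the P_Keys the HCA actually received from the subnet manager can be read from sysfs. A sketch, assuming device mlx5_0 and port 1 as in the ibstatus output above:

```shell
# List the P_Key table the SM programmed into the port. Unused slots
# read as 0x0000; a full-membership default partition key is 0xffff.
for f in /sys/class/infiniband/mlx5_0/ports/1/pkeys/*; do
    printf '%s: %s\n' "$(basename "$f")" "$(cat "$f")"
done
```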

After reading through a few manuals, including the one listed, I tried creating a sub-interface with the proper PKEY.
I used this command to create the sub-interface, the PKEY on the SM is 0x0002: echo 0x8002 > /sys/class/net/ib0/create_child
Then I verified that the interface had been created via ip link and it showed ib0.8002@ib0 as one of the interfaces.
Next I added an IP address to it with ip addr add <address> dev ib0.8002. So far everything looked good, with the IP address displayed and the interface still down.
However, once I brought the sub-interface up with ip link set ib0.8002 up, it failed the same way as the main interface, showing NO-CARRIER.
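The full sequence above, as a sketch. The address is a placeholder; 0x8002 is partition 0x0002 with the 0x8000 full-membership bit set, which the runnable line below demonstrates:

```shell
# Sketch of the P_Key child-interface sequence (placeholder address).
# 0x8002 = partition 0x0002 with the 0x8000 full-membership bit set:
printf '0x%04x\n' $(( 0x0002 | 0x8000 ))    # 0x8002
# On the live system:
#   echo 0x8002 > /sys/class/net/ib0/create_child
#   ip link show ib0.8002                   # should show ib0.8002@ib0
#   ip addr add <address>/24 dev ib0.8002   # placeholder address
#   ip link set ib0.8002 up
```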

I’ve tried something different: swapping the IB card in node1, the older Connect-IB, for a ConnectX-5. I did so by removing OFED, doing the hardware swap, then reinstalling OFED with version 5.8.
After the swap I tried two things. First, an EDR (100 Gbps) DAC, to see if that helped, but I didn’t get a link light on the cluster’s IB switch, which is FDR (56 Gbps). So I plugged it into our other cluster’s IB switch, which is EDR, and everything just worked. Next I tried the original FDR DAC again in the FDR switch, and the original issue showed up again.

So, to summarize so far: both a Connect-IB and a ConnectX-5 plugged into the same FDR switch and DAC fail to work with IPoIB, showing NO-CARRIER in the output of ip addr. Testing the ConnectX-5 with an EDR switch and DAC works, though. An EDR DAC on the FDR switch fails to get a link light, as expected.

I’d almost think this is a switch issue, except everything worked just fine under CentOS 7. Perhaps it is a subnet manager issue. Are different settings required when using an EDR card with an FDR switch?

So after all of these attempted fixes I tried something that should have been done early on: I rebooted the switch. Apparently this worked, though I have no idea why. My only guess is that the subnet manager needed to be restarted.
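For what it’s worth, if the subnet manager was the culprit, restarting just the SM rather than the whole switch might have sufficed. On a host-based SM that would look something like the following sketch; on a managed switch the SM is restarted from the switch CLI instead:

```shell
# Restart a host-based OpenSM instance (assumes opensm is the SM in use),
# then query the active subnet manager to confirm it came back.
systemctl restart opensm
sminfo    # from infiniband-diags: prints the SM's LID, GUID, and state
```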
I’ll mark this as the solution and move on.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.