Missing directory with SR-IOV on ConnectX-4 InfiniBand with CentOS8 and OFED 5.1.2.5.8.0

I’ve been following https://support.mellanox.com/s/article/howto-configure-sr-iov-for-connect-ib-connectx-4-with-kvm–infiniband-x so I already know about these instructions.

I’m on step 6 where the GUID, port and policy are set.

However the target ‘directory’ that holds the VF configurations is missing:


[root@hostname ~]# ls -l /sys/class/infiniband/mlx5_0/device/sriov

ls: cannot access '/sys/class/infiniband/mlx5_0/device/sriov': No such file or directory

[root@hostname ~]# ls -l /sys/class/infiniband/mlx5_0/device/

total 0

-r--r--r-- 1 root root 4096 Mar 22 04:10 aer_dev_correctable

-r--r--r-- 1 root root 4096 Mar 22 04:10 aer_dev_fatal

-r--r--r-- 1 root root 4096 Mar 22 04:10 aer_dev_nonfatal

-r--r--r-- 1 root root 4096 Mar 22 04:10 ari_enabled

-rw-r--r-- 1 root root 4096 Mar 22 04:10 broken_parity_status

-r--r--r-- 1 root root 4096 Mar 22 04:09 class

-rw-r--r-- 1 root root 4096 Mar 22 04:09 config

-r--r--r-- 1 root root 4096 Mar 22 04:10 consistent_dma_mask_bits

-r--r--r-- 1 root root 4096 Mar 22 04:09 current_link_speed

-r--r--r-- 1 root root 4096 Mar 22 04:09 current_link_width

-rw-r--r-- 1 root root 4096 Mar 22 04:10 d3cold_allowed

-r--r--r-- 1 root root 4096 Mar 22 04:09 device

-r--r--r-- 1 root root 4096 Mar 22 04:10 dma_mask_bits

lrwxrwxrwx 1 root root 0 Mar 22 04:09 driver -> ../../../../bus/pci/drivers/mlx5_core

-rw-r--r-- 1 root root 4096 Mar 22 04:10 driver_override

-rw-r--r-- 1 root root 4096 Mar 22 04:10 enable

drwxr-xr-x 3 root root 0 Mar 22 04:09 infiniband

drwxr-xr-x 4 root root 0 Mar 22 04:09 infiniband_mad

drwxr-xr-x 3 root root 0 Mar 22 04:09 infiniband_verbs

lrwxrwxrwx 1 root root 0 Mar 22 04:10 iommu -> ../../../virtual/iommu/dmar4

lrwxrwxrwx 1 root root 0 Mar 22 04:10 iommu_group -> ../../../../kernel/iommu_groups/23

-r--r--r-- 1 root root 4096 Mar 22 04:09 irq

drwxr-xr-x 2 root root 0 Mar 22 04:10 link

-r--r--r-- 1 root root 4096 Mar 22 04:10 local_cpulist

-r--r--r-- 1 root root 4096 Mar 22 04:09 local_cpus

-r--r--r-- 1 root root 4096 Mar 22 04:10 max_link_speed

-r--r--r-- 1 root root 4096 Mar 22 04:10 max_link_width

-r--r--r-- 1 root root 4096 Mar 22 04:10 modalias

-rw-r--r-- 1 root root 4096 Mar 22 04:10 msi_bus

drwxr-xr-x 2 root root 0 Mar 22 04:09 msi_irqs

drwxr-xr-x 3 root root 0 Mar 22 04:09 net

-rw-r--r-- 1 root root 4096 Mar 22 04:09 numa_node

-r--r--r-- 1 root root 4096 Mar 22 04:10 pools

drwxr-xr-x 2 root root 0 Mar 22 04:10 power

drwxr-xr-x 3 root root 0 Mar 22 04:09 ptp

--w--w---- 1 root root 4096 Mar 22 04:10 remove

--w------- 1 root root 4096 Mar 22 04:10 rescan

--w------- 1 root root 4096 Mar 22 04:10 reset

-r--r--r-- 1 root root 4096 Mar 22 04:09 resource

-rw------- 1 root root 33554432 Mar 22 04:10 resource0

-rw------- 1 root root 33554432 Mar 22 04:10 resource0_wc

-r--r--r-- 1 root root 4096 Mar 22 04:10 revision

-rw------- 1 root root 1048576 Mar 22 04:10 rom

-rw-r--r-- 1 root root 4096 Mar 22 04:10 sriov_drivers_autoprobe

-rw-r--r-- 1 root root 4096 Mar 22 04:09 sriov_numvfs

-r--r--r-- 1 root root 4096 Mar 22 04:10 sriov_offset

-r--r--r-- 1 root root 4096 Mar 22 04:10 sriov_stride

-r--r--r-- 1 root root 4096 Mar 22 04:10 sriov_totalvfs

-r--r--r-- 1 root root 4096 Mar 22 04:10 sriov_vf_device

lrwxrwxrwx 1 root root 0 Mar 22 04:09 subsystem -> ../../../../bus/pci

-r--r--r-- 1 root root 4096 Mar 22 04:09 subsystem_device

-r--r--r-- 1 root root 4096 Mar 22 04:09 subsystem_vendor

-rw-r--r-- 1 root root 4096 Mar 22 04:09 uevent

-r--r--r-- 1 root root 4096 Mar 22 04:09 vendor

lrwxrwxrwx 1 root root 0 Mar 22 04:10 virtfn0 -> ../0000:18:00.1

lrwxrwxrwx 1 root root 0 Mar 22 04:10 virtfn1 -> ../0000:18:00.2

lrwxrwxrwx 1 root root 0 Mar 22 04:10 virtfn10 -> ../0000:18:01.3

lrwxrwxrwx 1 root root 0 Mar 22 04:10 virtfn11 -> ../0000:18:01.4

lrwxrwxrwx 1 root root 0 Mar 22 04:10 virtfn12 -> ../0000:18:01.5

lrwxrwxrwx 1 root root 0 Mar 22 04:10 virtfn13 -> ../0000:18:01.6

lrwxrwxrwx 1 root root 0 Mar 22 04:10 virtfn14 -> ../0000:18:01.7

lrwxrwxrwx 1 root root 0 Mar 22 04:10 virtfn15 -> ../0000:18:02.0

lrwxrwxrwx 1 root root 0 Mar 22 04:10 virtfn2 -> ../0000:18:00.3

lrwxrwxrwx 1 root root 0 Mar 22 04:10 virtfn3 -> ../0000:18:00.4

lrwxrwxrwx 1 root root 0 Mar 22 04:10 virtfn4 -> ../0000:18:00.5

lrwxrwxrwx 1 root root 0 Mar 22 04:10 virtfn5 -> ../0000:18:00.6

lrwxrwxrwx 1 root root 0 Mar 22 04:10 virtfn6 -> ../0000:18:00.7

lrwxrwxrwx 1 root root 0 Mar 22 04:10 virtfn7 -> ../0000:18:01.0

lrwxrwxrwx 1 root root 0 Mar 22 04:10 virtfn8 -> ../0000:18:01.1

lrwxrwxrwx 1 root root 0 Mar 22 04:10 virtfn9 -> ../0000:18:01.2

-rw------- 1 root root 0 Mar 22 04:10 vpd
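
For anyone wanting to run the same check across several devices or hosts, a small shell function along these lines might help. This is only a sketch: `check_sriov_dirs` and its optional root argument are names made up for illustration (they are not part of MLNX_OFED), and it assumes the sysfs layout shown in the listing above.

```shell
#!/bin/sh
# check_sriov_dirs is a hypothetical helper, not part of MLNX_OFED.
# It prints one line per InfiniBand device saying whether the
# device/sriov directory exists.
# $1 optionally overrides the sysfs root (an assumption made here so the
# function can be pointed at a test directory); the default is
# /sys/class/infiniband.
check_sriov_dirs() {
    root="${1:-/sys/class/infiniband}"
    for dev in "$root"/*; do
        # If the glob matched nothing, "$dev" is the literal pattern: skip it.
        [ -e "$dev" ] || continue
        name=$(basename "$dev")
        if [ -d "$dev/device/sriov" ]; then
            echo "$name: sriov present"
        else
            echo "$name: sriov missing"
        fi
    done
}
```

Calling `check_sriov_dirs` with no argument inspects the live system. On the host shown above it would report `mlx5_0` as missing the directory.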

We have a subnet manager running elsewhere, but we have tried with opensmd both running and stopped, with the same result.

The base system is CentOS8:


[root@hostname ~]# uname -a

Linux asn011 4.18.0-240.15.1.el8_3.x86_64 #1 SMP Mon Mar 1 17:16:16 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

The driver version is:


[root@hostname ~]# rpm -qa | grep mlnx-ofed

mlnx-ofed51-modules-5.1.2.5.8.0.x86_64

We are not going to upgrade the drivers to 5.2.x as we have other systems on CentOS7 and older drivers with working SR-IOV, so we’re just trying to work through the CentOS8 changes first.

Hi,

Please refer to the updated documentation of SR-IOV in the latest MLNX_OFED 5.2-2.2.0.0

https://docs.mellanox.com/pages/viewpage.action?pageId=43718746

Thanks,

Samer

This answer was not helpful. I had already read this document, as well as the one specific to the driver I am using; they are among the first hits when googling SR-IOV and InfiniBand issues. It did not contain any troubleshooting instructions that would have diagnosed the problem. The document I linked previously actually contains a few more helpful steps (such as how to reset the card).

I have since solved this issue: the hosts were booting the wrong kernel and were still using the CentOS ‘inbox’ driver. Figuring that out was the key.

Response from a ‘good’ host:

[root@node001 ~]# modinfo mlx5_core | grep signer

signer: Mellanox Technologies signing key

Response from a ‘bad’ host with missing SR-IOV directory:

[root@node002 ~]# modinfo mlx5_core | grep signer

signer: CentOS kernel signing key

These hosts boot using PXE, so I checked the host that serves their boot images and fixed it so that it serves the correct kernel. While this part is out of scope for Mellanox support, I do think it’s reasonable to expect them to provide troubleshooting steps for establishing whether a host is running the correct kernel and loading the correct kernel modules, rather than just ‘RTFM’.
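
The signer check above could be wrapped in a small script for use on many nodes. What follows is a rough sketch rather than anything official: `classify_signer` and `report_mlx5_driver` are hypothetical names, and the mapping covers only the two signer strings seen in this thread, so other distributions or unsigned builds will show up as "unknown".

```shell
#!/bin/sh
# classify_signer is a hypothetical helper. It maps the "signer" string
# reported by modinfo to a short label, based only on the two values
# seen in this thread (an assumption; other signers exist).
classify_signer() {
    case "$1" in
        *Mellanox*) echo "ofed" ;;
        *CentOS*)   echo "inbox" ;;
        *)          echo "unknown" ;;
    esac
}

# report_mlx5_driver prints the running kernel release and the signer of
# the mlx5_core module that modinfo finds for that kernel.
# "modinfo -F signer" prints only the signer field; it is empty for an
# unsigned module, and modinfo errors are silenced here.
report_mlx5_driver() {
    signer=$(modinfo -F signer mlx5_core 2>/dev/null)
    echo "kernel: $(uname -r)"
    echo "mlx5_core signer: ${signer:-none} ($(classify_signer "$signer"))"
}
```

Running `report_mlx5_driver` on each node and comparing the kernel line against the release the OFED modules were built for should make a mismatch like the one described here easier to spot. Note that modinfo inspects the module file on disk for the running kernel, not the copy already loaded in memory.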