IB Network becomes disabled

Hello,

I’m troubleshooting my previous problem (https://community.mellanox.com/s/question/0D51T00007TopADSAZ/openmpi-not-finding-the-device) step by step, this time on a different system. I’ve tried MOFED 5.0-2.1.8.0-ubuntu18.04-x86_64 and 5.0-1.0.0.0-ubuntu18.04-x86_64, but both seem to disable the IB network (screenshots from before and after are attached). After installing MOFED, I restart the driver (/etc/init.d/openibd restart), which unloads and reloads it as expected. However, no MST device gets loaded, and starting MST doesn’t help. When I then checked the system’s network interfaces, the IB interface was disabled. I don’t know why this is happening. Any suggestions on what might be disabling the IB network? Thanks.

Hi Arturo,

Can you share the following information?

  • Adapter type and P/N
  • Output of lspci | grep Mell
  • OS type and kernel version
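
The items requested above can be gathered in one pass. The following is only a minimal sketch: it assumes lspci (from pciutils) is installed and that the distribution ships /etc/os-release, and the function name collect_ib_info is hypothetical. The adapter part number is not covered here.

```shell
#!/bin/sh
# collect_ib_info: hypothetical helper that prints the details requested above.
collect_ib_info() {
    echo "== Mellanox PCI devices =="
    if command -v lspci >/dev/null 2>&1; then
        # -i makes the match case-insensitive; print a note when nothing matches
        lspci 2>/dev/null | grep -i mell || echo "(no Mellanox device listed by lspci)"
    else
        echo "(lspci not installed)"
    fi

    echo "== OS =="
    if [ -r /etc/os-release ]; then
        # PRETTY_NAME holds the human-readable distribution name
        grep '^PRETTY_NAME=' /etc/os-release || echo "(PRETTY_NAME not set)"
    else
        echo "(no /etc/os-release)"
    fi

    echo "== Kernel =="
    uname -r
}
```

Calling collect_ib_info with no arguments prints the three sections, which can then be pasted into a reply.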

From the output above, it looks like you are installing OFED in a VM and not on bare metal?

"ConnectX-5 VF"

If you would like to use SR-IOV, all information about the IB interface should be obtained on the bare-metal host, and you should attach the VF to the VM by following the SR-IOV guide below:

https://community.mellanox.com/s/article/howto-configure-sr-iov-for-connect-ib-connectx-4-with-kvm–infiniband-x

Thanks,

Samer

Hi Samer,

I didn’t copy the output from lspci so let me know if it’s really necessary. The information that I’ve got is from lshw:

*-network
     description: Infiniband controller
     product: MT27800 Family [ConnectX-5 Virtual Function]
     vendor: Mellanox Technologies
     physical id: 2
     bus info: pci@de53:00:02.0
     version: 00
     width: 64 bits
     clock: 33MHz
     capabilities: bus_master cap_list
     configuration: driver=mlx5_core latency=0
     resources: iomemory:f0-ef irq:0 memory:fe0000000-fe1ffffff
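
The product line in this output is what identifies the device as an SR-IOV virtual function rather than a physical adapter. A small sketch of that check, assuming the text comes from lshw or lspci and contains the same "ConnectX-... Virtual Function" wording; the function name is_virtual_function is hypothetical:

```shell
#!/bin/sh
# is_virtual_function: hypothetical helper. Reads lshw or lspci text on stdin
# and succeeds (exit 0) if any line names a ConnectX virtual function.
is_virtual_function() {
    grep -qiE 'ConnectX.*Virtual Function'
}
```

For example, `sudo lshw -class network | is_virtual_function && echo "running on a VF"` would print the message on the system described here.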

I tried with Ubuntu 18.04 (kernel 5.3.0-1020-azure), which is where the network became disabled. Earlier today, I tried a fresh CentOS 8.1 installation with kernel 4.18.0-147.8.1.el8_1. Even though the network doesn’t become disabled on this OS, MST still doesn’t load.

sudo mst status -v
MST modules:
    MST PCI module is not loaded
    MST PCI configuration module is not loaded
No MST devices were found or MST modules are not loaded.
You may need to run 'mst start' to load MST modules.

[arturo@baseMOFEDCentOS MLNX_OFED_LINUX-5.0-2.1.8.0-rhel8.1-x86_64]$ sudo mst start
Starting MST (Mellanox Software Tools) driver set
Loading MST PCI module - Success
Loading MST PCI configuration module - Success
Create devices
Unloading MST PCI module (unused) - Success
Unloading MST PCI configuration module (unused) - Success
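
The two outputs above carry the same signal: the status command reports no devices, and the start command unloads both modules again as unused. A rough sketch that classifies such output, matching only the message strings shown above; the function name mst_diagnose and the wording of its verdicts are hypothetical:

```shell
#!/bin/sh
# mst_diagnose: hypothetical helper. Reads `mst status` or `mst start` output
# on stdin and prints a one-line verdict based on the messages quoted above.
mst_diagnose() {
    awk '
        /No MST devices were found/   { none = 1 }      # from mst status
        /Unloading MST .*\(unused\)/  { unloaded++ }    # from mst start
        END {
            if (none)          print "status: no MST devices"
            else if (unloaded) print "start: " unloaded " module(s) unloaded as unused"
            else               print "no known failure message found"
        }'
}
```

It can be used as `sudo mst status -v | mst_diagnose` or `sudo mst start | mst_diagnose`.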

Thanks,

Arturo

Hi Arturo,

This is expected: you are running in a VM, so you cannot query the physical firmware configuration using MFT. It doesn’t mean that there are no devices.
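
Whether the VM still sees the device can be checked without MFT by reading sysfs. This is a sketch only, assuming the mlx5 driver has registered the VF as an RDMA device under /sys/class/infiniband; the function name list_ib_devices and its optional directory argument are hypothetical:

```shell
#!/bin/sh
# list_ib_devices: hypothetical helper. Prints each RDMA device registered with
# the kernel and the state of its ports, read from sysfs (no MFT required).
# $1 optionally overrides the sysfs directory, which is handy for testing.
list_ib_devices() {
    base=${1:-/sys/class/infiniband}
    found=0
    for dev in "$base"/*; do
        [ -d "$dev" ] || continue          # skips the unexpanded glob too
        found=1
        for state in "$dev"/ports/*/state; do
            [ -r "$state" ] || continue
            port=$(basename "$(dirname "$state")")
            echo "$(basename "$dev") port $port: $(cat "$state")"
        done
    done
    [ "$found" -eq 1 ] || echo "no RDMA devices under $base"
}
```

Run with no arguments, it prints one line per port, or a single message when the directory is empty or missing.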

You need to check via the hypervisor, not the VM.

Is the hypervisor where the adapter is physically installed Linux or Azure?

Thanks,

Samer

Hello Samer,

Your comment is very interesting, but how would you suggest doing so? Is there any documentation you can point me to? The (Microsoft) Azure documentation suggests two options for enabling IB: (i) install the IB driver published by MS, or (ii) install MOFED manually. I’m obviously trying the second one, but the Azure documentation doesn’t detail any pre- or post-installation steps, and I’d need some pointers to follow your suggestion.

Thanks,

Arturo

Hi Arturo,

I recommend opening a support case by sending an email to support@mellanox.com. Our support team will be happy to assist you in debugging the Azure environment.

Thanks,

Samer