Set up InfiniBand on Kubernetes

I have a k8s cluster whose worker nodes have Mellanox ConnectX-5 NICs. I would like to deploy some pods in k8s and run MPI in them.

I can see the Mellanox PCI device inside the container:

lspci | grep -i mellanox

21:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
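Seeing the device in lspci does not by itself show that its sysfs entries are reachable from the container. Below is a minimal sketch for checking that, assuming the PCI domain is 0000 when lspci omits it. The function names and the SYSFS_ROOT override are hypothetical, introduced only for illustration.

```shell
#!/bin/sh
# Expand a short lspci address (bus:dev.fn) to the full form used under sysfs.
# Assumption: the PCI domain is 0000 when lspci does not print one.
full_pci_addr() {
    case "$1" in
        *:*:*) printf '%s\n' "$1" ;;       # already includes a domain
        *)     printf '0000:%s\n' "$1" ;;  # prepend the assumed domain
    esac
}

# Report whether the device directory and its config file exist under sysfs.
# SYSFS_ROOT is a hypothetical override so the check can be tried on a scratch tree.
check_pci_sysfs() {
    root="${SYSFS_ROOT:-/sys}"
    dev="$root/bus/pci/devices/$(full_pci_addr "$1")"
    if [ -e "$dev/config" ]; then
        echo "present: $dev"
    else
        echo "missing: $dev"
        return 1
    fi
}
```

Calling `check_pci_sysfs 21:00.0` both inside the container and on the host would show whether the two views of the device differ.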

I enabled a security context in my pods (following this tutorial: https://community.mellanox.com/s/article/kubernetes-ipoib-sriov-networking-with-connectx4-connectx5):

securityContext:
  privileged: true
  capabilities:
    add: [ "IPC_LOCK" ]
resources:
  limits:
    rdma/hca: 1
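One way to see whether these settings actually expose RDMA devices to the pod is to look for the usual kernel entries from inside the container. This is only a sketch: it assumes the standard locations /sys/class/infiniband and /dev/infiniband, and the function name and its optional root argument are hypothetical.

```shell
#!/bin/sh
# List RDMA devices visible under a given root (default: the real filesystem root).
# The root argument is a hypothetical convenience for trying this on a scratch tree.
list_rdma_devices() {
    root="${1:-}"
    found=1
    # HCAs registered with the kernel RDMA stack appear here.
    for d in "$root"/sys/class/infiniband/*; do
        [ -e "$d" ] || continue
        echo "hca: $(basename "$d")"
        found=0
    done
    # User-space verbs character devices appear here.
    for d in "$root"/dev/infiniband/uverbs*; do
        [ -e "$d" ] || continue
        echo "uverbs: $(basename "$d")"
        found=0
    done
    # Returns 0 if at least one entry was found, 1 otherwise.
    return "$found"
}
```

Running `list_rdma_devices` with no argument inside the pod prints nothing and returns non-zero when neither location has entries.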

I tried installing the NIC drivers, but the installation fails because it can't query device 21:00.0:

./install --add-kernel-support

Detected sles15sp2 x86_64. Disabling installing 32bit rpms…

Note: This program will create mlnx-en TGZ for sles15sp2 under /tmp/mlnx-en-5.5-1.0.3.2-4.12.14-197.78_9.1.60-cray_shasta_c directory.

See log file /tmp/mlnx-en-5.5-1.0.3.2-4.12.14-197.78_9.1.60-cray_shasta_c/mlnx_iso.2159_logs/mlnx_ofed_iso.2159.log

Checking if all needed packages are installed…

Building mlnx-en RPMS . Please wait…

Creating metadata-rpms for 4.12.14-197.78_9.1.60-cray_shasta_c …

WARNING: If you are going to configure this package as a repository, then please note

WARNING: that it contains unsigned rpms, therefore, you need to disable the gpgcheck

WARNING: by setting ‘gpgcheck=0’ in the repository conf file.

Created /tmp/mlnx-en-5.5-1.0.3.2-4.12.14-197.78_9.1.60-cray_shasta_c/mlnx-en-5.5-1.0.3.2-sles15-ext.tgz

rpm -e --allmatches --nodeps cray-libxpmem-devel-headers cray-libxpmem0 cray-libxpmem-devel cray-xpmem

Installing /tmp/mlnx-en-5.5-1.0.3.2-4.12.14-197.78_9.1.60-cray_shasta_c/mlnx-en-5.5-1.0.3.2-sles15-ext

/tmp/mlnx-en-5.5-1.0.3.2-4.12.14-197.78_9.1.60-cray_shasta_c/mlnx-en-5.5-1.0.3.2-sles15-ext/install --force

Detected sles15sp2 x86_64. Disabling installing 32bit rpms…

Logs dir: /tmp/mlnx-en.818334.logs

General log file: /tmp/mlnx-en.818334.logs/general.log

This program will install the mlnx-en package on your machine.

Note that all other Mellanox, OEM, OFED, RDMA or Distribution IB packages will be removed.

Those packages are removed due to conflicts with mlnx-en, do not reinstall them.

Uninstalling MLNX_EN driver

Starting mlnx-en-5.5-1.0.3.2 installation …

Preparing… ########################################

mlnx-tools-5.2.0-0.55103 ########################################

Installing mlnx-en-utils 5.5 RPM

Preparing… ########################################

Updating / installing…

mlnx-en-utils-5.5-1.0.3.0.gf3bf963.sle########################################

Installing mlnx_en 5.5 RPM

Preparing… ########################################

Updating / installing…

mlnx_en-5.5-1.0.3.0.gf3bf963.kver.4.12########################################

depmod: WARNING: Ignored deprecated option -r

depmod: WARNING: could not open modules.order at /lib/modules/4.12.14-197.78_9.1.60-cray_shasta_c: No such file or directory

depmod: WARNING: could not open modules.builtin at /lib/modules/4.12.14-197.78_9.1.60-cray_shasta_c: No such file or directory

Installing mlnx-en-sources 5.5 RPM

Preparing… ########################################

Updating / installing…

mlnx-en-sources-5.5-1.0.3.0.gf3bf963.s########################################

Installing mlnx-en-doc 5.5 RPM

Preparing… ########################################

Updating / installing…

mlnx-en-doc-5.5-1.0.3.0.gf3bf963.sles1########################################

Installing user level RPMs:

Preparing… ########################################

ofed-scripts-5.5-OFED.5.5.1.0.3 ########################################

Preparing… ########################################

mstflint-4.16.0-1.55103 ########################################

Device (21:00.0):

21:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]

Link Width: x16

PCI Link Speed: 8GT/s

Installation finished successfully.

Preparing… ################################# [100%]

Updating / installing…

1:mlnx-fw-updater-5.5-1.0.3.2 ################################# [100%]

Initializing…

Attempting to perform Firmware update…

Querying Mellanox devices firmware …

Device #1:


Device Type: N/A

Part Number: –

Description:

PSID:

PCI Device Name: 21:00.0

Port1 MAC: N/A

Port1 GUID: N/A

Port2 MAC: N/A

Port2 GUID: N/A

Versions: Current Available

FW –

Status: Failed to open device


-E- Failed to query 21:00.0 device, error : No such file or directory. MFE_CR_ERROR

Log File: /tmp/xu4EnDC0Ub

Real log file: /tmp/mlnx-en.818334.logs/fw_update.log

Configuring /etc/security/limits.conf.

To load the new driver, run:

/etc/init.d/mlnx-en.d restart

I would appreciate some help finding out what problem the installer is running into.

Thank you very much.

Hi Masber,

Thank you for posting your question on our community. Based on the information provided, the driver installation has completed successfully:

" Device (21:00.0):

21:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]

Link Width: x16

PCI Link Speed: 8GT/s

Installation finished successfully.

To load the new driver, run:

/etc/init.d/mlnx-en.d restart"

The error is reported while checking for the latest firmware for the card. Thus, to understand why it failed to retrieve the device information, it would be great if you could share the following outputs:

# ethtool -i <interface>

# mst start

# mst status -v (to find the name of the MST device)

# flint -d <mst_device> q

In addition, please share the log /tmp/mlnx-en.818334.logs/fw_update.log
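If it is easier, the requested outputs could be gathered into a single file with a small script along these lines. This is a sketch only: the function names, the report path, and the interface and MST device arguments are placeholders that would need to be filled in by hand, and it assumes the ethtool, mst and flint tools are installed where it runs.

```shell
#!/bin/sh
# Run one command and append its output to a report file.
# If the tool is not installed, a note is written instead of failing.
run_and_log() {
    report="$1"; shift
    {
        echo "### $*"
        if command -v "$1" >/dev/null 2>&1; then
            "$@" 2>&1 || echo "(exit status: $?)"
        else
            echo "(command not found: $1)"
        fi
        echo
    } >> "$report"
}

# Collect the outputs requested above.
# Arguments (all placeholders): report file, network interface, MST device name.
collect_fw_diagnostics() {
    out="$1"; iface="$2"; mst_dev="$3"
    run_and_log "$out" ethtool -i "$iface"
    run_and_log "$out" mst start
    run_and_log "$out" mst status -v
    run_and_log "$out" flint -d "$mst_dev" q
}
```

The MST device name is only known after `mst status -v` has been run once, so the script would typically be run a second time with that name filled in.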

Thanks,

Namrata.

Hi Masber,

In addition to the above request, please note that, based on your contact details, we unfortunately did not find a support contract for your account in our database. If in-depth debugging is required to address your issue, please contact our contracts team to obtain a valid support contract. The contracts team can be reached at Networking-contracts@nvidia.com

Thanks,

Namrata.