I have a Kubernetes (k8s) cluster whose worker nodes have Mellanox ConnectX-5 NICs. I would like to deploy some pods on it and run MPI inside them.
I can see the Mellanox PCI device inside the container:
lspci | grep -i mellanox
21:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
I enabled a security context in my pods (following this tutorial: https://community.mellanox.com/s/article/kubernetes-ipoib-sriov-networking-with-connectx4-connectx5):
securityContext:
  privileged: true
  capabilities:
    add: [ "IPC_LOCK" ]
resources:
  limits:
    rdma/hca: 1
I tried installing the NIC drivers, but the installation fails during the firmware update step because it cannot query device 21:00.0:
./install --add-kernel-support
Detected sles15sp2 x86_64. Disabling installing 32bit rpms…
Note: This program will create mlnx-en TGZ for sles15sp2 under /tmp/mlnx-en-5.5-1.0.3.2-4.12.14-197.78_9.1.60-cray_shasta_c directory.
See log file /tmp/mlnx-en-5.5-1.0.3.2-4.12.14-197.78_9.1.60-cray_shasta_c/mlnx_iso.2159_logs/mlnx_ofed_iso.2159.log
Checking if all needed packages are installed…
Building mlnx-en RPMS . Please wait…
Creating metadata-rpms for 4.12.14-197.78_9.1.60-cray_shasta_c …
WARNING: If you are going to configure this package as a repository, then please note
WARNING: that it contains unsigned rpms, therefore, you need to disable the gpgcheck
WARNING: by setting ‘gpgcheck=0’ in the repository conf file.
Created /tmp/mlnx-en-5.5-1.0.3.2-4.12.14-197.78_9.1.60-cray_shasta_c/mlnx-en-5.5-1.0.3.2-sles15-ext.tgz
rpm -e --allmatches --nodeps cray-libxpmem-devel-headers cray-libxpmem0 cray-libxpmem-devel cray-xpmem
Installing /tmp/mlnx-en-5.5-1.0.3.2-4.12.14-197.78_9.1.60-cray_shasta_c/mlnx-en-5.5-1.0.3.2-sles15-ext
/tmp/mlnx-en-5.5-1.0.3.2-4.12.14-197.78_9.1.60-cray_shasta_c/mlnx-en-5.5-1.0.3.2-sles15-ext/install --force
Detected sles15sp2 x86_64. Disabling installing 32bit rpms…
Logs dir: /tmp/mlnx-en.818334.logs
General log file: /tmp/mlnx-en.818334.logs/general.log
This program will install the mlnx-en package on your machine.
Note that all other Mellanox, OEM, OFED, RDMA or Distribution IB packages will be removed.
Those packages are removed due to conflicts with mlnx-en, do not reinstall them.
Uninstalling MLNX_EN driver
Starting mlnx-en-5.5-1.0.3.2 installation …
Preparing… ########################################
mlnx-tools-5.2.0-0.55103 ########################################
Installing mlnx-en-utils 5.5 RPM
Preparing… ########################################
Updating / installing…
mlnx-en-utils-5.5-1.0.3.0.gf3bf963.sle########################################
Installing mlnx_en 5.5 RPM
Preparing… ########################################
Updating / installing…
mlnx_en-5.5-1.0.3.0.gf3bf963.kver.4.12########################################
depmod: WARNING: Ignored deprecated option -r
depmod: WARNING: could not open modules.order at /lib/modules/4.12.14-197.78_9.1.60-cray_shasta_c: No such file or directory
depmod: WARNING: could not open modules.builtin at /lib/modules/4.12.14-197.78_9.1.60-cray_shasta_c: No such file or directory
Installing mlnx-en-sources 5.5 RPM
Preparing… ########################################
Updating / installing…
mlnx-en-sources-5.5-1.0.3.0.gf3bf963.s########################################
Installing mlnx-en-doc 5.5 RPM
Preparing… ########################################
Updating / installing…
mlnx-en-doc-5.5-1.0.3.0.gf3bf963.sles1########################################
Installing user level RPMs:
Preparing… ########################################
ofed-scripts-5.5-OFED.5.5.1.0.3 ########################################
Preparing… ########################################
mstflint-4.16.0-1.55103 ########################################
Device (21:00.0):
21:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
Link Width: x16
PCI Link Speed: 8GT/s
Installation finished successfully.
Preparing… ################################# [100%]
Updating / installing…
1:mlnx-fw-updater-5.5-1.0.3.2 ################################# [100%]
Initializing…
Attempting to perform Firmware update…
Querying Mellanox devices firmware …
Device #1:
Device Type: N/A
Part Number: –
Description:
PSID:
PCI Device Name: 21:00.0
Port1 MAC: N/A
Port1 GUID: N/A
Port2 MAC: N/A
Port2 GUID: N/A
Versions: Current Available
FW –
Status: Failed to open device
-E- Failed to query 21:00.0 device, error : No such file or directory. MFE_CR_ERROR
Log File: /tmp/xu4EnDC0Ub
Real log file: /tmp/mlnx-en.818334.logs/fw_update.log
Configuring /etc/security/limits.conf.
To load the new driver, run:
/etc/init.d/mlnx-en.d restart
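Since the firmware query reports "No such file or directory" for 21:00.0, one check that might narrow this down is whether the container exposes the sysfs entries for that PCI address. The sketch below is only a diagnostic idea, not a confirmed explanation. It assumes PCI domain 0000 and that the firmware tool needs the device's config and resource0 files; both are assumptions on my part. The helper names pci_sysfs_path and report_path are made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: expand a short PCI address (bus:dev.fn) to its sysfs
# directory. Assumes PCI domain 0000 when no domain is given.
pci_sysfs_path() {
    case "$1" in
        *:*:*) printf '/sys/bus/pci/devices/%s\n' "$1" ;;       # domain already present
        *)     printf '/sys/bus/pci/devices/0000:%s\n' "$1" ;;  # prepend assumed domain
    esac
}

# Hypothetical helper: print whether a path exists, without failing the script.
report_path() {
    if [ -e "$1" ]; then
        printf 'present  %s\n' "$1"
    else
        printf 'missing  %s\n' "$1"
    fi
}

# Default to the address from the installer log; pass another one as $1.
dev_dir=$(pci_sysfs_path "${1:-21:00.0}")

report_path "$dev_dir"            # PCI device directory in sysfs
report_path "$dev_dir/config"     # PCI configuration space
report_path "$dev_dir/resource0"  # first BAR (assumed to be what the tool maps)
report_path /dev/infiniband       # RDMA character devices, if the RDMA stack is loaded
```

Running the same script on the worker node and inside the pod and comparing the two outputs would show whether anything is missing only in the container.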
I would appreciate some help finding out what problem the installer is running into.
Thank you very much.