nvidia-kmod-396.26-2.el7.x86_64 refuses to recognise my 7 Tesla M2090 cards on CentOS-7.5.1804

I recently updated my CentOS-7 kernel and kmod-nvidia, and since then the driver refuses to load with the following message:

modprobe -v nvidia

insmod /lib/modules/3.10.0-862.3.3.el7.x86_64/extra/nvidia.ko.xz
modprobe: ERROR: could not insert ‘nvidia’: No such device

cat /sys/devices/pci0000:80/0000:80:03.0/0000:85:00.0/0000:86:04.0/0000:88:00.0/modalias

pci:v000010DEd00001091sv000010DEsd00000887bc03sc02i00

modinfo nvidia

filename: /lib/modules/3.10.0-862.3.3.el7.x86_64/extra/nvidia.ko.xz
alias: char-major-195-*
version: 396.26
supported: external
license: NVIDIA
retpoline: Y
rhelversion: 7.5
srcversion: AE579930EF8F20A66867263
alias: pci:v000010DEd00000E00sv*sd*bc04sc80i00*
alias: pci:v000010DEd*sv*sd*bc03sc02i00*
alias: pci:v000010DEd*sv*sd*bc03sc00i00*
depends: ipmi_msghandler,i2c-core
vermagic: 3.10.0-862.3.3.el7.x86_64 SMP mod_unload modversions
parm: NVreg_Mobile:int
parm: NVreg_ResmanDebugLevel:int
parm: NVreg_RmLogonRC:int
parm: NVreg_ModifyDeviceFiles:int
parm: NVreg_DeviceFileUID:int
parm: NVreg_DeviceFileGID:int
parm: NVreg_DeviceFileMode:int
parm: NVreg_UpdateMemoryTypes:int
parm: NVreg_InitializeSystemMemoryAllocations:int
parm: NVreg_UsePageAttributeTable:int
parm: NVreg_MapRegistersEarly:int
parm: NVreg_RegisterForACPIEvents:int
parm: NVreg_CheckPCIConfigSpace:int
parm: NVreg_EnablePCIeGen3:int
parm: NVreg_EnableMSI:int
parm: NVreg_TCEBypassMode:int
parm: NVreg_UseThreadedInterrupts:int
parm: NVreg_EnableStreamMemOPs:int
parm: NVreg_EnableBacklightHandler:int
parm: NVreg_EnableUserNUMAManagement:int
parm: NVreg_MemoryPoolSize:int
parm: NVreg_IgnoreMMIOCheck:int
parm: NVreg_RegistryDwords:charp
parm: NVreg_RegistryDwordsPerDevice:charp
parm: NVreg_RmMsg:charp
parm: NVreg_AssignGpus:charp

=> The second alias matches the card’s modalias, so the module should load!
What’s wrong? (The latest driver version offered for the M2090 on the driver download page is 396.26.)
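The alias check above can be reproduced by hand. This is only a rough sketch: it assumes the alias has the usual wildcard form (`d*sv*sd*`), and `alias_matches` is a made-up helper name that uses the shell's own glob matching to stand in for what modprobe does.

```shell
#!/bin/sh
# Hypothetical helper: return 0 if a device modalias matches a module
# alias glob pattern. 'case' does the glob comparison for us.
alias_matches() {
    modalias=$1
    pattern=$2
    case $modalias in
        $pattern) return 0 ;;   # unquoted on purpose: treated as a pattern
        *)        return 1 ;;
    esac
}

# modalias of one M2090, as read from sysfs above
modalias='pci:v000010DEd00001091sv000010DEsd00000887bc03sc02i00'
# second alias of the nvidia module (wildcard form is an assumption)
pattern='pci:v000010DEd*sv*sd*bc03sc02i00*'

if alias_matches "$modalias" "$pattern"; then
    echo "alias matches"
else
    echo "no match"
fi
```

A match only says which module modprobe will pick for the device. It says nothing about whether the module accepts the device once it is loaded.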

kmod-nvidia was installed from:

cat /etc/yum.repos.d/cuda.repo

[cuda]
name=cuda
baseurl=http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64
enabled=1
gpgcheck=1
gpgkey=http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/7fa2af80.pub

cat /proc/version

Linux version 3.10.0-862.3.3.el7.x86_64 (builder@kbuilder.dev.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-28) (GCC) ) #1 SMP Fri Jun 15 04:15:27 UTC 2018

find /sys/devices |grep modalias|xargs grep 10DE
/sys/devices/pci0000:00/0000:00:03.0/0000:0f:00.0/0000:10:04.0/0000:12:00.0/modalias:pci:v000010DEd00001091sv000010DEsd00000887bc03sc02i00
/sys/devices/pci0000:00/0000:00:03.0/0000:0f:00.0/0000:10:08.0/0000:11:00.0/modalias:pci:v000010DEd00001091sv000010DEsd00000887bc03sc02i00
/sys/devices/pci0000:00/0000:00:07.0/0000:0a:00.0/0000:0b:08.0/0000:0c:00.0/modalias:pci:v000010DEd00001091sv000010DEsd00000887bc03sc02i00
/sys/devices/pci0000:80/0000:80:03.0/0000:85:00.0/0000:86:04.0/0000:88:00.0/modalias:pci:v000010DEd00001091sv000010DEsd00000887bc03sc02i00
/sys/devices/pci0000:80/0000:80:03.0/0000:85:00.0/0000:86:08.0/0000:87:00.0/modalias:pci:v000010DEd00001091sv000010DEsd00000887bc03sc02i00
/sys/devices/pci0000:80/0000:80:07.0/0000:81:00.0/0000:82:04.0/0000:84:00.0/modalias:pci:v000010DEd00001091sv000010DEsd00000887bc03sc02i00
/sys/devices/pci0000:80/0000:80:07.0/0000:81:00.0/0000:82:08.0/0000:83:00.0/modalias:pci:v000010DEd00001091sv000010DEsd00000887bc03sc02i00
:-(
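For anyone who wants to repeat the device listing without grepping every modalias file, here is a small sketch that reads the `vendor` file of each PCI device instead. `list_nvidia_devices` is a hypothetical helper, and the sysfs directory is a parameter so it can be pointed at a test tree.

```shell
#!/bin/sh
# Hypothetical helper: print the PCI address of every device whose
# vendor ID is NVIDIA (0x10de). Defaults to the live sysfs tree.
list_nvidia_devices() {
    root=${1:-/sys/bus/pci/devices}
    for dev in "$root"/*; do
        # Skip entries without a readable vendor file
        [ -r "$dev/vendor" ] || continue
        if [ "$(cat "$dev/vendor")" = "0x10de" ]; then
            basename "$dev"
        fi
    done
}

# Usage on a real system:
#   list_nvidia_devices | wc -l
```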

Ok,

I’ve made progress, and what I’ve found confirms that proprietary software is crap :-(

1/ Searching for M2090 drivers on this page gives version 396.26, which seems wrong: CRAP
http://www.nvidia.com/download/driverResults.aspx/134379/en-us

2/ Looking at dmesg I see this:
NVRM: The NVIDIA Tesla M2090 GPU installed in this system is
NVRM: supported through the NVIDIA 390.xx Legacy drivers. Please
NVRM: visit http://www.nvidia.com/object/unix.html for more
NVRM: information. The 396.26 NVIDIA driver will ignore
NVRM: this GPU. Continuing probe…
=> This contradicts the result of the driver download page!!! => CRAP
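Since "No such device" from modprobe gives no hint, the kernel log is the place to look. Below is a minimal sketch that pulls the legacy branch number out of NVRM messages like the ones above. `legacy_branch` is a made-up name, and the pattern only assumes the message wording quoted here.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and print the
# legacy driver branch named in an NVRM message, if there is one.
legacy_branch() {
    grep 'NVRM:' |
        sed -n 's/.*NVIDIA \([0-9][0-9]*\)\.xx Legacy drivers.*/\1/p'
}

# Usage on the affected machine:
#   dmesg | legacy_branch
```

With the messages quoted above this prints 390, which is the branch the driver itself says the card needs.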

3/ The NVIDIA CentOS7/RHEL7 packaging is CRAP: it upgrades the nvidia-kmod package to 396.26, breaking hardware support, and there is no nvidia-390-kmod package: CRAP

4/ The v396.26 module alias matches the M2090, so the kernel tries to load the module even though the driver then refuses to handle the card: CRAP

5/ The v396.24 supported-hardware list at http://www.nvidia.com/object/unix.html doesn’t include the M2090, while searching for a Linux driver for the M2090 offers v396.24 for download!: CRAP

SO:
=> Remove the alias from the module so the kernel won’t uselessly try to load the v396 nvidia module for the M2090
=> UPDATE your driver download page!!! Offer a driver that ACTUALLY supports the M2090
=> CREATE an nvidia-390-kmod package or equivalent so the driver won’t be upgraded to something that doesn’t support the installed hardware!!! Be professional, or open-source your product so talented people can fix your poor support :-<
(and by the way, fix your kmod package with a Requires on the kernel headers, so the dkms build won’t fail)

It’s not acceptable that upgrading a package within the same OS release breaks hardware support!
With your poor packaging, yum update leads to broken hardware support. Incredible when you see the price of so-called “professional” products! This is anything but professional!
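Until such a package exists, one possible workaround on the user side is to keep yum from touching the driver packages from that repo. The sketch below appends an `exclude=` line (a standard per-repository yum option) to a repo file. `pin_repo_packages` is a hypothetical helper, and the package glob is an assumption that has to be adapted to the packages actually installed.

```shell
#!/bin/sh
# Hypothetical helper: append an exclude line to a yum repo file so that
# packages matching the glob are skipped by 'yum update'.
# Assumes the file holds a single repo section, like the cuda.repo above.
pin_repo_packages() {
    repo_file=$1   # e.g. /etc/yum.repos.d/cuda.repo
    pattern=$2     # package glob (assumption, adapt as needed)
    # Leave the file alone if it already has an exclude line
    if grep -q '^exclude=' "$repo_file"; then
        return 0
    fi
    printf 'exclude=%s\n' "$pattern" >> "$repo_file"
}

# Usage (as root); the glob is only an example:
#   pin_repo_packages /etc/yum.repos.d/cuda.repo 'nvidia-kmod*'
```

This only stops further upgrades from that repo. It does not by itself bring back a working 390.xx driver.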