I’m seeing the same problem with the 515 drivers from the CentOS 7 repos as well. I can’t find any documentation about the 515 branch. It looks like 510 is supposed to be the latest production branch, but somehow 515 got added to the repos on May 4.
I just updated the drivers on one of my on-site GPU nodes and the 515 drivers do work here. My initial test was also in AWS, so it looks like there may be a problem with AWS or the specific cards that they’re using. This is the card that I have in my on-site node that works:
I think I finally figured out what is wrong here. When Nvidia released the R515 drivers, they created two different packages for the kernel modules. One package contains the new open source kernel modules, which only support Turing, Ampere and later GPUs. The other contains the proprietary kernel modules that Nvidia has been shipping for years, which cover all the architectures supported in previous versions, including Volta. The problem is that if you install any package that depends on the kernel modules, the open source version gets pulled in by default. The workaround is to explicitly install the kmod-nvidia-latest-dkms package if you need support for Volta or earlier GPUs. The package with the open source modules, which doesn’t work with these cards, is kmod-nvidia-open-dkms.
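To make the package choice concrete, here is a minimal Python sketch of the rule described above. The two package names come from this thread; the helper names (`choose_kmod_package`, `install_command`) and the architecture set are my own assumptions for illustration, not anything Nvidia ships. The sketch only builds the yum command and does not run it, and it falls back to the proprietary package for any architecture it does not recognize.

```python
from typing import List

# Architectures the post above says the open source kernel modules support.
# "and later" is deliberately not modeled: any name not listed here falls back
# to the proprietary package, which the post says covers every architecture.
OPEN_MODULE_ARCHITECTURES = {"turing", "ampere"}

OPEN_PACKAGE = "kmod-nvidia-open-dkms"
PROPRIETARY_PACKAGE = "kmod-nvidia-latest-dkms"


def choose_kmod_package(architecture: str) -> str:
    """Return a kernel-module package name for a GPU architecture (hypothetical helper)."""
    if architecture.strip().lower() in OPEN_MODULE_ARCHITECTURES:
        return OPEN_PACKAGE
    return PROPRIETARY_PACKAGE


def install_command(architecture: str) -> List[str]:
    """Build (but do not run) a yum command that names the package explicitly."""
    return ["yum", "install", "-y", choose_kmod_package(architecture)]


if __name__ == "__main__":
    # A V100 is a Volta card, so the proprietary modules are selected.
    print(" ".join(install_command("Volta")))
```

Naming the kmod package explicitly is the point: it keeps the dependency resolver from choosing the open source modules on its own.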
It would be great if Nvidia could document this fact when referencing their package repositories.
I am getting the same problem on RHEL 7.9 with a V100, using the kmod-nvidia-latest-dkms package.
lspci shows that I have a V100, but the driver does not recognize the PCI ID that lspci lists as belonging to a V100.
modprobe -vv nvidia … fail card not supported…
And this is a VM, so that may be a problem too.
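When comparing what lspci reports against what the driver accepts, it can help to pull out the numeric vendor:device ID. The following is a rough Python sketch, assuming output in the format of `lspci -nn`; the function name `nvidia_pci_ids` is hypothetical and the sample line is illustrative, not copied from any machine in this thread.

```python
import re
from typing import List, Tuple

NVIDIA_VENDOR_ID = "10de"  # PCI vendor ID assigned to NVIDIA

# Matches the "[vendor:device]" pair that `lspci -nn` appends to each line.
# The class code (e.g. "[0302]") has no colon, so it is not matched.
PCI_ID_PATTERN = re.compile(r"\[([0-9a-f]{4}):([0-9a-f]{4})\]", re.IGNORECASE)


def nvidia_pci_ids(lspci_output: str) -> List[Tuple[str, str]]:
    """Return (vendor, device) ID pairs for NVIDIA devices found in `lspci -nn` text."""
    ids = []
    for line in lspci_output.splitlines():
        match = PCI_ID_PATTERN.search(line)
        if match and match.group(1).lower() == NVIDIA_VENDOR_ID:
            ids.append((match.group(1).lower(), match.group(2).lower()))
    return ids


if __name__ == "__main__":
    # Illustrative line in `lspci -nn` format (assumed, not from the thread).
    sample = (
        "00:1e.0 3D controller [0302]: NVIDIA Corporation GV100GL "
        "[Tesla V100 SXM2 16GB] [10de:1db1] (rev a1)"
    )
    print(nvidia_pci_ids(sample))
```

In a VM with a passed-through or virtual GPU, the device ID seen by the guest may differ from the one a bare-metal card reports, so it is worth checking the ID from inside the guest.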
The nvidia-kmod-common* package is missing; this could be a problem.
This now looks like a firmware problem.
Maybe it is a GRID 9.1 issue. We need to update that anyway.