Hey guys! After installing the Nvidia 515 drivers on Ubuntu 18.04 with a V100 on board,
nvidia-smi stopped recognizing the GPU:
No devices were found
Drivers were not upgraded, but rather installed on a fresh Ubuntu 18 (on AWS) machine
Drivers were installed from the Nvidia apt repository (/compute/cuda/repos/ubuntu1804/x86_64)
I ran apt-get upgrade before installing the drivers
With cuda-drivers-510 the GPU is visible, but I can’t install CUDA through apt-get, since
the cuda package depends on the latest driver version.
#:~$ /usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89
#:~$ inxi -G
Graphics: Card-1: Cirrus Logic GD 5446
Card-2: NVIDIA GV100GL [Tesla V100 SXM2 16GB]
Display Server: X.org 1.20.8 driver: nvidia tty size: 158x43 Advanced Data: N/A out of X
#:~$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 515.43.04 Tue Apr 26 15:52:32 UTC 2022
GCC version: gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
#:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.6 LTS
I’m seeing the same problem with 515 drivers from the CentOS 7 repos as well. I can’t find any documentation about the 515 branch. It looks like 510 is supposed to be the latest production drivers, but somehow 515 got added to the repos on May 4.
I just updated the drivers on one of my on-site GPU nodes and the 515 drivers do work here. My initial test was also in AWS, so it looks like there may be a problem with AWS or the specific cards that they’re using. This is the card that I have in my on-site node that works:
1b:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
This is the card in the p3 instance type in AWS:
00:1e.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
I’ll also note I’m seeing this error in dmesg when I try to run nvidia-smi on the AWS instance with the 515 driver.
[Wed May 25 19:19:20 2022] NVRM: GPU 0000:00:1e.0: RmInitAdapter failed! (0x25:0x17:1417)
[Wed May 25 19:19:20 2022] NVRM: GPU 0000:00:1e.0: rm_init_adapter failed, device minor number 0
[Wed May 25 19:19:21 2022] NVRM: GPU 0000:00:1e.0: RmInitAdapter failed! (0x25:0x17:1417)
[Wed May 25 19:19:21 2022] NVRM: GPU 0000:00:1e.0: rm_init_adapter failed, device minor number 0
Downgrading to the 510.73.08 driver in the same instance works fine.
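For anyone who wants to check their own logs for the same failure, here is a minimal sketch that pulls the failing PCI address and error code out of dmesg text. The function name `find_init_failures` and the `SAMPLE` constant are made up for illustration, and the regex is only based on the four lines quoted above, so other driver versions may word the message differently.

```python
import re

# Pattern modelled on the dmesg lines quoted in this thread, e.g.
# "NVRM: GPU 0000:00:1e.0: RmInitAdapter failed! (0x25:0x17:1417)"
INIT_FAILED = re.compile(
    r"NVRM: GPU (?P<pci>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-9a-fA-F]): "
    r"RmInitAdapter failed! \((?P<code>[^)]*)\)"
)


def find_init_failures(dmesg_text):
    """Return sorted, de-duplicated (pci_address, error_code) pairs."""
    matches = INIT_FAILED.finditer(dmesg_text)
    return sorted({(m.group("pci"), m.group("code")) for m in matches})


# Sample input copied from the dmesg output in this thread.
SAMPLE = """\
[Wed May 25 19:19:20 2022] NVRM: GPU 0000:00:1e.0: RmInitAdapter failed! (0x25:0x17:1417)
[Wed May 25 19:19:20 2022] NVRM: GPU 0000:00:1e.0: rm_init_adapter failed, device minor number 0
[Wed May 25 19:19:21 2022] NVRM: GPU 0000:00:1e.0: RmInitAdapter failed! (0x25:0x17:1417)
[Wed May 25 19:19:21 2022] NVRM: GPU 0000:00:1e.0: rm_init_adapter failed, device minor number 0
"""

for pci, code in find_init_failures(SAMPLE):
    print(f"GPU {pci} failed to initialize ({code})")
```

To run it against a real machine, you would pass in the text of your own dmesg output instead of `SAMPLE`. Repeated failures for the same GPU collapse into one entry.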
I think I finally figured out what is wrong here. When Nvidia released the R515 drivers, they created two different packages for the kernel modules. One package contains the new open source drivers, which only support Turing, Ampere and later. The other contains the proprietary drivers that they have been shipping for years, which cover all the architectures supported in previous versions, including Volta. The problem is that if you install one of the packages that depend on the kernel modules, the open source version gets pulled in by default. The way around it is to explicitly install the kmod-nvidia-latest-dkms package if you need support for Volta or earlier GPUs. The package with the open source drivers, which doesn’t work with these cards, is kmod-nvidia-open-dkms.
It would be great if Nvidia could document this fact when referencing their package repositories.
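The package mix-up described above can be turned into a quick sanity check. This is only a sketch: the helper `check_kmod` is hypothetical, the two package names are the ones given in this thread, and the supported-architecture set lists just Turing and Ampere, so anything newer would need to be added by hand. How you collect the list of installed packages depends on your distro and is left out.

```python
# Package names as given in this thread.
OPEN_KMOD = "kmod-nvidia-open-dkms"
PROPRIETARY_KMOD = "kmod-nvidia-latest-dkms"

# Assumption: only the architectures named in the thread are listed here;
# "and later" architectures are not included in this sketch.
OPEN_SUPPORTED = {"turing", "ampere"}


def check_kmod(installed_packages, gpu_arch):
    """Return "ok" or a short description of the package problem."""
    installed = set(installed_packages)
    if OPEN_KMOD in installed and gpu_arch.lower() not in OPEN_SUPPORTED:
        return (
            f"{OPEN_KMOD} is installed but does not support {gpu_arch}; "
            f"install {PROPRIETARY_KMOD} instead"
        )
    if OPEN_KMOD not in installed and PROPRIETARY_KMOD not in installed:
        return "no NVIDIA kernel module package found"
    return "ok"


# The situation from this thread: a Volta card with the open modules pulled in.
print(check_kmod(["kmod-nvidia-open-dkms"], "Volta"))
```

With a V100 and only kmod-nvidia-open-dkms installed, the check reports the mismatch and points at kmod-nvidia-latest-dkms, which matches the workaround above.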