Ubuntu 18 - cuda-drivers-515 - “No devices were found” for Tesla V100

Hey guys! After installing the Nvidia 515 drivers on Ubuntu 18.04 with a V100 on board, nvidia-smi stopped recognizing the GPU.


#:~$ nvidia-smi

No devices were found

  • Drivers were not upgraded, but installed on a fresh Ubuntu 18.04 machine (on AWS)

  • Drivers were installed from the Nvidia apt repository http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/

  • apt-get upgrade was run before installing the drivers

  • With cuda-drivers-510 the GPU is visible, but I can’t install CUDA through apt-get since the cuda metapackage depends on the latest driver version (see the sketch after this list)
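
As a possible workaround for that last point (a rough sketch, assuming the ubuntu1804 repo exposes the usual cuda-drivers-<branch> and cuda-toolkit-<major>-<minor> metapackages; the toolkit version below is only an example), pin the working driver branch and install just the toolkit metapackage, which does not pull in a driver dependency:

sudo apt-get install cuda-drivers-510      # stay on the driver branch that still sees the GPU
sudo apt-get install cuda-toolkit-11-6     # toolkit only, no driver dependency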


#:~$ /usr/local/cuda/bin/nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver

Copyright (c) 2005-2019 NVIDIA Corporation

Built on Wed_Oct_23_19:24:38_PDT_2019

Cuda compilation tools, release 10.2, V10.2.89


#:~$ inxi -G

Graphics: Card-1: Cirrus Logic GD 5446

Card-2: NVIDIA GV100GL [Tesla V100 SXM2 16GB]

Display Server: X.org 1.20.8 driver: nvidia tty size: 158x43 Advanced Data: N/A out of X


#:~$ cat /proc/driver/nvidia/version

NVRM version: NVIDIA UNIX x86_64 Kernel Module 515.43.04 Tue Apr 26 15:52:32 UTC 2022

GCC version: gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)


#:~$ lsb_release -a

No LSB modules are available.

Distributor ID: Ubuntu

Description: Ubuntu 18.04.6 LTS

Release: 18.04

Codename: bionic


I’m seeing the same problem with the 515 drivers from the CentOS 7 repos as well. I can’t find any documentation about the 515 branch. It looks like 510 is supposed to be the latest production branch, but somehow 515 got added to the repos on May 4.
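
One quick way to see which driver versions a repo actually ships (and when 515 showed up) is to list all available versions of the driver packages; the package names below are what the rhel7 CUDA repo uses for the proprietary DKMS driver, adjust if yours differ:

yum --showduplicates list available kmod-nvidia-latest-dkms nvidia-driver-latest-dkms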

I just updated the drivers on one of my on-site GPU nodes and the 515 drivers do work there. My initial test was also in AWS, so it looks like there may be a problem with AWS or the specific cards that they’re using. This is the card in my on-site node that works:

1b:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)

This is the card in the p3 instance type in AWS:

00:1e.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
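
For comparison, the exact PCI vendor/device IDs for each card can be pulled with lspci in numeric mode (10de is Nvidia’s vendor ID):

lspci -nn -d 10de: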

I’ll also note that I’m seeing this error in dmesg when I try to run nvidia-smi on the AWS instance with the 515 driver.

[Wed May 25 19:19:20 2022] NVRM: GPU 0000:00:1e.0: RmInitAdapter failed! (0x25:0x17:1417)
[Wed May 25 19:19:20 2022] NVRM: GPU 0000:00:1e.0: rm_init_adapter failed, device minor number 0
[Wed May 25 19:19:21 2022] NVRM: GPU 0000:00:1e.0: RmInitAdapter failed! (0x25:0x17:1417)
[Wed May 25 19:19:21 2022] NVRM: GPU 0000:00:1e.0: rm_init_adapter failed, device minor number 0

Downgrading to the 510.73.08 driver in the same instance works fine.

I think that I finally figured out what is wrong here. When Nvidia released the R515 drivers they created two different packages for the kernel modules. One package contains the new open-source kernel modules, which only support Turing, Ampere, and later architectures; the other contains the proprietary modules they have been shipping for years, which cover all the architectures supported in previous releases, including Volta. The problem is that if you install one of the packages that depends on the kernel modules, the open-source flavor gets pulled in by default. The way around it is to explicitly install the kmod-nvidia-latest-dkms package if you need support for Volta or earlier GPUs. The package with the open-source modules that does not work with these cards is kmod-nvidia-open-dkms.
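
On CentOS/RHEL with the CUDA repo that works out to roughly the following (a sketch, not an official procedure):

sudo yum remove kmod-nvidia-open-dkms      # drop the open-source kernel module package if it was pulled in
sudo yum install kmod-nvidia-latest-dkms   # proprietary kernel modules, which still cover Volta
sudo reboot                                # or unload/reload the nvidia modules
modinfo nvidia | grep -i license           # proprietary module reports "NVIDIA", the open one "Dual MIT/GPL"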

It would be great if Nvidia could document this fact when referencing their package repositories.


I am getting the same problem on RHEL 7.9 with a V100, using the kmod-nvidia-latest-dkms package.
lspci knows I have a V100, but the driver does not recognize the PCI ID it lists as a V100.
modprobe -vv nvidia … fails with card not supported…
This is also a VM, so that may be a problem too.
The nvidia-kmod-common* package is missing, which could be a problem.
This now looks like a firmware problem.
Maybe it is a GRID 9.1 issue; we need to update that anyway.
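
A few checks that can help narrow down whether the kernel module is built and which package flavor is installed (generic triage, nothing specific to this setup):

dkms status                        # did the nvidia module actually build for the running kernel?
rpm -qa | grep -Ei 'nvidia|kmod'   # confirm kmod-nvidia-latest-dkms (not the -open flavor) is installed
sudo modprobe nvidia && lsmod | grep nvidia
sudo dmesg | grep -i nvrm          # look for RmInitAdapter or "not supported" messages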