Cannot install NVIDIA driver on Ubuntu 22.04 with A100

I’m stuck installing the NVIDIA driver on Ubuntu 22.04 with an NVIDIA A100. Does anyone have any suggestions? I’m following the instructions in the “NVIDIA Driver Installation Quickstart Guide”.

This is my environment.

$ lspci

06:00.0 3D controller [0302]: NVIDIA Corporation GA100 [A100 PCIe 80GB] [10de:20b5] (rev a1)
Subsystem: NVIDIA Corporation GA100 [A100 PCIe 80GB] [10de:1593]
Kernel modules: nvidiafb, nouveau, nvidia

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.2 LTS
Release: 22.04
Codename: jammy
$ uname -r
5.15.0-73-generic

After running “sudo apt-get -y install cuda-drivers”, I rebooted the server and set the post-installation environment variables in /etc/environment.

$ cat /etc/environment
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/cuda-12.1/bin"
LD_LIBRARY_PATH="/usr/local/cuda-12.1/lib64"
$ echo $PATH
/home/ubuntu/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/cuda-12.1/bin
$ echo $LD_LIBRARY_PATH
/usr/local/cuda-12.1/lib64

However, nvidia-smi fails, and syslog reports that the device 10de:20b5 is not supported by driver 530.30.02.

$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

$ sudo tail -n20 /var/log/syslog
Jun 6 05:21:28 kernel: [ 849.878253] nvidia: probe of 0000:06:00.0 failed with error -1
Jun 6 05:21:28 kernel: [ 849.878279] NVRM: The NVIDIA probe routine failed for 1 device(s).
Jun 6 05:21:28 kernel: [ 849.878281] NVRM: None of the NVIDIA devices were initialized.
Jun 6 05:21:28 kernel: [ 849.878478] nvidia-nvlink: Unregistered Nvlink Core, major device number 234
Jun 6 05:21:28 systemd-udevd[1091]: nvidia: Process '/sbin/modprobe nvidia-uvm' failed with exit code 1.
Jun 6 05:21:28 systemd[1]: nvidia-persistenced.service: Start request repeated too quickly.
Jun 6 05:21:28 systemd[1]: nvidia-persistenced.service: Failed with result 'exit-code'.
Jun 6 05:21:28 systemd[1]: Failed to start NVIDIA Persistence Daemon.
Jun 6 05:21:28 kernel: [ 850.002682] nvidia-nvlink: Nvlink Core is being initialized, major device number 234
Jun 6 05:21:28 kernel: [ 850.002688] NVRM: The NVIDIA GPU 0000:06:00.0 (PCI ID: 10de:20b5)
Jun 6 05:21:28 kernel: [ 850.002688] NVRM: installed in this system is not supported by the
Jun 6 05:21:28 kernel: [ 850.002688] NVRM: NVIDIA 530.30.02 driver release.
Jun 6 05:21:28 kernel: [ 850.002688] NVRM: Please see 'Appendix A - Supported NVIDIA GPU Products'
Jun 6 05:21:28 kernel: [ 850.002688] NVRM: in this release's README, available on the operating system
Jun 6 05:21:28 kernel: [ 850.002688] NVRM: specific graphics driver download page at www.nvidia.com.
Jun 6 05:21:28 kernel: [ 850.009560] nvidia: probe of 0000:06:00.0 failed with error -1
Jun 6 05:21:28 kernel: [ 850.009585] NVRM: The NVIDIA probe routine failed for 1 device(s).
Jun 6 05:21:28 kernel: [ 850.009587] NVRM: None of the NVIDIA devices were initialized.
Jun 6 05:21:28 kernel: [ 850.009816] nvidia-nvlink: Unregistered Nvlink Core, major device number 234
Jun 6 05:21:28 systemd-udevd[1091]: nvidia: Process '/sbin/modprobe nvidia-modeset' failed with exit code 1.
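
For anyone digging through a long syslog for the same symptom, here is a minimal sketch that pulls out only the PCI IDs that NVRM reports as unsupported. The function name nvrm_unsupported_ids is made up for this example, and the pattern assumes the message is worded exactly as in the log above.

```shell
# Hypothetical helper (not part of any NVIDIA tool): print the unique PCI IDs
# that the NVRM kernel messages flag as unsupported by the installed driver.
# Assumes the log line format shown above:
#   NVRM: The NVIDIA GPU 0000:06:00.0 (PCI ID: 10de:20b5)
nvrm_unsupported_ids() {
    # $1: log file to scan; defaults to /var/log/syslog
    grep -o 'NVRM: The NVIDIA GPU [0-9a-f:.]* (PCI ID: [0-9a-f:]*)' "${1:-/var/log/syslog}" \
        | sed 's/.*PCI ID: \([0-9a-f:]*\)).*/\1/' \
        | sort -u
}

# Example call (reading syslog may require sudo):
#   nvrm_unsupported_ids /var/log/syslog
```

The printed ID can then be compared against the output of lspci and the supported-products list in the driver README.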

According to the 530.30.02 README (Appendix A, Supported NVIDIA GPU Products), the driver supports the NVIDIA A100 80GB PCIe with PCI ID 20B5 10DE 1642. Am I missing something?

I found the root cause myself. This machine is from Vultr, and their documentation says the generic NVIDIA driver does not work with the GPU on this Vultr server. If anyone hits the same issue, please check Vultr’s documentation:

Why you need GRID Drivers
GRID drivers are required to use the GPU on your server. Without them, you will not be able to use the GPU. Using generic consumer drivers from NVIDIA or from the repo is not an option! These drivers simply do not allow vGPU products to function or be used.
NVIDIA only supports GRID drivers for vGPU and the drivers must be within the same driver branch as the hosts the instance is run on.
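
If you are not sure whether your server is a virtualized instance at all, a rough first check is systemd-detect-virt, which ships with systemd on Ubuntu. The sketch below only interprets its output; the helper name virt_hint and the wording of its messages are my own. A virtualized result does not prove that a vGPU is in use, only that the provider's documentation is worth reading before installing a driver.

```shell
# Hypothetical helper: turn the output of `systemd-detect-virt` into a hint.
# `systemd-detect-virt` prints "none" when no virtualization is detected,
# otherwise the name of the detected technology (for example "kvm").
virt_hint() {
    case "$1" in
        none|"")
            echo "no virtualization detected"
            ;;
        *)
            echo "virtualized ($1): check the provider documentation for a vGPU (GRID) driver requirement"
            ;;
    esac
}

# Example call on the server itself:
#   virt_hint "$(systemd-detect-virt)"
```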

