Unable to use NVIDIA GPU on Ubuntu 22.04 with Oracle headers

I was able to run nvidia-smi without any issues, and then tried to install the container toolkit as outlined in the documentation. After installing it, nvidia-smi now gives me the following error:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Info

$ uname -r
5.15.0-1017-oracle
$ find /lib/modules/$(uname -r) -type f -name '*.ko' | grep nvidia
/lib/modules/5.15.0-1017-oracle/kernel/drivers/video/fbdev/nvidia/nvidiafb.ko
$ sudo prime-select nvidia
Error: no integrated GPU detected.
$ sudo modprobe nvidia
modprobe: FATAL: Module nvidia not found in directory /lib/modules/5.15.0-1017-oracle
$ prime-select query
on-demand
$ dpkg -l | grep nvidia
ii  libnvidia-cfg1-515-server:amd64             515.65.01-0ubuntu0.22.04.1              amd64        NVIDIA binary OpenGL/GLX configuration library
ii  libnvidia-common-515-server                 515.65.01-0ubuntu0.22.04.1              all          Shared files used by the NVIDIA libraries
rc  libnvidia-compute-510:amd64                 510.85.02-0ubuntu0.22.04.1              amd64        NVIDIA libcompute package
rc  libnvidia-compute-515:amd64                 515.65.01-0ubuntu1                      amd64        NVIDIA libcompute package
ii  libnvidia-compute-515-server:amd64          515.65.01-0ubuntu0.22.04.1              amd64        NVIDIA libcompute package
ii  libnvidia-compute-515-server:i386           515.65.01-0ubuntu0.22.04.1              i386         NVIDIA libcompute package
ii  libnvidia-decode-515-server:amd64           515.65.01-0ubuntu0.22.04.1              amd64        NVIDIA Video Decoding runtime libraries
ii  libnvidia-decode-515-server:i386            515.65.01-0ubuntu0.22.04.1              i386         NVIDIA Video Decoding runtime libraries
ii  libnvidia-egl-wayland1:amd64                1:1.1.9-1.1                             amd64        Wayland EGL External Platform library -- shared library
ii  libnvidia-encode-515-server:amd64           515.65.01-0ubuntu0.22.04.1              amd64        NVENC Video Encoding runtime library
ii  libnvidia-encode-515-server:i386            515.65.01-0ubuntu0.22.04.1              i386         NVENC Video Encoding runtime library
ii  libnvidia-extra-515-server:amd64            515.65.01-0ubuntu0.22.04.1              amd64        Extra libraries for the NVIDIA Server Driver
ii  libnvidia-fbc1-515-server:amd64             515.65.01-0ubuntu0.22.04.1              amd64        NVIDIA OpenGL-based Framebuffer Capture runtime library
ii  libnvidia-fbc1-515-server:i386              515.65.01-0ubuntu0.22.04.1              i386         NVIDIA OpenGL-based Framebuffer Capture runtime library
ii  libnvidia-gl-515-server:amd64               515.65.01-0ubuntu0.22.04.1              amd64        NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii  libnvidia-gl-515-server:i386                515.65.01-0ubuntu0.22.04.1              i386         NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
rc  linux-modules-nvidia-510-5.15.0-1017-oracle 5.15.0-1017.22                          amd64        Linux kernel nvidia modules for version 5.15.0-1017
ii  linux-objects-nvidia-510-5.15.0-1017-oracle 5.15.0-1017.22                          amd64        Linux kernel nvidia modules for version 5.15.0-1017 (objects)
ii  linux-signatures-nvidia-5.15.0-1017-oracle  5.15.0-1017.22                          amd64        Linux kernel signatures for nvidia modules for version 5.15.0-1017-oracle
rc  nvidia-compute-utils-510                    510.85.02-0ubuntu0.22.04.1              amd64        NVIDIA compute utilities
rc  nvidia-compute-utils-515                    515.65.01-0ubuntu1                      amd64        NVIDIA compute utilities
ii  nvidia-compute-utils-515-server             515.65.01-0ubuntu0.22.04.1              amd64        NVIDIA compute utilities
rc  nvidia-container-toolkit-base               1.11.0-1                                amd64        NVIDIA Container Toolkit Base
rc  nvidia-dkms-515                             515.65.01-0ubuntu1                      amd64        NVIDIA DKMS package
ii  nvidia-dkms-515-server                      515.65.01-0ubuntu0.22.04.1              amd64        NVIDIA DKMS package
ii  nvidia-driver-515-server                    515.65.01-0ubuntu0.22.04.1              amd64        NVIDIA Server Driver metapackage
rc  nvidia-kernel-common-510                    510.85.02-0ubuntu0.22.04.1              amd64        Shared files used with the kernel module
rc  nvidia-kernel-common-515                    515.65.01-0ubuntu1                      amd64        Shared files used with the kernel module
ii  nvidia-kernel-common-515-server             515.65.01-0ubuntu0.22.04.1              amd64        Shared files used with the kernel module
ii  nvidia-kernel-source-515-server             515.65.01-0ubuntu0.22.04.1              amd64        NVIDIA kernel source package
ii  nvidia-modprobe                             515.65.01-0ubuntu1                      amd64        Load the NVIDIA kernel driver and create device files
ii  nvidia-prime                                0.8.17.1                                all          Tools to enable NVIDIA's Prime
rc  nvidia-settings                             515.65.01-0ubuntu1                      amd64        Tool for configuring the NVIDIA graphics driver
ii  nvidia-utils-515-server                     515.65.01-0ubuntu0.22.04.1              amd64        NVIDIA Server Driver support binaries
ii  xserver-xorg-video-nvidia-515-server        515.65.01-0ubuntu0.22.04.1              amd64        NVIDIA binary Xorg driver

Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.

Hi @generix, thanks for replying so quickly! Here is the log file

nvidia-bug-report.log.gz (73.9 KB)

It looks like signed, precompiled v510 modules were installed previously, but installing the toolkit upgraded the driver and switched it to v515 DKMS modules, and something is now missing.
Please reinstall the headers
sudo apt install --reinstall linux-headers-$(uname -r)
and post the output of
dkms status
afterwards.

So, I tried running sudo apt install --reinstall linux-headers-$(uname -r) and got this:

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
0 to upgrade, 0 to newly install, 1 reinstalled, 0 to remove and 0 not to upgrade.
Need to get 0 B/2,892 kB of archives.
After this operation, 0 B of additional disk space will be used.
(Reading database ... 228684 files and directories currently installed.)
Preparing to unpack .../linux-headers-5.15.0-1017-oracle_5.15.0-1017.22_amd64.deb ...
Unpacking linux-headers-5.15.0-1017-oracle (5.15.0-1017.22) over (5.15.0-1017.22) ...
Setting up linux-headers-5.15.0-1017-oracle (5.15.0-1017.22) ...
/etc/kernel/header_postinst.d/dkms:
 * dkms: running auto installation service for kernel 5.15.0-1017-oracle
Error! Your kernel headers for kernel 5.15.0-1017-oracle cannot be found.
Please install the linux-headers-5.15.0-1017-oracle package or use the --kernelsourcedir option to tell DKMS where it's located.
Error! Your kernel headers for kernel 5.15.0-1017-oracle cannot be found.
Please install the linux-headers-5.15.0-1017-oracle package or use the --kernelsourcedir option to tell DKMS where it's located.
   ...done.

However, re-running without --reinstall gave

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
linux-headers-5.15.0-1017-oracle is already the newest version (5.15.0-1017.22).
0 to upgrade, 0 to newly install, 0 to remove and 0 not to upgrade.

dkms status then gives

nvidia-srv/515.65.01: added
r8168/8.049.02: added
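
For readers comparing this with their own system: a module listed as "added" is registered with DKMS but has not been built or installed for any kernel, which is why modprobe cannot find it. A minimal sketch of a filter that reads dkms status output on stdin and prints the modules that are not yet installed (the function name dkms_not_installed is made up for illustration, and the parsing assumes the "name/version[, kernel, arch]: state" layout shown in this thread):

```shell
#!/bin/sh
# Hypothetical helper: read `dkms status` output on stdin and print every
# module/version whose state is anything other than "installed".
dkms_not_installed() {
    # Fields are split on ": ", so the last field is the state and the
    # first field starts with module/version.
    awk -F': ' 'NF >= 2 && $NF !~ /^installed/ { print $1 }'
}
```

On a real machine this would be used as `dkms status | dkms_not_installed`; empty output means every registered module is installed.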

Looks like the headers package has a bug and doesn’t install the needed link for dkms.
Please check whether /lib/modules/5.15.0-1017-oracle/build
exists and is a symbolic link to the headers’ directory /usr/src/linux-headers-5.15.0-1017-oracle

ls -l /lib/modules/5.15.0-1017-oracle/build
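
One caveat with the command above: when build is a real directory, ls -l lists its contents rather than the entry itself, so an empty directory shows only "total 0". The sketch below classifies the state explicitly. The function name check_build_link and the optional root prefix are assumptions added for illustration (the prefix only exists so the check can be tried on a scratch tree), and readlink -f is assumed to be available, as it is with GNU coreutils on Ubuntu.

```shell
#!/bin/sh
# Hypothetical helper: report the state of the DKMS build link for a kernel.
# $1 = kernel release (e.g. the output of `uname -r`)
# $2 = optional root prefix, empty on a real system
check_build_link() {
    kver="$1"
    root="${2:-}"
    build="$root/lib/modules/$kver/build"
    headers="$root/usr/src/linux-headers-$kver"

    if [ ! -d "$headers" ]; then
        echo "no-headers"          # headers tree itself is absent
    elif [ -L "$build" ]; then
        # Compare fully resolved paths so relative links also match.
        if [ "$(readlink -f "$build")" = "$(readlink -f "$headers")" ]; then
            echo "ok"
        else
            echo "wrong-target"
        fi
    elif [ -d "$build" ]; then
        echo "directory"           # real directory where a link should be
    elif [ -e "$build" ]; then
        echo "other"
    else
        echo "missing"
    fi
}
```

Calling `check_build_link "$(uname -r)"` should print "ok" on a healthy system; any other result means DKMS is unlikely to find the headers.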

What a spot!

$ ls -l /lib/modules/5.15.0-1017-oracle/build
total 0

Is the solution then to create the symlink?

Yes, please create it
sudo ln -s /usr/src/linux-headers-5.15.0-1017-oracle /lib/modules/5.15.0-1017-oracle/build
Please make sure /usr/src/linux-headers-5.15.0-1017-oracle exists beforehand.
Afterwards, run
sudo dkms install nvidia-srv/515.65.01
to trigger the module build, then post any errors and the output of
dkms status
afterwards.

Amazing! That’s worked a treat. Thank you very much!

$ dkms status
nvidia-srv/515.65.01, 5.15.0-1017-oracle, x86_64: installed
r8168/8.049.02: added

$ nvidia-smi
Fri Sep 16 11:47:30 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A6000    Off  | 00000000:09:00.0 Off |                  Off |
| 30%   59C    P0    88W / 300W |      0MiB / 49140MiB |      5%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

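For anyone in the future looking at this post: make sure to delete /lib/modules/5.15.0-1017-oracle/build before creating the symbolic link. In my case it already existed as an empty directory (hence the "total 0" above), and running ln -s against an existing directory creates the link inside that directory instead of replacing it.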
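
Putting the delete-then-link order together, here is a rough sketch rather than an official fix. The function name fix_build_link and the optional root prefix are invented for illustration; on a real system the prefix is left empty and the function has to run as root.

```shell
#!/bin/sh
# Hypothetical helper: replace an empty `build` directory (or a stale link)
# with a symlink to the kernel headers.
# $1 = kernel release, $2 = optional root prefix (empty on a real system)
fix_build_link() {
    kver="$1"
    root="${2:-}"
    build="$root/lib/modules/$kver/build"
    headers="$root/usr/src/linux-headers-$kver"

    # Refuse to link to a headers tree that is not there.
    if [ ! -d "$headers" ]; then
        echo "missing $headers" >&2
        return 1
    fi

    if [ -L "$build" ]; then
        rm -- "$build"                   # removes only the link itself
    elif [ -d "$build" ]; then
        rmdir -- "$build" || return 1    # rmdir fails unless it is empty
    fi

    # -n stops ln from following an existing link to a directory.
    ln -sn -- "$headers" "$build"
}
```

After `fix_build_link "$(uname -r)"` succeeds, the sudo dkms install nvidia-srv/515.65.01 step from earlier in the thread can be run again. Using rmdir instead of rm -r is deliberate: if build unexpectedly contains files, the function stops without deleting anything.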
