6.5.0 kernel issues with CUDA 12.1: how to make nvidia-modprobe install modules for only one kernel version

TL;DR, how do I:

  1. Upgrade the CUDA library versions / downgrade the driver version
  2. Remove a kernel (6.5.0-35) that is somehow “associated” with the Linux HWE image / headers
  3. Make nvidia-modprobe install modules for only one kernel version (5.15.0) instead of both 6.5.0-35 and 5.15.0?

I’m using Ubuntu 22.04 and a GeForce RTX 3090.

— (Main text below) —

I’ve been having trouble loading CUDA 12.1 on my Ubuntu machine, and based on some searching I believe the 6.5.0 kernel is the cause. I therefore installed the 5.15.0 kernel with apt-get install, pulling in the matching kernel image, headers and modules packages.

I was able to boot the 5.15.0 kernel without errors (and I stopped needrestart), but now, when installing CUDA 12.1, I get the following error at the end:

Error info
Setting up nvidia-dkms-530 (530.30.02-0ubuntu1) ...
update-initramfs: deferring update (trigger activated)
update-initramfs: Generating /boot/initrd.img-5.15.0-051500-generic

A modprobe blacklist file has been created at /etc/modprobe.d to prevent Nouveau
from loading. This can be reverted by deleting the following file:
/etc/modprobe.d/nvidia-graphics-drivers.conf

A new initrd image has also been created. To revert, please regenerate your
initrd by running the following command after deleting the modprobe.d file:
`/usr/sbin/initramfs -u`

*****************************************************************************
*** Reboot your computer and verify that the NVIDIA graphics driver can   ***
*** be loaded.                                                            ***
*****************************************************************************

INFO:Enable nvidia
DEBUG:Parsing /usr/share/ubuntu-drivers-common/quirks/dell_latitude
DEBUG:Parsing /usr/share/ubuntu-drivers-common/quirks/lenovo_thinkpad
DEBUG:Parsing /usr/share/ubuntu-drivers-common/quirks/put_your_quirks_here
Removing old nvidia-530.30.02 DKMS files...
Module nvidia-530.30.02 for kernel 5.15.0-051500-generic (x86_64).
Before uninstall, this module version was ACTIVE on this kernel.

nvidia.ko:
 - Uninstallation
   - Deleting from: /lib/modules/5.15.0-051500-generic/updates/dkms/
 - Original module
   - No original module was found for this module on this kernel.
   - Use the dkms install command to reinstall any previous module version.


nvidia-modeset.ko:
 - Uninstallation
   - Deleting from: /lib/modules/5.15.0-051500-generic/updates/dkms/
 - Original module
   - No original module was found for this module on this kernel.
   - Use the dkms install command to reinstall any previous module version.


nvidia-drm.ko:
 - Uninstallation
   - Deleting from: /lib/modules/5.15.0-051500-generic/updates/dkms/
 - Original module
   - No original module was found for this module on this kernel.
   - Use the dkms install command to reinstall any previous module version.


nvidia-peermem.ko:
 - Uninstallation
   - Deleting from: /lib/modules/5.15.0-051500-generic/updates/dkms/
 - Original module
   - No original module was found for this module on this kernel.
   - Use the dkms install command to reinstall any previous module version.


nvidia-uvm.ko:
 - Uninstallation
   - Deleting from: /lib/modules/5.15.0-051500-generic/updates/dkms/
 - Original module
   - No original module was found for this module on this kernel.
   - Use the dkms install command to reinstall any previous module version.

depmod...
Deleting module nvidia-530.30.02 completely from the DKMS tree.
Loading new nvidia-530.30.02 DKMS files...
Building for 5.15.0-051500-generic 6.5.0-35-generic
Building for architecture x86_64
Building initial module for 5.15.0-051500-generic
Secure Boot not enabled on this system.
Done.

nvidia.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/5.15.0-051500-generic/updates/dkms/

nvidia-modeset.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/5.15.0-051500-generic/updates/dkms/

nvidia-drm.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/5.15.0-051500-generic/updates/dkms/

nvidia-peermem.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/5.15.0-051500-generic/updates/dkms/

nvidia-uvm.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/5.15.0-051500-generic/updates/dkms/

depmod...
Building initial module for 6.5.0-35-generic
ERROR: Cannot create report: [Errno 17] File exists: '/var/crash/nvidia-kernel-source-530.0.crash'
Error! Bad return status for module build on kernel: 6.5.0-35-generic (x86_64)
Consult /var/lib/dkms/nvidia/530.30.02/build/make.log for more information.
dpkg: error processing package nvidia-dkms-530 (--configure):
 installed nvidia-dkms-530 package post-installation script subprocess returned error exit status 10
dpkg: dependency problems prevent configuration of cuda-drivers-530:
 cuda-drivers-530 depends on nvidia-dkms-530 (>= 530.30.02); however:
  Package nvidia-dkms-530 is not configured yet.

dpkg: error processing package cuda-drivers-530 (--configure):
 dependency problems - leaving unconfigured
dpkg: dependency problems prevent configuration of nvidia-driver-530:
 nvidia-driver-530 depends on nvidia-dkms-530 (= 530.30.02-0ubuntu1); however:
  Package nvidia-dkms-530 is not configured yet.

dpkg: error processing package nvidia-driver-530 (--configure):
 dependency problems - leaving unconfigured
No apport report written because the error message indicates its a followup error from a previous failure.
No apport report written because the error message indicates its a followup error from a previous failure.
dpkg: dependency problems prevent configuration of cuda-drivers:
 cuda-drivers depends on cuda-drivers-530 (= 530.30.02-1); however:
  Package cuda-drivers-530 is not configured yet.

dpkg: error processing package cuda-drivers (--configure):
 dependency problems - leaving unconfigured
dpkg: dependency problems prevent configuration of cuda-runtime-12-1:
 cuda-runtime-12-1 depends on cuda-drivers (>= 530.30.02); however:
  Package cuda-drivers is not configured yet.

dpkg: error processing package cuda-runtime-12-1 (--configure):
 dependency problems - leaving unconfigured
dpkg: dependency problems prevent configuration of cuda-12-1:
 cuda-12-1 depends on cuda-runtime-12-1 (>= 12.1.0); however:
  Package cuda-runtime-12-1 is not configured yet.

dpkg: error processing package cuda-12-1 (--configure):
 dependency problems - leaving unconfigured
dpkg: dependency problems prevent configuration of cuda:
 cuda depends on cuda-12-1 (>= 12.1.0); however:
  Package cuda-12-1 is not configured yet.
No apport report written because MaxReports is reached already
No apport report written because MaxReports is reached already
No apport report written because MaxReports is reached already
No apport report written because MaxReports is reached already


dpkg: error processing package cuda (--configure):
 dependency problems - leaving unconfigured
dpkg: dependency problems prevent configuration of cuda-demo-suite-12-1:
 cuda-demo-suite-12-1 depends on cuda-runtime-12-1; however:
  Package cuda-runtime-12-1 is not configured yet.

dpkg: error processing package cuda-demo-suite-12-1 (--configure):
 dependency problems - leaving unconfigured
No apport report written because MaxReports is reached already
Processing triggers for initramfs-tools (0.140ubuntu13.4) ...
update-initramfs: Generating /boot/initrd.img-5.15.0-051500-generic
Errors were encountered while processing:
 nvidia-dkms-530
 cuda-drivers-530
 nvidia-driver-530
 cuda-drivers
 cuda-runtime-12-1
 cuda-12-1
 cuda
 cuda-demo-suite-12-1
E: Sub-process /usr/bin/dpkg returned an error code (1)

Note in particular the line Building for 5.15.0-051500-generic 6.5.0-35-generic. It shows that, even though I booted into 5.15.0-051500-generic, the installer still builds the NVIDIA modules for 6.5.0-35-generic as well, and that is the build that fails. The build and installation for 5.15.0-051500-generic succeed, although I do remember seeing some kind of “needrestart not used” message.
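To make the observation above easier to check on a long log, here is a minimal Python sketch that pulls out which kernels the build targeted and which ones failed. It assumes the log lines look exactly like the ones pasted above; the function name summarize_dkms_log and the sample string are hypothetical, made up for illustration.

```python
import re

# Matches the failure line printed in the log above.
FAILED_BUILD = re.compile(
    r"Error! Bad return status for module build on kernel: (\S+)"
)

def summarize_dkms_log(log_text):
    """Return (kernels the build targeted, kernels whose build failed)."""
    targeted = []
    failed = []
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        # "Building for <k1> <k2> ..." lists the target kernels;
        # "Building for architecture ..." is a different message, so skip it.
        if line.startswith("Building for ") and not line.startswith(
            "Building for architecture"
        ):
            targeted = line[len("Building for "):].split()
        match = FAILED_BUILD.match(line)
        if match:
            failed.append(match.group(1))
    return targeted, failed

# Sample lines copied from the log in this post.
sample = """\
Building for 5.15.0-051500-generic 6.5.0-35-generic
Building for architecture x86_64
Building initial module for 5.15.0-051500-generic
Building initial module for 6.5.0-35-generic
Error! Bad return status for module build on kernel: 6.5.0-35-generic (x86_64)
"""

targeted, failed = summarize_dkms_log(sample)
print("targeted:", targeted)  # ['5.15.0-051500-generic', '6.5.0-35-generic']
print("failed:", failed)      # ['6.5.0-35-generic']
```

On this log it reports two targeted kernels and a single failure on 6.5.0-35-generic, which matches what I read by eye.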

Regardless, it worked fine for a while, and then I started getting Failed to initialize NVML: Driver/library version mismatch. Checking /var/log/apt/history.log, I found this entry from the day before:

Log info
Start-Date: 2024-06-08  06:19:43
Commandline: /usr/bin/unattended-upgrade
Install: libnvidia-common-535:amd64 (535.171.04-0ubuntu0.22.04.1, automatic), libnvidia-fbc1-535:amd64 (535.171.04-0ubuntu0.22.04.1, automatic), libnvidia-gl-535:amd64 (535.171.04-0ubuntu0.22.04.1, automatic), libnvidia-extra-535:amd64 (535.171.04-0ubuntu0.22.04.1, automatic), nvidia-compute-utils-535:amd64 (535.171.04-0ubuntu0.22.04.1, automatic), nvidia-dkms-535:amd64 (535.171.04-0ubuntu0.22.04.1, automatic), nvidia-driver-535:amd64 (535.171.04-0ubuntu0.22.04.1, automatic), libnvidia-encode-535:amd64 (535.171.04-0ubuntu0.22.04.1, automatic), nvidia-utils-535:amd64 (535.171.04-0ubuntu0.22.04.1, automatic), xserver-xorg-video-nvidia-535:amd64 (535.171.04-0ubuntu0.22.04.1, automatic), libnvidia-decode-535:amd64 (535.171.04-0ubuntu0.22.04.1, automatic), nvidia-kernel-common-535:amd64 (535.171.04-0ubuntu0.22.04.1, automatic), nvidia-firmware-535-535.171.04:amd64 (535.171.04-0ubuntu0.22.04.1, automatic), libnvidia-cfg1-535:amd64 (535.171.04-0ubuntu0.22.04.1, automatic), nvidia-kernel-source-535:amd64 (535.171.04-0ubuntu0.22.04.1, automatic), libnvidia-compute-535:amd64 (535.171.04-0ubuntu0.22.04.1, automatic)
Upgrade: libnvidia-common-530:amd64 (530.30.02-0ubuntu1, 535.171.04-0ubuntu0.22.04.1), libnvidia-fbc1-530:amd64 (530.30.02-0ubuntu1, 535.171.04-0ubuntu0.22.04.1), libnvidia-gl-530:amd64 (530.30.02-0ubuntu1, 535.171.04-0ubuntu0.22.04.1), libnvidia-extra-530:amd64 (530.30.02-0ubuntu1, 535.171.04-0ubuntu0.22.04.1), nvidia-compute-utils-530:amd64 (530.30.02-0ubuntu1, 535.171.04-0ubuntu0.22.04.1), nvidia-dkms-530:amd64 (530.30.02-0ubuntu1, 535.171.04-0ubuntu0.22.04.1), nvidia-driver-530:amd64 (530.30.02-0ubuntu1, 535.171.04-0ubuntu0.22.04.1), libnvidia-encode-530:amd64 (530.30.02-0ubuntu1, 535.171.04-0ubuntu0.22.04.1), nvidia-utils-530:amd64 (530.30.02-0ubuntu1, 535.171.04-0ubuntu0.22.04.1), xserver-xorg-video-nvidia-530:amd64 (530.30.02-0ubuntu1, 535.171.04-0ubuntu0.22.04.1), libnvidia-decode-530:amd64 (530.30.02-0ubuntu1, 535.171.04-0ubuntu0.22.04.1), nvidia-kernel-common-530:amd64 (530.30.02-0ubuntu1, 535.171.04-0ubuntu0.22.04.1), libnvidia-cfg1-530:amd64 (530.30.02-0ubuntu1, 535.171.04-0ubuntu0.22.04.1), nvidia-kernel-source-530:amd64 (530.30.02-0ubuntu1, 535.171.04-0ubuntu0.22.04.1), libnvidia-compute-530:amd64 (530.30.02-0ubuntu1, 535.171.04-0ubuntu0.22.04.1)
Error: Sub-process /usr/bin/dpkg returned an error code (1)
End-Date: 2024-06-08  06:19:48

Note that no user triggered this update; it was run by unattended-upgrade. Hence I believe the library version is now probably something like 535.182, rather than the 535.172 that the driver expects (nvidia-smi also printed an extra message at the beginning saying the library version was 535.182).
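To double-check which versions the unattended upgrade actually tried to move between, here is a minimal Python sketch that parses an Upgrade: line from /var/log/apt/history.log into old and new versions per package. It assumes the "name:arch (old, new)" layout seen in the entry above; parse_upgrades and the shortened sample line are hypothetical, for illustration only.

```python
import re

# One entry looks like: name:arch (old_version, new_version)
UPGRADE_ENTRY = re.compile(r"([^\s,:]+):(\w+) \(([^,()]+), ([^()]+)\)")

def parse_upgrades(line):
    """Map package name -> (old_version, new_version) for an 'Upgrade:' line."""
    if not line.startswith("Upgrade: "):
        return {}
    return {
        name: (old, new)
        for name, _arch, old, new in UPGRADE_ENTRY.findall(line)
    }

# Two entries copied from the history.log excerpt in this post.
sample = (
    "Upgrade: nvidia-dkms-530:amd64 (530.30.02-0ubuntu1, "
    "535.171.04-0ubuntu0.22.04.1), nvidia-utils-530:amd64 "
    "(530.30.02-0ubuntu1, 535.171.04-0ubuntu0.22.04.1)"
)

for package, (old, new) in parse_upgrades(sample).items():
    print(f"{package}: {old} -> {new}")
```

For the entry above, this shows the 530 packages going from 530.30.02-0ubuntu1 to 535.171.04-0ubuntu0.22.04.1, so the version recorded in the log is 535.171.04.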

So I think I have a few options:

  1. Figure out how to downgrade my libraries / upgrade my driver.
  2. Remove kernel version 6.5.0
  3. Let nvidia-modprobe only install for kernel version 5.15.0

For 1, it’s not clear to me how to do this. 2 seems the most likely to work, but when I tried it, I got this output:

Log info

The following packages were automatically installed and are no longer required:
amd64-microcode linux-headers-generic-hwe-22.04 thermald
Use 'sudo apt autoremove' to remove them.
The following packages will be REMOVED:
linux-generic-hwe-22.04* linux-image-6.5.0-35-generic* linux-image-generic-hwe-22.04*
0 upgraded, 0 newly installed, 3 to remove and 55 not upgraded.
8 not fully installed or removed.
After this operation, 14.3 MB disk space will be freed.

I probably don’t want to remove the Linux HWE kernel packages; that’s just an instinct. Perhaps this also gives some insight into 3: nvidia-modprobe is probably installing the kernel modules for the “main” kernel version, the one the HWE kernel is associated with, i.e. 6.5.0-35-generic. So I’d like to find a way to either “de-main”-ify the 6.5.0 kernel or remove it entirely.

How do I do any of these three things? Thank you very much!