CUDA toolkit on Ubuntu 22.04 with an NVIDIA driver that can't be upgraded

I am trying to install the CUDA toolkit on my KVM virtual machine running Ubuntu 22.04. I was told to install version 10.1, which I did. However, when I run nvidia-smi, the system returns “Failed to initialize NVML: Driver/library version mismatch”. dmesg returns the following:

[4261288.295387] NVRM: API mismatch: the client has the version 525.60.13, but
[4261288.295387] NVRM: this kernel module has the version 418.74. Please
[4261288.295387] NVRM: make sure that this kernel module and all NVIDIA driver

Unfortunately, I can’t update the nvidia driver.

Can you help, please?

Many thanks.



Hi Ed,

from my point of view the past 3-4 updates from NVIDIA have been broken O_o.

[4261288.295387] NVRM: API mismatch: the client has the version 525.60.13, but
[4261288.295387] NVRM: this kernel module has the version 418.74. Please

This means that your driver got an automatic update to the newest version 525.x, while the kernel module is still the “old” 418.x version.
Theoretically this could be fixed by restarting the computer, but here that will not help at ALL. The CUDA version is tied to the driver, so you either need to downgrade the driver or upgrade CUDA O_o. For example, driver 525 is only compatible with CUDA 12.0 (which is an alpha at best) and CUDA 11.7.
For a downgrade you would need working uninstallation scripts, so good luck with that …
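If you want to double-check the pairing on a machine where the driver works, nvidia-smi's header prints both versions side by side. A quick sketch (on your box it will currently just show the NVML error instead):

```shell
# nvidia-smi's header shows the driver version and the highest CUDA version
# that driver supports (the field exists roughly since the 410 series), e.g.
#   Driver Version: 418.74     CUDA Version: 10.1
command -v nvidia-smi >/dev/null && nvidia-smi | head -n 4 \
  || echo "nvidia-smi not available"
```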

The following is just what we do, born of frustration with the quality of the NVIDIA Ubuntu installation scripts. Only do this if you know what you are doing and how to fix it if it goes wrong, and BACKUP everything that is important to you BEFORE DOING ANY OF THIS; big SSDs are cheap these days:

Most uninstall scripts and some installation scripts from NVIDIA are broken, and since you need a driver, CUDA, cuDNN and so on, chances are high you will run into problems during installation. I’m also not sure why you are supposed to run CUDA 10.1, which is a very old version.


In the past some parts of the NVIDIA stack were tied to Ubuntu’s GNOME desktop. For your 22.04 system this should not be the case, but anyway: don’t do this if you have no idea how to reinstall the desktop from the terminal.

First purge everything from nvidia and cuda:
sudo apt purge "*nvidia*"
sudo apt purge "*cuda*"

Really remove it
sudo apt autoremove

Chances are there is still something left, so look into your typical installation:
ls /usr/local/
And if there is still something, remove it, e.g.:
sudo rm -rf /usr/local/cuda10.x
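To check whether anything survived the purge before you restart, a quick sketch (the echo fallbacks just make sure each command prints something either way):

```shell
# Packages dpkg still knows about, including removed-but-not-purged ones
# (state "rc" in the first column):
dpkg -l 2>/dev/null | grep -Ei 'nvidia|cuda' || echo "no nvidia/cuda packages left"

# Manual installs under /usr/local that apt knows nothing about:
ls /usr/local/ 2>/dev/null | grep -i cuda || echo "no cuda directories in /usr/local"
```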

Now restart the computer. If you still have a desktop, everything should be OK and you are simply using the nouveau driver.

Now do:
sudo apt update
And if there is still an update offered from “*nvidia*”, you need to check whether it comes from the correct repository. If not, remove the repository manually. (Actually, if there is still something, it probably IS an old repository.)
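To find out where such an update comes from, list every configured apt source that mentions nvidia or cuda (a sketch):

```shell
# Show every apt source mentioning nvidia or cuda, together with the file it
# lives in, so you can delete the stale one:
grep -rniE 'nvidia|cuda' /etc/apt/sources.list /etc/apt/sources.list.d/ 2>/dev/null \
  || echo "no nvidia/cuda repositories configured"
```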
If there is nothing from “*nvidia*”: congratulations, you now have a virgin system and can typically follow the installation instructions from NVIDIA for your selected configuration, although you might want to check the procedure for “fix broken install”!

And don’t forget to block automatic updates for your NVIDIA driver ;-) or this will happen again.
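On Ubuntu the usual way to do that is apt-mark hold. A sketch; the package names below are examples for the 418 series, adjust them to whatever dpkg shows on your system:

```shell
# Pin the driver packages so "apt upgrade" will not silently replace them
# (example names; check "dpkg -l | grep nvidia" for the real ones):
sudo apt-mark hold nvidia-driver-418 libnvidia-compute-418

# Verify what is held; "sudo apt-mark unhold <pkg>" undoes it later:
apt-mark showhold
```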

Hi Walter

Many thanks for the reply.

Unfortunately it did not work; 525 is still sitting there somehow. And I know I have no access to the host (which provides 418 to all the other VMs).

Please correct me if I am wrong, but all the /dev entries that exist in my VM are defined by the host, and the /dev/nvidia* devices that were passed along belong to the host’s kernel module. I did not need to define them or do anything besides just use them. Yet somehow I broke something, but how?

This is what the VM has under /dev:

crw-rw-rw- 1 root root 195, 0 Nov 16 20:30 nvidia0
crw-rw-rw- 1 root root 195, 255 Nov 16 20:30 nvidiactl
crw-rw-rw- 1 root root 195, 254 Nov 16 20:30 nvidia-modeset
crw-rw-rw- 1 root root 240, 0 Nov 16 20:30 nvidia-uvm
crw-rw-rw- 1 root root 240, 1 Nov 16 20:30 nvidia-uvm-tools

In order to remove everything related to NVIDIA except the items shown above, I did:

rm /etc/apt/sources.list.d/cuda*
apt remove --autoremove nvidia-cuda-toolkit
apt remove --autoremove nvidia-*
sudo apt-get remove --purge '^nvidia-.*'
apt-get purge nvidia*
apt-get autoremove
apt-get autoclean
rm -rf /usr/local/cuda*

and rebooted as many times as was necessary.

525 is still sitting there. As far as I know I can’t install CUDA 12, as the host provides driver 418 and not 525 (or can I? I am utterly confused about what to do).

I have also disabled all graphical sessions, removed the nouveau driver and used a plain console, to be sure that there is nothing from NVIDIA left in the system (except /dev), but it did not make any difference.
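To be thorough, I also searched for leftover user-space libraries with something like:

```shell
# The driver's user-space libraries carry the version in the filename,
# e.g. libnvidia-ml.so.525.60.13, so this shows which versions are present:
find /usr /lib /opt -name 'libnvidia*' 2>/dev/null || true
ldconfig -p | grep -i nvidia || echo "nothing nvidia registered with the dynamic linker"
```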

Any thoughts?



Hi Ed,

I think it might depend on the VM technology you are using. We are using Docker, and with Docker it is NOT possible to use a different driver in the container than on the host. Meaning: if the host has 418, the containers need to use the same O_o.
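For example (just a sketch, assuming the NVIDIA container toolkit is set up on the host; the image tag is an example and old tags may no longer be published): the host's 418.x driver supports CUDA up to 10.1, which is probably why you were told to install 10.1 in the first place.

```shell
# A CUDA 10.1 image matches what a 418.x host kernel module can serve:
command -v docker >/dev/null \
  && docker run --rm --gpus all nvidia/cuda:10.1-base nvidia-smi \
  || echo "docker not available here"

# A CUDA 12.x image would ship 5xx-series user-space libraries instead and
# fail with exactly the "Driver/library version mismatch" you are seeing.
```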

So in this case you need to check where this message comes from: the VM or the host. My guess is that it comes from the VM, which has a different driver than the host, but from what you describe it could also come from the host. Can you log in to the host and try nvidia-smi?
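Inside the VM you can compare the two sides directly (a sketch):

```shell
# Kernel side: the loaded module's version. With device passthrough this is
# the HOST's module (418.74 in your case):
cat /proc/driver/nvidia/version 2>/dev/null || echo "no nvidia kernel module visible"

# Client side: driver packages installed inside the VM itself. If anything
# here says 525.x, the mismatch message originates in the VM's user space:
dpkg -l 2>/dev/null | grep -i nvidia || echo "no nvidia packages installed in the VM"
```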

But why are you working hands-on on the VM anyway?
Normally you would check/change the configuration for the VM and then recreate it as a new VM, no?

Hi Walter

I believe it is Docker. I can’t log in to the host, but I talked to the person responsible for it and he said that the host and all other VMs use 418. It seems that I am the only one who did the stupid thing of trying to install a new version (I did not know about the need to stick to 418).

I am working hands-on on the VM because I have two huge projects running on it. I could ask the administrator to restore the VM from two weeks ago (before I tried to install anything), but to do that a running job would need to be killed (it runs for days). However, I must confess I am puzzled as to why I can’t simply remove what I did.

How can I check if the msg comes from VM or host?

Many thanks