Cuda 11.6 installation error and disabled nvidia-smi

Dear @generix, per this thread in which you helped me extensively, I ended up upgrading the nvidia driver on my Ubuntu 18.04 workstation to version 510: Black screen in Ubuntu 18 even after purging Nvidia and installing drivers from repository - #34 by generix

I hadn’t reinstalled cuda because I could compartmentalize it in conda environments for all applications I needed so far.

Now I needed to install cuda for an application that seemingly cannot be in a conda environment. I mistakenly installed cudatoolkit 11.2 (which the application recommended) following the instructions for “deb local” on the nvidia website.

I got an incompatibility issue and thus removed it with:

sudo apt-get --purge remove "*cublas*" "cuda*" "nsight*"

And

sudo rm -rf /usr/local/cuda*

I learned that nvidia driver v510 requires toolkit 11.6, so I tried to install it per the instructions here: https://developer.nvidia.com/cuda-11-6-0-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=18.04&target_type=deb_local

That gave me a broken packages error:

The following packages have unmet dependencies:
cuda : Depends: cuda-11-6 (>= 11.6.0) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.

I do not seem to have any nvidia-related repo to comment out in

sudo vim /etc/apt/sources.list

(Attaching the file here in case I missed anything relevant: jgalaz_ubuntu18_sources.txt (2.9 KB))

Running the following after uninstalling cuda 11.2 and before attempting to install cuda 11.6 did not seem to help:

sudo apt --fix-broken install
sudo apt-get autoremove

The autoremove output (below) suggested that driver 460 was still installed on my machine (weird, as nvidia-smi previously reported 510 as the driver, as per the thread linked at the beginning of this post).
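
For reference, one way to see which driver branch apt actually has installed is to filter the output of dpkg -l. This is only a sketch: the helper name driver_branches is made up for illustration, and it assumes the driver metapackage follows the nvidia-driver-NNN naming seen in the autoremove output.

```shell
# Read `dpkg -l` output on stdin and print the branch number (e.g. 460)
# of every nvidia-driver-NNN package that is fully installed (state "ii").
# Assumption: the metapackage is named nvidia-driver-NNN, as in this thread.
driver_branches() {
    awk '$1 == "ii" && $2 ~ /^nvidia-driver-[0-9]+(:.*)?$/ {
        split($2, parts, /[-:]/)   # nvidia, driver, NNN[, arch]
        print parts[3]
    }'
}

# Usage:
#   dpkg -l | driver_branches
```

If this prints a branch other than the one nvidia-smi used to report, another package has probably replaced the driver.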

I’m not sure what the issue is. Do I need a fresh terminal after removing one cuda version and installing a new one? Is rebooting required at any point? I’m a bit jaded with nvidia driver and cuda installations and am wary about tinkering with this more without advice.

I’m particularly perturbed that nvidia-smi isn’t working now, which might suggest that the nvidia driver is messed up again.
I work remotely, so this is very stressful.

sudo apt-get autoremove
Reading package lists… Done
Building dependency tree
Reading state information… Done
The following packages will be REMOVED:
libcufft-11-2 libcufft-dev-11-2 libcurand-11-2 libcurand-dev-11-2 libcusolver-11-2 libcusolver-dev-11-2 libcusparse-11-2 libcusparse-dev-11-2 libnpp-11-2
libnpp-dev-11-2 libnvidia-cfg1-460 libnvidia-common-460 libnvidia-decode-460 libnvidia-encode-460 libnvidia-extra-460 libnvidia-fbc1-460 libnvidia-gl-460
libnvidia-ifr1-460 libnvjpeg-11-2 libnvjpeg-dev-11-2 libxnvctrl0 nvidia-compute-utils-460 nvidia-dkms-460 nvidia-driver-460 nvidia-kernel-common-460
nvidia-kernel-source-460 nvidia-modprobe nvidia-prime nvidia-settings nvidia-utils-460 screen-resolution-extra xserver-xorg-video-nvidia-460
0 upgraded, 0 newly installed, 32 to remove and 1 not upgraded.
After this operation, 2,748 MB disk space will be freed.
Do you want to continue? [Y/n] y
(Reading database … 233579 files and directories currently installed.)
Removing libcufft-dev-11-2 (10.4.0.72-1) …
Removing libcufft-11-2 (10.4.0.72-1) …
Removing libcurand-dev-11-2 (10.2.3.68-1) …
Removing libcurand-11-2 (10.2.3.68-1) …
Removing libcusolver-dev-11-2 (11.0.2.68-1) …
Removing libcusolver-11-2 (11.0.2.68-1) …
Removing libcusparse-dev-11-2 (11.3.1.68-1) …
Removing libcusparse-11-2 (11.3.1.68-1) …
Removing libnpp-dev-11-2 (11.2.1.68-1) …
Removing libnpp-11-2 (11.2.1.68-1) …
Removing nvidia-driver-460 (460.27.04-0ubuntu1) …
Removing xserver-xorg-video-nvidia-460 (460.27.04-0ubuntu1) …
Removing libnvidia-cfg1-460:amd64 (460.27.04-0ubuntu1) …
Removing libnvidia-ifr1-460:amd64 (460.27.04-0ubuntu1) …
Removing libnvidia-gl-460:amd64 (460.27.04-0ubuntu1) …
Removing libnvidia-common-460 (460.27.04-0ubuntu1) …
Removing libnvidia-encode-460:amd64 (460.27.04-0ubuntu1) …
Removing libnvidia-decode-460:amd64 (460.27.04-0ubuntu1) …
Removing libnvidia-extra-460:amd64 (460.27.04-0ubuntu1) …
Removing libnvidia-fbc1-460:amd64 (460.27.04-0ubuntu1) …
Removing libnvjpeg-dev-11-2 (11.3.1.68-1) …
Removing libnvjpeg-11-2 (11.3.1.68-1) …
dpkg: warning: while removing libnvjpeg-11-2, directory ‘/usr/local’ not empty so not removed
Removing nvidia-settings (470.57.01-0ubuntu0.18.04.1) …
Removing libxnvctrl0:amd64 (470.57.01-0ubuntu0.18.04.1) …
Removing nvidia-compute-utils-460 (460.27.04-0ubuntu1) …
Removing nvidia-dkms-460 (460.27.04-0ubuntu1) …
Removing all DKMS Modules
Done.
INFO:Disable nvidia
DEBUG:Parsing /usr/share/ubuntu-drivers-common/quirks/lenovo_thinkpad
DEBUG:Parsing /usr/share/ubuntu-drivers-common/quirks/put_your_quirks_here
DEBUG:Parsing /usr/share/ubuntu-drivers-common/quirks/dell_latitude
update-initramfs: deferring update (trigger activated)
Removing nvidia-kernel-common-460 (460.27.04-0ubuntu1) …
update-initramfs: deferring update (trigger activated)
Removing nvidia-kernel-source-460 (460.27.04-0ubuntu1) …
Removing nvidia-modprobe (460.27.04-0ubuntu1) …
Removing nvidia-prime (0.8.16~0.18.04.1) …
Removing nvidia-utils-460 (460.27.04-0ubuntu1) …
Removing screen-resolution-extra (0.17.3) …
Processing triggers for desktop-file-utils (0.23-1ubuntu3.18.04.2) …
Processing triggers for initramfs-tools (0.130ubuntu3.13) …
update-initramfs: Generating /boot/initrd.img-4.15.0-219-generic
Processing triggers for libc-bin (2.27-3ubuntu1.6) …
Processing triggers for man-db (2.8.3-2ubuntu0.1) …
Processing triggers for gnome-menus (3.13.3-11ubuntu1.1) …
Processing triggers for dbus (1.12.2-1ubuntu1.4) …
Processing triggers for mime-support (3.60ubuntu1) …

Thank you!

The issue was that you installed the full “cuda” metapackage, which also installs a driver. That is how your driver got downgraded to 460, the version that comes with cuda 11.2. To keep the driver from the ubuntu repo, use Software&Updates to install the driver, then install only the toolkit, e.g. sudo apt install cuda-toolkit-11-6 to install cuda toolkit 11.6 and leave the already installed driver intact.
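
Before running the real install, the effect on the driver can be previewed with apt-get's simulate flag (-s), which prints Inst/Remv/Conf lines without changing anything. A minimal sketch, assuming the package names used in this thread; driver_changes is a hypothetical helper name, and the list of package prefixes is taken from the autoremove output above and may not be complete.

```shell
# Read `apt-get -s install ...` output on stdin and print only the simulated
# install/remove actions that touch NVIDIA driver packages.
# Assumption: driver packages start with one of the prefixes listed below.
driver_changes() {
    grep -E '^(Inst|Remv) (nvidia-driver|nvidia-dkms|nvidia-kernel|libnvidia)' || true
}

# Usage (simulation only, nothing is installed or removed):
#   apt-get -s install cuda-toolkit-11-6 | driver_changes
#   apt-get -s install cuda | driver_changes
```

An empty result for the toolkit package suggests the driver would be left alone, while any line printed for the full metapackage shows the driver it would pull in or remove.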

Ok, I’m assuming I first need to purge my current installation of cuda-toolkit-11-6 using the instructions for deb local from the Nvidia site (I think there may be an uninstaller to do that properly).
Do I need to purge the Nvidia driver as well?

I did try sudo apt install first, from within a conda environment, but my problem with that was that I was never able to find where in the env conda had put cuda.
Then I tried installing with sudo apt outside a conda environment, but similarly, I didn’t see a /usr/local/cuda/ directory, and a search of the entire workstation for ‘nvcc’ returned nothing.

The application I’m trying to build requires that I provide CUDAHOME=/path-to-cuda/. It was only with the deb local instructions from Nvidia that I could provide /usr/local/cuda-16.x to compile the application. That didn’t work because of the broken installation, but if I don’t provide CUDAHOME I can’t even run the compilation; it errors out immediately.
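
A rough sketch of how the CUDAHOME value could be derived once a toolkit is installed. It assumes nvcc lives in <cuda root>/bin/nvcc, which as far as I know is the layout NVIDIA's deb packages use under /usr/local/cuda-<version>; other installs may differ. The function names find_nvcc and cuda_home_from_nvcc are made up for illustration.

```shell
# Print the CUDA root for a given nvcc path by stripping the trailing /bin/nvcc.
cuda_home_from_nvcc() {
    dirname "$(dirname "$1")"
}

# Look for nvcc on PATH first, then under /usr/local/cuda*/bin.
# Assumption: the toolkit was installed under /usr/local, which may not hold
# for conda or runfile installs with a custom prefix.
find_nvcc() {
    command -v nvcc 2>/dev/null && return 0
    for candidate in /usr/local/cuda*/bin/nvcc; do
        [ -x "$candidate" ] && { echo "$candidate"; return 0; }
    done
    return 1
}

# Usage:
#   nvcc_path=$(find_nvcc) && export CUDAHOME=$(cuda_home_from_nvcc "$nvcc_path")
```

If find_nvcc prints nothing, no toolkit with a compiler is installed in those locations, which would match the empty search for nvcc described above.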

I get this after installing the 510 driver with Software&Updates and running nvidia-smi in a fresh terminal:

$ nvidia-smi
Failed to initialize NVML: Driver/library version mismatch

Should I reboot?

Yes, you need to reboot.
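
This message usually means the kernel module that is still loaded belongs to the old driver while the userspace libraries on disk belong to the new one; a reboot loads the matching module. As a hedged sketch, the two versions can be compared like this. first_version is a hypothetical helper, and the package name nvidia-utils-510 is an assumption based on the driver branch discussed in this thread.

```shell
# Print the first dotted version number (e.g. 460.27.04) found on stdin.
first_version() {
    grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n1
}

# Usage:
#   first_version < /proc/driver/nvidia/version                         # loaded kernel module
#   dpkg-query -W -f='${Version}\n' nvidia-utils-510 | first_version    # installed userspace tools
```

If the two numbers differ, nvidia-smi is expected to fail until the machine is rebooted.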