Reinstalling CUDA for TensorFlow on Linux

Recently I discovered that TensorFlow no longer has access to the GPU: tensorflow.config.list_physical_devices() lists only the CPU, not the GPU. I also noticed that the CUDA version reported by nvcc --version (Cuda compilation tools, release 10.1, V10.1.243) is older than what TensorFlow requires (11.2). Since I was going to do a fresh install of CUDA anyway, I decided to also bump TensorFlow from 2.4.1 to 2.5.0 to take advantage of the newest features.
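
For reference, here is a minimal sketch of how the two checks above could be scripted. The helper names (parse_nvcc_release, cuda_is_new_enough, visible_gpus) and the REQUIRED_CUDA constant are made up for illustration, and the 11.2 requirement is simply the figure quoted above. Note that nvcc only reports the toolkit found on the PATH, so this is a rough sanity check rather than proof of which CUDA libraries TensorFlow loads at runtime.

```python
import re

# Assumption: the CUDA version TensorFlow needs, as quoted in the post (11.2).
REQUIRED_CUDA = (11, 2)


def parse_nvcc_release(nvcc_output):
    """Return (major, minor) parsed from `nvcc --version` output, or None."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def cuda_is_new_enough(nvcc_output, required=REQUIRED_CUDA):
    """True if the parsed release is at least the required (major, minor)."""
    release = parse_nvcc_release(nvcc_output)
    return release is not None and release >= required


def visible_gpus():
    """List the GPUs TensorFlow can see (needs TensorFlow installed)."""
    # Imported lazily so the parsing helpers above work without TensorFlow.
    import tensorflow as tf
    return tf.config.list_physical_devices("GPU")
```

Feeding the helper the nvcc line quoted above, cuda_is_new_enough("Cuda compilation tools, release 10.1, V10.1.243") returns False, which matches the mismatch I noticed. On a working setup, visible_gpus() should return a non-empty list.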

When I try to install CUDA, however, apt complains about many packages with unmet dependencies.

Let me start from the beginning. I followed this guide from NVIDIA.

  • pre-installation checks
    • lspci | grep -i nvidia lists:
2d:00.0 VGA compatible controller: NVIDIA Corporation TU104 [GeForce RTX 2080 Rev. A] (rev a1)
2d:00.1 Audio device: NVIDIA Corporation TU104 HD Audio Controller (rev a1)
2d:00.2 USB controller: NVIDIA Corporation TU104 USB 3.1 Host Controller (rev a1)
2d:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU104 USB Type-C UCSI Controller (rev a1)
    • I think my Linux distribution should be compatible. I’m on Linux Mint 20.1, which is based on Ubuntu 20.04, which is supported. Also, TensorFlow was able to recognize the GPU on this same system before.
    • gcc --version gives 9.3.0, same as in the system requirements for Ubuntu 20.04
    • uname -r gives 5.4.0-73-generic (the system requirements say I need 5.4.0)
  • I tried to install MLNX_OFED with sudo ./mlnxofedinstall --with-nvmf --with-nfsrdma --enable-gds --add-kernel-support, but that failed with sudo: ./mlnxofedinstall: command not found. So I skipped this step. I don’t think I need it anyway.
  • I removed the current CUDA installation with sudo apt-get --purge remove cuda
  • As suggested here, I also removed everything else connected to NVIDIA with sudo apt-get remove --purge '^nvidia-.*'.
  • I followed the installation instructions for Linux/x86_64/Ubuntu/20.04/deb(network) from NVIDIA, and the first few commands ran successfully:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /"
sudo apt-get update
  • But the last command, sudo apt-get -y install cuda, failed with the error:
The following packages have unmet dependencies:
 cuda : Depends: cuda-11-4 (>= 11.4.0) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.
  • As suggested by a moderator on the NVIDIA forums, I kept adding each package that apt said it couldn’t install to the command until I got to:
(tf_gpu) lukas@Makushin:/usr/local$ sudo apt install cuda cuda-11-4 cuda-runtime-11-4 cuda-demo-suite-11-4 cuda-drivers cuda-drivers-470 nvidia-driver-470 nvidia-settings nvidia-installer-cleanup nvidia-alternative xserver-xorg-video-nvidia-470 glx-alternative-nvidia xserver-xorg-video-nvidia-470
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Package glx-alternative-nvidia is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source
  • I have no clue where to get glx-alternative-nvidia from and I think installing CUDA is not supposed to be that complicated. After all, I installed it just two months ago without trouble. So this is where I gave up.
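
To make the "keep adding packages" step above less tedious, the dependency lines in apt's error message could be pulled out with a small parser. This is only a sketch under my own assumptions: the function name parse_unmet_dependencies is made up, and the regular expression is based solely on the error format shown above (package : Depends: dependency (constraint) ...), so it may miss other variants of apt's output such as alternatives separated by "|".

```python
import re

# Matches lines such as:
#  cuda : Depends: cuda-11-4 (>= 11.4.0) but it is not going to be installed
# The package name is optional so that continuation lines that start
# directly with "Depends:" are attributed to the previous package.
_DEP_LINE = re.compile(
    r"^\s*(?:(?P<pkg>\S+)\s+:\s+)?Depends:\s*(?P<dep>\S+)"
    r"(?:\s*\((?P<constraint>[^)]*)\))?"
)


def parse_unmet_dependencies(apt_output):
    """Return a list of (package, dependency, constraint) tuples.

    `constraint` is None when apt prints no version requirement.
    """
    results = []
    current_pkg = None
    for line in apt_output.splitlines():
        match = _DEP_LINE.match(line)
        if match is None:
            continue  # header, "E: ..." and other non-dependency lines
        if match.group("pkg"):
            current_pkg = match.group("pkg")
        results.append(
            (current_pkg, match.group("dep"), match.group("constraint"))
        )
    return results
```

Applied to the error message above, this yields a single entry linking cuda to cuda-11-4 with the constraint ">= 11.4.0", which is the package I then added to the install command by hand.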

Does anybody have an idea what’s going on there?

What steps do I need to follow to get my GPU working with TensorFlow?

System:

  • Linux Mint 20.1
  • NVIDIA RTX 2080 graphics card (compute capability 7.5)

Cross-posted from: Unix & Linux Stack Exchange