Followed the NVIDIA CUDA Installation Guide for Linux, but it fails at the driver install

EDIT: I wasn’t able to post the issue as is, since I kept getting ‘new users are only allowed to post 3 links per post’, so I replaced all the periods with (dot) and deleted all instances of ‘https’, ‘com’, and ‘html’ in order to pass through the spam filter. It seems that a lot of the output was being interpreted as links, since much of it does contain links.


I am trying to install CUDA 11(dot)1, both the runtime API (toolkit) and the driver for my GPUs(dot)

I am running Ubuntu x86_64 18(dot)04(dot) I have tried upgrading my Cuda runtime to 11(dot)1 but have not been able to do so(dot) The driver has been updated, but not my runtime api(dot)

nvidia-smi

Shows that I have upgraded to 11(dot)0, but

nvcc -V

Shows version 10(dot)0(dot)130 installed for the runtime API(dot)
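For context, nvidia-smi reports the highest CUDA version the installed driver supports, while nvcc -V reports the version of the installed toolkit, so the two numbers can differ. A minimal sketch of that comparison follows; the function names version_le and toolkit_supported are made up for illustration, and it assumes GNU sort (for the -V flag), which Ubuntu ships.

```shell
#!/usr/bin/env bash
# version_le A B: succeeds when version A <= version B (relies on GNU sort -V).
version_le() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# toolkit_supported TOOLKIT DRIVER_MAX: reports how the toolkit version
# compares with the "CUDA Version" shown by nvidia-smi.
toolkit_supported() {
    if version_le "$1" "$2"; then
        echo "toolkit $1 is within driver limit $2"
    else
        echo "toolkit $1 is newer than driver limit $2"
    fi
}

# Values taken from this post: nvcc reported 10.0, nvidia-smi reported 11.0.
toolkit_supported 10.0 11.0
# The version I am trying to reach:
toolkit_supported 11.1 11.0
```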

Following the instructions from
docs(dot)nvidia (dot) /cuda/cuda-installation-guide-linux/index (dot)

I will go through the commands in order listed in the guide(dot)

Section 2(dot) Pre-installation Actions

lspci | grep -i nvidia resulted in

19:00(dot)0 VGA compatible controller: NVIDIA Corporation Device 1e04 (rev a1)
19:00(dot)1 Audio device: NVIDIA Corporation Device 10f7 (rev a1)
19:00(dot)2 USB controller: NVIDIA Corporation Device 1ad6 (rev a1)
19:00(dot)3 Serial bus controller [0c80]: NVIDIA Corporation Device 1ad7 (rev a1)
1a:00(dot)0 VGA compatible controller: NVIDIA Corporation Device 1e04 (rev a1)
1a:00(dot)1 Audio device: NVIDIA Corporation Device 10f7 (rev a1)
1a:00(dot)2 USB controller: NVIDIA Corporation Device 1ad6 (rev a1)
1a:00(dot)3 Serial bus controller [0c80]: NVIDIA Corporation Device 1ad7 (rev a1)
67:00(dot)0 VGA compatible controller: NVIDIA Corporation Device 1e04 (rev a1)
67:00(dot)1 Audio device: NVIDIA Corporation Device 10f7 (rev a1)
67:00(dot)2 USB controller: NVIDIA Corporation Device 1ad6 (rev a1)
67:00(dot)3 Serial bus controller [0c80]: NVIDIA Corporation Device 1ad7 (rev a1)
68:00(dot)0 VGA compatible controller: NVIDIA Corporation Device 1e04 (rev a1)
68:00(dot)1 Audio device: NVIDIA Corporation Device 10f7 (rev a1)
68:00(dot)2 USB controller: NVIDIA Corporation Device 1ad6 (rev a1)
68:00(dot)3 Serial bus controller [0c80]: NVIDIA Corporation Device 1ad7 (rev a1)

uname -m && cat /etc/*release resulted in

x86_64
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18(dot)04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18(dot)04(dot)3 LTS"
NAME="Ubuntu"
VERSION="18(dot)04(dot)3 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18(dot)04(dot)3 LTS"
VERSION_ID="18(dot)04"
HOME_URL="://(dot)ubuntu (dot)/"
SUPPORT_URL="://help(dot)ubuntu (dot)/"
BUG_REPORT_URL="://bugs(dot)launchpad (dot)net/ubuntu/"
PRIVACY_POLICY_URL="://(dot)ubuntu (dot)/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic

gcc --version resulted in

gcc (Ubuntu 7(dot)5(dot)0-3ubuntu1~18(dot)04) 7(dot)5(dot)0
Copyright (C) 2017 Free Software Foundation, Inc(dot)
This is free software; see the source for copying conditions(dot)  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE(dot)

uname -r results in

5(dot)4(dot)0-51-generic

sudo apt-get install linux-headers-$(uname -r) results in

Reading package lists(dot)(dot)(dot) Done
Building dependency tree       
Reading state information(dot)(dot)(dot) Done
linux-headers-5(dot)4(dot)0-51-generic is already the newest version (5(dot)4(dot)0-51(dot)56~18(dot)04(dot)1)(dot)
linux-headers-5(dot)4(dot)0-51-generic set to manually installed(dot)
The following packages were automatically installed and are no longer required:
  dkms libaccinj64-10(dot)0 libatomic1:i386 libboost-python1(dot)65(dot)1 libbsd0:i386 libc-ares2 libcublas10(dot)0 libcudnn7 libcufft10(dot)0 libcufftw10(dot)0 libcuinj64-10(dot)0 libcupti-dev libcupti-doc libcupti10(dot)0 libcurand10(dot)0
  libcusolver10(dot)0 libcusparse10(dot)0 libdrm-amdgpu1:i386 libdrm-intel1:i386 libdrm-nouveau2:i386 libdrm-radeon1:i386 libdrm2:i386 libedit2:i386 libelf1:i386 libexpat1:i386 libffi6:i386 libgflags2(dot)2 libgl1:i386
  libgl1-mesa-dri:i386 libglapi-mesa:i386 libglvnd0:i386 libglx-mesa0:i386 libglx0:i386 libgoogle-glog0v5 libgrpc7 libjs-sphinxdoc libleveldb1v5 libllvm10:i386 liblmdb0 libnppc10(dot)0 libnppial10(dot)0 libnppicc10(dot)0
  libnppicom10(dot)0 libnppidei10(dot)0 libnppif10(dot)0 libnppig10(dot)0 libnppim10(dot)0 libnppist10(dot)0 libnppisu10(dot)0 libnppitc10(dot)0 libnpps10(dot)0 libnvblas10(dot)0 libnvgraph10(dot)0 libnvidia-cfg1-450 libnvidia-common-450
  libnvidia-compute-450:i386 libnvidia-decode-450 libnvidia-decode-450:i386 libnvidia-encode-450 libnvidia-encode-450:i386 libnvidia-extra-450 libnvidia-extra-450:i386 libnvidia-fbc1-450 libnvidia-fbc1-450:i386
  libnvidia-gl-450 libnvidia-gl-450:i386 libnvidia-ifr1-450 libnvidia-ifr1-450:i386 libnvrtc10(dot)0 libnvtoolsext1 libnvvm3 libpciaccess0:i386 libprotobuf18 libprotoc18 libsensors4:i386 libsleef3 libstdc++6:i386
  libthrust-dev libvdpau-dev libx11-6:i386 libx11-xcb1:i386 libxau6:i386 libxcb-dri2-0:i386 libxcb-dri3-0:i386 libxcb-glx0:i386 libxcb-present0:i386 libxcb-sync1:i386 libxcb1:i386 libxdamage1:i386 libxdmcp6:i386
  libxext6:i386 libxfixes3:i386 libxnvctrl0 libxshmfence1:i386 libxxf86vm1:i386 pkg-config protobuf-compiler python-absl python-astor python-cffi python-configparser python-future python-gast python-grpcio
  python-leveldb python-networkx python-pasta python-ply python-protobuf python-pycparser python-pywt python-skimage python-skimage-lib python-termcolor python-typing python-wrapt python3-absl python3-astor
  python3-cffi python3-future python3-gast python3-grpcio python3-leveldb python3-markdown python3-networkx python3-pasta python3-ply python3-pycparser python3-pyinotify python3-pywt python3-skimage python3-skimage-lib
  python3-tensorflow-serving python3-termcolor python3-werkzeug python3-wrapt screen-resolution-extra xserver-xorg-video-nvidia-450
Use 'sudo apt autoremove' to remove them(dot)
0 upgraded, 0 newly installed, 0 to remove and 179 not upgraded(dot)

Section 2(dot)7(dot) Handle Conflicting Installation Methods

I ran the following commands

sudo /usr/bin/nvidia-uninstall
sudo apt-get --purge remove cuda*  
sudo apt-get --purge remove nvidia*  
sudo apt-get --purge remove libcuda*  

I tried looking for

sudo /usr/local/cuda-X(dot)Y/bin/uninstall_cuda_X(dot)Y(dot)pl

But there wasn’t any file with that name in bin, so I don’t think the previous cuda was installed with runfile(dot)

I checked both nvidia-smi and nvcc -V, and both times the command wasn’t found(dot) But when I ran the installer, I kept getting a warning that there is a previous installation,

Existing package manager installation of the driver found(dot) It is strongly recommended that you remove this before continuing(dot)

so I tried some other methods to remove the cuda installations

sudo apt-get --purge remove cuda-11(dot)0
sudo apt-get --purge remove cuda-11(dot)1 
sudo apt-get --purge remove cuda-10(dot)0 
sudo apt-get purge nvidia*
sudo apt-get remove --purge cuda-* libcuda* nvidia* 
sudo rm /etc/apt/sources(dot)list(dot)d/cuda*
sudo apt remove --autoremove nvidia-cuda-toolkit
sudo dpkg -l | grep nvidia
sudo apt purge cuda
sudo apt purge -y nvidia
sudo apt remove -y nvidia-*
sudo rm /etc/apt/sources(dot)list(dot)d/cuda*
sudo apt autoremove -y && apt autoclean -y
sudo rm -rf /usr/local/cuda*
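To see what is still left after purges like these, the dpkg listing can be filtered. This is only a sketch: list_gpu_packages is a hypothetical helper name, and it assumes the usual dpkg -l column layout (status in the first column, package name in the second).

```shell
#!/usr/bin/env bash
# list_gpu_packages: reads `dpkg -l` output on stdin and prints the status and
# name of every package matching nvidia/cuda that is still installed (ii) or
# has only its config files left behind (rc).
list_gpu_packages() {
    awk '($1 == "ii" || $1 == "rc") && $2 ~ /nvidia|cuda/ { print $1, $2 }'
}

# Usage on a real system:
#   dpkg -l | list_gpu_packages
```

One thing to watch for with the commands above: an unquoted pattern such as cuda* is expanded by the shell first if a matching file exists in the current directory (for example a downloaded cuda_*.run file), so quoting it as 'cuda*' is the safer form.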

Section 6(dot) Runfile Installation

6(dot)3(dot) Disabling Nouveau

I ran the following commands

touch /etc/modprobe(dot)d/blacklist-nouveau(dot)conf

And added

blacklist nouveau
options nouveau modeset=0

To that file(dot) Then I executed

sudo update-initramfs -u

Which resulted in

update-initramfs: Generating /boot/initrd(dot)img-5(dot)4(dot)0-52-generic

I then tested lsmod | grep nouveau to see if it prints anything, and it didn’t(dot)
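A quick way to double-check the blacklist file itself, as a sketch: nouveau_blacklisted is a hypothetical helper name, and the file path is the one from the guide.

```shell
#!/usr/bin/env bash
# nouveau_blacklisted FILE: succeeds when FILE contains both lines the guide
# asks for, each as an exact whole-line match.
nouveau_blacklisted() {
    grep -qx 'blacklist nouveau' "$1" && grep -qx 'options nouveau modeset=0' "$1"
}

conf=/etc/modprobe.d/blacklist-nouveau.conf
if [ -r "$conf" ] && nouveau_blacklisted "$conf"; then
    echo "nouveau blacklist present in $conf"
fi
```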

I then tried this installation

://developer(dot)nvidia (dot)/cuda-downloads?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=runfilelocal

Which gave these commands

wget ://developer(dot)download(dot)nvidia (dot)/compute/cuda/11(dot)1(dot)0/local_installers/cuda_11(dot)1(dot)0_455(dot)23(dot)05_linux(dot)run
sudo sh cuda_11(dot)1(dot)0_455(dot)23(dot)05_linux(dot)run

I downloaded the installer and ran sudo sh cuda_11(dot)1(dot)0_455(dot)23(dot)05_linux(dot)run

Which resulted in this message

 Installation failed(dot) See log at /var/log/cuda-installer(dot)log for details(dot)

I opened that file, and this was the contents

[INFO]: Driver not installed(dot)
[INFO]: Checking compiler version(dot)(dot)(dot)
[INFO]: gcc location: /usr/bin/gcc

[INFO]: gcc version: gcc version 7(dot)5(dot)0 (Ubuntu 7(dot)5(dot)0-3ubuntu1~18(dot)04)

[INFO]: Initializing menu
[INFO]: Setup complete
[INFO]: Components to install:
[INFO]: Driver
[INFO]: 455(dot)23(dot)05
[INFO]: Executing NVIDIA-Linux-x86_64-455(dot)23(dot)05(dot)run --ui=none --no-questions --accept-license --disable-nouveau --no-cc-version-check --install-libglvnd  2>&1
[INFO]: Finished with code: 256
[ERROR]: Install of driver component failed(dot)
[ERROR]: Install of 455(dot)23(dot)05 failed, quitting

So it looks like the installation is failing at the driver(dot) I’m not sure what may have been causing this error since 11(dot)0 had been previously installed onto the GPU(dot)
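The CUDA installer log above only records that the driver sub-installer exited with code 256. The NVIDIA-Linux-x86_64-*.run driver installer normally keeps its own log, and the sketch below assumes the default location /var/log/nvidia-installer.log; show_install_errors is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# show_install_errors LOGFILE: prints every line mentioning ERROR in the log.
show_install_errors() {
    grep 'ERROR' "$1"
}

# Check the CUDA installer log and, if present, the driver installer's own log
# (the second path is the assumed default, not something shown in this post).
for log in /var/log/cuda-installer.log /var/log/nvidia-installer.log; do
    if [ -r "$log" ]; then
        echo "== $log =="
        # Fall back to the tail of the log when no ERROR line is found.
        show_install_errors "$log" || tail -n 20 "$log"
    fi
done
```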

I then tried to install via deb

://developer(dot)nvidia (dot)/cuda-downloads?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=deblocal

Which gave these commands

wget ://developer(dot)download(dot)nvidia(dot)/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804(dot)pin
sudo mv cuda-ubuntu1804(dot)pin /etc/apt/preferences(dot)d/cuda-repository-pin-600
wget https://developer (dot)download(dot)nvidia (dot)com/compute/cuda/11(dot)1(dot)0/local_installers/cuda-repo-ubuntu1804-11-1-local_11(dot)1(dot)0-455(dot)23(dot)05-1_amd64(dot)deb
sudo dpkg -i cuda-repo-ubuntu1804-11-1-local_11(dot)1(dot)0-455(dot)23(dot)05-1_amd64(dot)deb
sudo apt-key add /var/cuda-repo-ubuntu1804-11-1-local/7fa2af80(dot)pub
sudo apt-get update
sudo apt-get -y install cuda

Every command ran without issue except the last one, sudo apt-get -y install cuda, which gave this output

Reading package lists(dot)(dot)(dot) Done
Building dependency tree       
Reading state information(dot)(dot)(dot) Done
Some packages could not be installed(dot) This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming(dot)
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 cuda : Depends: cuda-11-1 (>= 11(dot)1(dot)0) but it is not going to be installed
E: Unable to correct problems, you have held broken packages(dot)
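apt only names the first blocked dependency here, so one way to dig further is to pull that package name out of the message and ask apt about it directly. A sketch, with blocked_deps as a hypothetical helper name:

```shell
#!/usr/bin/env bash
# blocked_deps: reads apt-get error output on stdin and prints each package
# named in a "Depends: X ... but it is not going to be installed" line.
blocked_deps() {
    sed -n 's/.*Depends: \([^ ]*\).*but it is not going to be installed.*/\1/p'
}

# Usage on a real system:
#   sudo apt-get -y install cuda 2>&1 | blocked_deps
# Then repeat the install for the printed package (here cuda-11-1) to see the
# next dependency in the chain, and check for held packages:
#   sudo apt-get install cuda-11-1
#   apt-mark showhold
```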

In trying to troubleshoot the driver install, I found that sudo apt install nvidia-450-dev might work instead, so I tried it, and it worked

nvidia-smi

Showed the following

Mon Oct 26 18:27:49 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450(dot)66       Driver Version: 450(dot)66       CUDA Version: 11(dot)0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp(dot)A | Volatile Uncorr(dot) ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M(dot) |
|                               |                      |               MIG M(dot) |
|===============================+======================+======================|
|   0  GeForce RTX 208(dot)(dot)(dot)  Off  | 00000000:19:00(dot)0 Off |                  N/A |
| 22%   31C    P8     1W / 250W |      6MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208(dot)(dot)(dot)  Off  | 00000000:1A:00(dot)0 Off |                  N/A |
| 22%   35C    P8     4W / 250W |      6MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208(dot)(dot)(dot)  Off  | 00000000:67:00(dot)0 Off |                  N/A |
| 22%   37C    P8     6W / 250W |      6MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208(dot)(dot)(dot)  Off  | 00000000:68:00(dot)0 Off |                  N/A |
| 22%   39C    P8     1W / 250W |     26MiB / 11016MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1314      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      1314      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A      1314      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A      1314      G   /usr/lib/xorg/Xorg                  9MiB |
|    3   N/A  N/A      1653      G   /usr/bin/gnome-shell               14MiB |
+-----------------------------------------------------------------------------+

However, this driver supports CUDA 11(dot)0, not 11(dot)1(dot) It is also only the driver, not the runtime API:

Running nvcc -V gives “bash: /usr/bin/nvcc: No such file or directory”

So I then tried installing an older version of the CUDA toolkit, 11(dot)0 instead of 11(dot)1, as the runtime API version should be lower than or equal to the version the driver supports(dot)

From

://developer(dot)nvidia (dot)/cuda-11(dot)0-download-archive

I selected this install
://developer(dot)nvidia (dot)/cuda-11(dot)0-download-archive?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=runfilelocal

Which gave the following commands,

wget ://developer (dot)download (dot)nvidia (dot)/compute/cuda/11(dot)0(dot)2/local_installers/cuda_11(dot)0(dot)2_450(dot)51(dot)05_linux(dot)run
sudo sh cuda_11(dot)0(dot)2_450(dot)51(dot)05_linux(dot)run

After downloading the installer, I ran sudo sh cuda_11(dot)0(dot)2_450(dot)51(dot)05_linux(dot)run

It first gave me the warning about a previous installation again, probably because of the driver installation(dot) I chose to continue, since I would only be installing the toolkit and not the driver, and selected everything except the Driver

 CUDA Installer                                                               │
│ - [ ] Driver                                                                 │
│      [ ] 450(dot)51(dot)05                                                           │
│ + [X] CUDA Toolkit 11(dot)0                                                      │
│   [X] CUDA Samples 11(dot)0                                                      │
│   [X] CUDA Demo Suite 11(dot)0                                                   │
│   [X] CUDA Documentation 11(dot)0                                                │
│   Options                                                                    │
│   Install                                                                    │
│                                                                              │
│                                                                              │
│                         

After the installation, I got this message

===========
= Summary =
===========

Driver:   Not Selected
Toolkit:  Installed in /usr/local/cuda-11(dot)0/
Samples:  Installed in /home/santosh/, but missing recommended libraries

Please make sure that
 -   PATH includes /usr/local/cuda-11(dot)0/bin
 -   LD_LIBRARY_PATH includes /usr/local/cuda-11(dot)0/lib64, or, add /usr/local/cuda-11(dot)0/lib64 to /etc/ld(dot)so(dot)conf and run ldconfig as root

To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-11(dot)0/bin

Please see CUDA_Installation_Guide_Linux(dot)pdf in /usr/local/cuda-11(dot)0/doc/pdf for detailed information on setting up CUDA(dot)
***WARNING: Incomplete installation! This installation did not install the CUDA Driver(dot) A driver of version at least (dot)00 is required for CUDA 11(dot)0 functionality to work(dot)
To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:
    sudo <CudaInstaller>(dot)run --silent --driver

Logfile is /var/log/cuda-installer(dot)log

I added /usr/local/cuda-11(dot)0/bin to PATH and set LD_LIBRARY_PATH to /usr/local/cuda-11(dot)0/lib64
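For reference, this is roughly what that looks like as a sketch, assuming the install prefix from the summary above. The variable name CUDA_HOME is just a convenience here. Exports like these only affect the current shell unless they are also added to a startup file such as ~/.bashrc.

```shell
#!/usr/bin/env bash
# Assumed install prefix, taken from the installer summary.
CUDA_HOME=/usr/local/cuda-11.0

# Prepend the toolkit directories; ${VAR:+...} avoids a stray colon when the
# variable was previously empty or unset.
export PATH="$CUDA_HOME/bin${PATH:+:$PATH}"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Forget cached command locations (bash can remember an old /usr/bin/nvcc),
# then report which nvcc, if any, is found first.
hash -r
command -v nvcc || echo "nvcc not found on PATH"
```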

I then attempted the post installation instructions here ://docs (dot)nvidia (dot)com/cuda/cuda-installation-guide-linux/index (dot)#power9-setup

systemctl status nvidia-persistenced resulted in “Unit nvidia-persistenced(dot)service could not be found(dot)”

sudo systemctl enable nvidia-persistenced resulted in

The unit files have no installation config (WantedBy, RequiredBy, Also, Alias
settings in the [Install] section, and DefaultInstance for template units)(dot)
This means they are not meant to be enabled using systemctl(dot)
Possible reasons for having this kind of units are:
1) A unit may be statically enabled by being symlinked from another unit's
   (dot)wants/ or (dot)requires/ directory(dot)
2) A unit's purpose may be to act as a helper for some other unit which has
   a requirement dependency on it(dot)
3) A unit may be started when needed via activation (socket, path, timer,
   D-Bus, udev, scripted systemctl call, (dot)(dot)(dot))(dot)
4) In case of template units, the unit is meant to be enabled with some
   instance name specified(dot)

I was able to do the udev rule instructions without issue; I ran the following commands

sudo cp /lib/udev/rules(dot)d/40-vm-hotadd(dot)rules /etc/udev/rules(dot)d
sudo sed -i '/SUBSYSTEM=="memory", ACTION=="add"/d' /etc/udev/rules(dot)d/40-vm-hotadd(dot)rules

I tried nvcc -V just to check if the installation somehow worked otherwise(dot) This time I got this message

Command 'nvcc' not found, but can be installed with:

sudo apt install nvidia-cuda-toolkit

So I tried the command, and it seemed to install with no issues(dot) When I ran nvcc -V again, I got this message

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10(dot)0, V10(dot)0(dot)130

Which is the version of CUDA that I started with(dot)

Looking at this message

://forums (dot)developer (dot)nvidia (dot)com/t/cuda-10-installation-problems-on-ubuntu-18-04/68615

follow the instructions in the linux install guide: ://docs (dot)nvidia (dot)/cuda/cuda-installation-guide-linux/index(dot)html

get your installers from ://(dot)nvidia (dot)com/getcuda

Now that you’ve already installed the wrong drivers, read the linux install guide carefully(dot) Failure to follow it carefully will result in more trouble(dot)

It seems that the alternative ways of installing the driver and toolkit (with sudo apt install nvidia-450-dev and sudo apt install nvidia-cuda-toolkit) are not recommended, and that the installation guide should be followed exactly(dot)

However, I followed the instructions, and the driver could not be installed(dot) Driver installation doesn’t seem impossible, as the alternative command somehow worked, but the error log didn’t give me any insight into how I might be able to install it the official way(dot)

I solved the issue. The hardware came with its own installation files for CUDA, which I didn’t know about. Once those were blocked, installation worked perfectly.