Failed to install both the CUDA 11.3 Toolkit and NVIDIA driver 470

Briefly, I have been given access to a GPU server with NVIDIA Corporation GA100 [A100 SXM4 40GB] GPUs. The provider told me that one of the GPUs is allocated to me, and I access the server as a user over SSH.
I tried to install the CUDA 11.3 toolkit, which is the latest toolkit version supported by PyTorch.
The problems are:

  1. Installing CUDA 11.3 failed when I followed the official NVIDIA guide using the dpkg .deb file:
    CUDA Toolkit 11.3 Update 1 Downloads | NVIDIA Developer
  2. Thinking the cause was that I hadn't installed the driver before the toolkit, I tried to install driver 470 using the .run file. It fails with the error message: ERROR: An NVIDIA kernel module 'nvidia-uvm' appears to already be loaded in your kernel.

1. Installing the CUDA 11.3 toolkit following the official documentation

sudo apt-get -y install cuda

This installs CUDA 11.5, so I changed the command to

sudo apt-get -y install cuda-11.3

This failed with the following output:

You might want to run 'apt --fix-broken install' to correct these.
The following packages have unmet dependencies:
cuda-drivers-495 : 
Depends: nvidia-compute-utils-495 (>= 495.29.05) but it is not installed
Depends: nvidia-utils-495 (>= 495.29.05) but it is not installed
nvidia-driver-495 : 
Depends: nvidia-compute-utils-495 (= 495.29.05-0ubuntu1) but it is not installed
Depends: nvidia-utils-495 (= 495.29.05-0ubuntu1) but it is not installed
Recommends: libnvidia-compute-495:i386 (= 495.29.05-0ubuntu1) but it is not installable
Recommends: libnvidia-decode-495:i386 (= 495.29.05-0ubuntu1) but it is not installable
Recommends: libnvidia-encode-495:i386 (= 495.29.05-0ubuntu1) but it is not installable
Recommends: libnvidia-fbc1-495:i386 (= 495.29.05-0ubuntu1) but it is not installable
Recommends: libnvidia-gl-495:i386 (= 495.29.05-0ubuntu1) but it is not installable
E: Unmet dependencies. Try 'apt --fix-broken install' with no packages (or specify a solution).

cuda-drivers-495 and nvidia-driver-495 were reported as broken packages. Running `apt --fix-broken install` did not resolve this; it failed with another error:

dpkg: error processing archive /var/cache/apt/archives/nvidia-compute-utils-495_495.29.05-0ubuntu1_amd64.deb (--unpack):
 unable to make backup link of './usr/bin/nvidia-cuda-mps-control' before installing new version: Invalid cross-device link
dpkg-deb: error: paste subprocess was killed by signal (Broken pipe)
Preparing to unpack .../nvidia-utils-495_495.29.05-0ubuntu1_amd64.deb ...
Unpacking nvidia-utils-495 (495.29.05-0ubuntu1) ...
dpkg: error processing archive /var/cache/apt/archives/nvidia-utils-495_495.29.05-0ubuntu1_amd64.deb (--unpack):
 unable to make backup link of './usr/bin/nvidia-debugdump' before installing new version: Invalid cross-device link
dpkg-deb: error: paste subprocess was killed by signal (Broken pipe)
Errors were encountered while processing:
 /var/cache/apt/archives/nvidia-compute-utils-495_495.29.05-0ubuntu1_amd64.deb
 /var/cache/apt/archives/nvidia-utils-495_495.29.05-0ubuntu1_amd64.deb
E: Sub-process /usr/bin/dpkg returned an error code (1)

Running `dpkg` with the `--force-overwrite` option did not resolve the problem either.
I think the crucial parts of the error are the following:

 unable to make backup link of './usr/bin/nvidia-cuda-mps-control' before installing new version: Invalid cross-device link
 unable to make backup link of './usr/bin/nvidia-debugdump' before installing new version: Invalid cross-device link
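
A note on this message: "Invalid cross-device link" is the text for the EXDEV error, which dpkg hits when it tries to hard-link the existing file to a backup name and the link would cross a mount boundary. One situation where that can happen, and this is an assumption about this server rather than something confirmed in the thread, is a container in which the NVIDIA binaries are bind-mounted from the host. A minimal sketch of a check (nvidia_mount_points is a hypothetical helper name) lists mount points whose path contains "nvidia":

```shell
# Hypothetical helper: print mount points whose path mentions "nvidia".
# Input is text in /proc/mounts format, where field 2 is the mount point.
nvidia_mount_points() {
    awk '$2 ~ /nvidia/ { print $2 }'
}

# Run against the live mount table when it is readable.
if [ -r /proc/mounts ]; then
    nvidia_mount_points < /proc/mounts
fi
```

If /usr/bin/nvidia-debugdump or /usr/bin/nvidia-cuda-mps-control appears in the output, that file is its own mount point, and a package manager running in that environment most likely cannot replace it.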

I could not resolve this, so I uninstalled the cuda-drivers-495 and nvidia-driver-495 packages using dpkg -r [package] and removed the remaining files following the official documentation (Installation Guide Linux :: CUDA Toolkit Documentation).
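
For reference, the cuda and cuda-11-3 meta-packages depend on a driver package, which is how cuda-drivers-495 ended up in the transaction. NVIDIA's apt repository also provides a toolkit-only meta-package, cuda-toolkit-11-3, which does not include the driver. The sketch below only builds that package name from a version string (toolkit_pkg is a hypothetical helper); the install lines are left as comments because they need root and the NVIDIA repository to be configured.

```shell
# Hypothetical helper: map a CUDA version such as 11.3 to the
# toolkit-only meta-package name used in NVIDIA's apt repository.
toolkit_pkg() {
    printf 'cuda-toolkit-%s\n' "$(printf '%s' "$1" | tr '.' '-')"
}

toolkit_pkg 11.3   # prints cuda-toolkit-11-3

# Simulate first (-s changes nothing on the system), then install:
#   apt-get -s install "$(toolkit_pkg 11.3)"
#   sudo apt-get -y install "$(toolkit_pkg 11.3)"
```

Simulating first shows whether any driver package would still be pulled in before anything is changed.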

2. Failed to install driver 470

I thought that I should install the driver before installing the CUDA Toolkit, and that the 495 driver provided by the apt package was not suitable for my A100 GPU, so I tried to install the driver from the .run file.

sh NVIDIA-Linux-x86_64-470.82.01.run
>>>
creation time: Sun Nov  7 19:04:52 2021
installer version: 470.82.01

PATH: /home/innoacad04/anaconda3/envs/fsdl-text-recognizer-2021/bin:/home/innoacad04/anaconda3/condabin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

nvidia-installer command line:
    ./nvidia-installer

Using: nvidia-installer ncurses v6 user interface
-> Detected 128 CPUs online; setting concurrency level to 32.
ERROR: An NVIDIA kernel module 'nvidia-uvm' appears to already be loaded in your kernel.  This may be because it is in use (for example, by an X server, a CUDA program, or the NVIDIA Persistence Daemon), but this may also happen if your kernel was configured without support for module unloading.  Please be sure to exit any programs that may be using the GPU(s) before attempting to upgrade your driver.  If no GPU-based programs are running, you know that your kernel supports module unloading, and you still receive this message, then an error may have occurred that has corrupted an NVIDIA kernel module's usage count, for which the simplest remedy is to reboot your computer.
ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

The installer keeps failing with this error. I tried to unload the nvidia-uvm module:

lsmod | grep nvidia
nvidia_uvm           1011712  0
nvidia_drm             49152  0
nvidia_modeset       1183744  2 nvidia_drm
nvidia              19722240  405 nvidia_uvm,nvidia_modeset
drm_kms_helper        184320  4 ast,nvidia_drm
drm                   491520  6 drm_kms_helper,drm_vram_helper,ast,nvidia_drm,ttm
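
The third column of lsmod is the use count. A small sketch (busy_nvidia_modules is a hypothetical helper name) that filters lsmod output down to the NVIDIA modules that are still referenced:

```shell
# Hypothetical helper: from lsmod-style text on stdin, print the NVIDIA
# modules whose use count (column 3) is greater than zero.
busy_nvidia_modules() {
    awk '$1 ~ /^nvidia/ && $3 > 0 { print $1, $3 }'
}

# Run against the live module list when lsmod is available.
if command -v lsmod >/dev/null 2>&1; then
    lsmod | busy_nvidia_modules
fi
```

Applied to the output above, it prints nvidia_modeset 2 and nvidia 405, so the core nvidia module is referenced 405 times. On a server shared with other users, those references are probably other processes, and the driver cannot be replaced while they are running.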

but removing the module failed:

rmmod nvidia-uvm (also tried rmmod -r nvidia-uvm)
rmmod: ERROR: ../libkmod/libkmod-module.c:799 kmod_module_remove_module() could not remove 'nvidia_uvm': Operation not permitted
rmmod: ERROR: could not remove module nvidia-uvm: Operation not permitted

I also tried

modprobe -r nvidia-uvm

It prints nothing. After running it, I tried the .run installer again, but it failed in the same way.
I think my permissions are restricted by the server owner.
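
One way to test that guess: "Operation not permitted" from rmmod, even at a root prompt, is what the kernel returns when the process lacks the CAP_SYS_MODULE capability, for example inside a container with default settings (whether this server uses containers is an assumption). CAP_SYS_MODULE is capability number 16, so it corresponds to bit 16 of the CapEff mask in /proc/self/status. A minimal sketch, with has_cap_sys_module as a hypothetical helper name:

```shell
# Hypothetical helper: succeed if the hexadecimal capability mask in $1
# (a CapEff value from /proc/<pid>/status) has bit 16, CAP_SYS_MODULE, set.
has_cap_sys_module() {
    [ $(( (0x$1 >> 16) & 1 )) -eq 1 ]
}

# Check the capabilities that commands started from this shell receive.
if [ -r /proc/self/status ]; then
    cap_eff=$(awk '/^CapEff:/ { print $2 }' /proc/self/status)
    if [ -n "$cap_eff" ] && has_cap_sys_module "$cap_eff"; then
        echo "CAP_SYS_MODULE present: module load/unload is permitted here"
    else
        echo "CAP_SYS_MODULE missing: module load/unload is not permitted here"
    fi
fi
```

If the capability is missing, no command run from this account can unload nvidia-uvm, and changing the driver is something only the server administrator can do on the host.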

Conclusion

How can I solve the problem? My goal is to use the GPU from PyTorch. I don't mind uninstalling and reinstalling anything.
I want to install CUDA Toolkit 11.3 and an NVIDIA driver that supports the A100.
→ On the official site, driver 470 is listed for toolkit 11.4 and 460.106.00 for toolkit 11.2
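
On the version question, a hedged note: NVIDIA's CUDA compatibility documentation describes minor-version compatibility, under which CUDA 11.x applications run on Linux drivers from 450.80.02 upward, and the PyTorch binaries from conda or pip ship their own CUDA runtime, so a working driver may be all that PyTorch needs. The sketch below compares the installed driver version against that minimum. version_ge and MIN_DRIVER are hypothetical names, and the threshold is an assumption taken from that documentation, so check it against the current release notes.

```shell
# Assumed minimum Linux driver for CUDA 11.x minor-version compatibility.
MIN_DRIVER=450.80.02

# Hypothetical helper: succeed when dotted version $1 >= $2,
# comparing the fields numerically from left to right.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = split(a, x, "."); m = split(b, y, ".")
        len = (n > m) ? n : m
        for (i = 1; i <= len; i++) {
            xi = x[i] + 0; yi = y[i] + 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}

# Query the running driver only when nvidia-smi is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
    if version_ge "$driver" "$MIN_DRIVER"; then
        echo "driver $driver meets the assumed minimum $MIN_DRIVER"
    else
        echo "driver '$driver' is below the assumed minimum $MIN_DRIVER"
    fi
fi
```

If the driver is new enough, running python -c "import torch; print(torch.cuda.is_available())" in the conda environment shows whether PyTorch can already see the GPU without a system-wide toolkit.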

My Environment

GPU

# lspci | grep -i nvidia
>>>
01:00.0 3D controller: NVIDIA Corporation GA100 [A100 SXM4 40GB] (rev a1)
47:00.0 3D controller: NVIDIA Corporation GA100 [A100 SXM4 40GB] (rev a1)
81:00.0 3D controller: NVIDIA Corporation GA100 [A100 SXM4 40GB] (rev a1)
c1:00.0 VGA compatible controller: NVIDIA Corporation TU117GLM [Quadro T1000 Mobile] (rev a1)
c1:00.1 Audio device: NVIDIA Corporation Device 10fa (rev a1)
c2:00.0 3D controller: NVIDIA Corporation GA100 [A100 SXM4 40GB] (rev a1)

My Linux

# uname -m && cat /etc/*release
>>>
x86_64
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.1 LTS"
NAME="Ubuntu"
VERSION="20.04.3 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.3 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

Dear taekkim,
I have a similar issue.
Have you found a solution? Could you share it with us?
Regards,
Fabiano

@taekkim, @ftarlao
The issue here is that you didn't remove the existing drivers and CUDA files first.
sudo apt-get remove --purge 'nvidia-*' -y
sudo ubuntu-drivers autoinstall
This will install the driver. After that, redo the steps on the official page to install CUDA.