In brief, I have been given access to a GPU server with NVIDIA Corporation GA100 [A100 SXM4 40GB] GPUs. The provider told me that one of the GPUs is allocated to me, and I access the server as a user over SSH.
I tried to install the CUDA 11.3 toolkit, which is the latest toolkit version supported by PyTorch.
The problems are:
- Installing CUDA 11.3 failed when I followed the official NVIDIA guide using the dpkg .deb file (CUDA Toolkit 11.3 Update 1 Downloads | NVIDIA Developer).
- I thought the cause was that I had not installed the driver before installing the CUDA toolkit, so I tried to install driver 470 using the .run file. It fails with this error message:
ERROR: An NVIDIA kernel module 'nvidia-uvm' appears to already be loaded in your kernel.
1. Installing the CUDA 11.3 toolkit following the official documentation
sudo apt-get -y install cuda
This installs the package as cuda-11.5, so I changed the command to
sudo apt-get -y install cuda-11.3
The result was this error:
You might want to run 'apt --fix-broken install' to correct these.
The following packages have unmet dependencies:
cuda-drivers-495 :
Depends: nvidia-compute-utils-495 (>= 495.29.05) but it is not installed
Depends: nvidia-utils-495 (>= 495.29.05) but it is not installed
nvidia-driver-495 :
Depends: nvidia-compute-utils-495 (= 495.29.05-0ubuntu1) but it is not installed
Depends: nvidia-utils-495 (= 495.29.05-0ubuntu1) but it is not installed
Recommends: libnvidia-compute-495:i386 (= 495.29.05-0ubuntu1) but it is not installable
Recommends: libnvidia-decode-495:i386 (= 495.29.05-0ubuntu1) but it is not installable
Recommends: libnvidia-encode-495:i386 (= 495.29.05-0ubuntu1) but it is not installable
Recommends: libnvidia-fbc1-495:i386 (= 495.29.05-0ubuntu1) but it is not installable
Recommends: libnvidia-gl-495:i386 (= 495.29.05-0ubuntu1) but it is not installable
E: Unmet dependencies. Try 'apt --fix-broken install' with no packages (or specify a solution).
cuda-drivers-495 and nvidia-driver-495 were broken packages. I could not resolve this by running `apt --fix-broken install`; it produced another error:
dpkg: error processing archive /var/cache/apt/archives/nvidia-compute-utils-495_495.29.05-0ubuntu1_amd64.deb (--unpack):
unable to make backup link of './usr/bin/nvidia-cuda-mps-control' before installing new version: Invalid cross-device link
dpkg-deb: error: paste subprocess was killed by signal (Broken pipe)
Preparing to unpack .../nvidia-utils-495_495.29.05-0ubuntu1_amd64.deb ...
Unpacking nvidia-utils-495 (495.29.05-0ubuntu1) ...
dpkg: error processing archive /var/cache/apt/archives/nvidia-utils-495_495.29.05-0ubuntu1_amd64.deb (--unpack):
unable to make backup link of './usr/bin/nvidia-debugdump' before installing new version: Invalid cross-device link
dpkg-deb: error: paste subprocess was killed by signal (Broken pipe)
Errors were encountered while processing:
/var/cache/apt/archives/nvidia-compute-utils-495_495.29.05-0ubuntu1_amd64.deb
/var/cache/apt/archives/nvidia-utils-495_495.29.05-0ubuntu1_amd64.deb
E: Sub-process /usr/bin/dpkg returned an error code (1)
Running dpkg with the `--force-overwrite` option did not resolve the problem either.
I think the crucial part of the error is the following:
unable to make backup link of './usr/bin/nvidia-cuda-mps-control' before installing new version: Invalid cross-device link
unable to make backup link of './usr/bin/nvidia-debugdump' before installing new version: Invalid cross-device link
I could not resolve this, so I uninstalled the cuda-drivers-495 and nvidia-driver-495 packages using dpkg -r [package]
and removed the leftovers following the official documentation (Installation Guide Linux :: CUDA Toolkit Documentation).
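One thing that may be worth checking for the "Invalid cross-device link" lines: dpkg makes its backup copy with a hard link, and a hard link fails with that message when the two paths are on different mounts. That would be the case if files such as /usr/bin/nvidia-debugdump were bind-mounted into my environment (for example by a container runtime). This is only an assumption and nothing in the logs above confirms it. A minimal sketch to look for such mounts, where the function name nvidia_bind_mounts is made up for illustration:

```shell
#!/bin/sh
# Print every mount point whose path contains "nvidia".
# Reads mount-table text in the /proc/mounts layout from stdin
# (field 1 = source, field 2 = mount point).
# nvidia_bind_mounts is a hypothetical helper name, not an existing tool.
nvidia_bind_mounts() {
    awk '$2 ~ /nvidia/ { print $2 }'
}

# Usage on the server (assumes /proc/mounts is readable by my user):
#   nvidia_bind_mounts < /proc/mounts
```

If this prints paths such as /usr/bin/nvidia-cuda-mps-control, those files are mount points of their own, and a package manager inside the environment cannot replace them.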
2. Failed to install driver 470
I thought that I should install the driver before installing the CUDA Toolkit, and that the 495 driver provided by the apt package was not suitable for my A100 GPU, so I tried to install the driver from the .run file.
sh NVIDIA-Linux-x86_64-470.82.01.run
>>>
creation time: Sun Nov 7 19:04:52 2021
installer version: 470.82.01
PATH: /home/innoacad04/anaconda3/envs/fsdl-text-recognizer-2021/bin:/home/innoacad04/anaconda3/condabin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
nvidia-installer command line:
./nvidia-installer
Using: nvidia-installer ncurses v6 user interface
-> Detected 128 CPUs online; setting concurrency level to 32.
ERROR: An NVIDIA kernel module 'nvidia-uvm' appears to already be loaded in your kernel. This may be because it is in use (for example, by an X server, a CUDA program, or the NVIDIA Persistence Daemon), but this may also happen if your kernel was configured without support for module unloading. Please be sure to exit any programs that may be using the GPU(s) before attempting to upgrade your driver. If no GPU-based programs are running, you know that your kernel supports module unloading, and you still receive this message, then an error may have occurred that has corrupted an NVIDIA kernel module's usage count, for which the simplest remedy is to reboot your computer.
ERROR: Installation has failed. Please see the file '/var/log/nvidia-installer.log' for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
The installer fails with the error log above, so I tried to unload the nvidia-uvm module. The loaded modules are:
lsmod | grep nvidia
nvidia_uvm 1011712 0
nvidia_drm 49152 0
nvidia_modeset 1183744 2 nvidia_drm
nvidia 19722240 405 nvidia_uvm,nvidia_modeset
drm_kms_helper 184320 4 ast,nvidia_drm
drm 491520 6 drm_kms_helper,drm_vram_helper,ast,nvidia_drm,ttm
but I could not remove the module:
rmmod nvidia-uvm (also rmmod -r nvidia-uvm)
rmmod: ERROR: ../libkmod/libkmod-module.c:799 kmod_module_remove_module() could not remove 'nvidia_uvm': Operation not permitted
rmmod: ERROR: could not remove module nvidia-uvm: Operation not permitted
That is the resulting error. I also tried:
modprobe -r nvidia-uvm
It prints nothing. After running this command, I tried the .run installer again, but it failed again.
I think my permissions are restricted by the server owner.
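The permission guess can be checked instead of assumed. Unloading a kernel module requires the CAP_SYS_MODULE capability, which is bit 16 of the capability mask that Linux reports as CapEff in /proc/self/status. A minimal sketch, where has_cap_sys_module is a hypothetical helper name and the hex value is passed without the 0x prefix:

```shell
#!/bin/sh
# Succeed (exit status 0) when the given effective-capability mask
# includes CAP_SYS_MODULE (bit 16), which rmmod and modprobe -r need.
# $1: hexadecimal CapEff value as printed in /proc/<pid>/status.
# has_cap_sys_module is a hypothetical helper name for illustration.
has_cap_sys_module() {
    [ $(( (0x$1 >> 16) & 1 )) -eq 1 ]
}

# Usage on the server (run under sudo to test root's capabilities):
#   cap=$(sudo awk '/^CapEff:/ { print $2 }' /proc/self/status)
#   if has_cap_sys_module "$cap"; then
#       echo "module unloading is permitted"
#   else
#       echo "module unloading is blocked for this environment"
#   fi
```

If the bit is missing even for root, the restriction comes from how the environment was set up, and only the server owner can unload or replace the driver modules.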
Conclusion
How can I solve this problem? My goal is to use the GPU from PyTorch. I do not mind uninstalling and reinstalling anything.
I want to install CUDA Toolkit 11.3 and an NVIDIA driver that supports the A100.
→ According to the official site, driver 470 is for toolkit 11.4 and 460.106.00 is for toolkit 11.2.
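Since a driver is already loaded on the server, it may be enough to compare its version against what CUDA 11.3 needs before installing anything. A minimal sketch of that comparison follows. The helper name version_ge is made up for illustration, and the 465.19.01 threshold is my reading of the CUDA 11.3 release notes, so it should be verified there. The sketch also assumes GNU sort with the -V option.

```shell
#!/bin/sh
# Succeed when version $1 is greater than or equal to version $2.
# Relies on GNU sort -V (version sort): the smaller of the two versions
# sorts first, so $1 >= $2 exactly when the first sorted line is $2.
# version_ge is a hypothetical helper name for illustration.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage on the server (assumes nvidia-smi is on PATH):
#   drv=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   # 465.19.01 is an assumed minimum for CUDA 11.3; check the release notes.
#   if version_ge "$drv" 465.19.01; then
#       echo "driver $drv looks new enough for CUDA 11.3"
#   else
#       echo "driver $drv is older than the assumed minimum"
#   fi
```

If the existing driver is new enough, installing only the toolkit, or a PyTorch build that bundles its own CUDA runtime, would avoid touching the driver at all.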
My Environment
GPU
# lspci | grep -i nvidia
>>>
01:00.0 3D controller: NVIDIA Corporation GA100 [A100 SXM4 40GB] (rev a1)
47:00.0 3D controller: NVIDIA Corporation GA100 [A100 SXM4 40GB] (rev a1)
81:00.0 3D controller: NVIDIA Corporation GA100 [A100 SXM4 40GB] (rev a1)
c1:00.0 VGA compatible controller: NVIDIA Corporation TU117GLM [Quadro T1000 Mobile] (rev a1)
c1:00.1 Audio device: NVIDIA Corporation Device 10fa (rev a1)
c2:00.0 3D controller: NVIDIA Corporation GA100 [A100 SXM4 40GB] (rev a1)
My Linux
# uname -m && cat /etc/*release
>>>
x86_64
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.1 LTS"
NAME="Ubuntu"
VERSION="20.04.3 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.3 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal