I just resolved this issue and would like to share my experience here. I have finally installed the suitable NVIDIA driver, CUDA toolkit, and PyTorch, and everything works fine (at least for now lollll). Hope this helps!
-
What I did before the issue occurred
I didn’t know much about NVIDIA drivers and ended up making a mess of the drivers and CUDA toolkits. I tried installing NVIDIA driver versions 515, 525, 535, and 550, CUDA toolkit versions 11.7, 11.8, 12.1, and 12.4, and PyTorch versions 1.13, 2.0, 2.1, 2.2, and 2.3 (so no wonder I finally messed up my system lollll).
-
To install the NVIDIA driver
Check the NVIDIA driver status with the command:

    nvidia-smi

This should print a table including the NVIDIA driver version and the supported CUDA version. For example, mine is

    Driver Version: 535.171.04 CUDA Version: 12.2

(on Ubuntu 20.04).
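If you would rather read those two values from a script than by eye, here is a minimal Python sketch. It assumes the header keeps the `Driver Version: ... CUDA Version: ...` layout shown above; the function name `parse_smi_header` is made up for illustration and is not part of any NVIDIA tool.

```python
import re


def parse_smi_header(text):
    """Return (driver_version, cuda_version) found in nvidia-smi output, or None."""
    match = re.search(
        r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)", text
    )
    if match is None:
        return None  # header missing, e.g. the driver is not loaded
    return match.group(1), match.group(2)


# Sample taken from the output quoted above.
sample = "Driver Version: 535.171.04 CUDA Version: 12.2"
print(parse_smi_header(sample))
```

To run it against the real tool, the text could come from something like `subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout`.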
If there are errors printing the info, try reinstalling the NVIDIA driver. The best way to do it is through the Ubuntu GUI: Software & Updates → Additional Drivers → [choose an NVIDIA driver version and click Apply Changes].
(This can also produce errors, but there are plenty of posts on how to resolve them. Hope this is not an issue for you.)
-
To clear the error before installing CUDA
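CUDA 11.7 is compatible with NVIDIA driver 535: the CUDA version printed by nvidia-smi (12.2 for me) is the newest version the driver supports, and older toolkits such as 11.7 work with it as well.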
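One way to sanity-check a toolkit/driver pairing is to compare the toolkit version against the CUDA version that nvidia-smi prints. The sketch below assumes the simple rule "toolkit version must not be newer than the version the driver reports"; it ignores CUDA's minor-version compatibility, so treat a `False` as a warning rather than a verdict. The helper names are hypothetical.

```python
def version_tuple(version):
    """Turn '11.7' into (11, 7) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in version.split("."))


def driver_supports(toolkit_version, driver_cuda_version):
    """True if the toolkit is not newer than the CUDA version nvidia-smi reports."""
    return version_tuple(toolkit_version) <= version_tuple(driver_cuda_version)


print(driver_supports("11.7", "12.2"))  # True
print(driver_supports("12.4", "12.2"))  # False
```

Comparing tuples of integers rather than raw strings matters once a minor version reaches two digits, since `"11.10" < "11.7"` as strings.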
I first followed the process from the official website (CUDA Toolkit 11.7 Update 1 Downloads), but some of the steps there need to be modified, because as written they would actually affect the result.
First, I removed all the unpacked CUDA packages in the /var/ folder and also the package management info.
I used the following commands since I had multiple CUDA packages there:

    sudo rm -r /var/cuda-repo-*
    sudo rm /etc/apt/sources.list.d/cuda*

THIS IS THE VERY STEP that clears the above error for me.
Then, clean the apt caches using:

    sudo apt autoremove
    sudo apt clean

Now, reboot and check everything again:

    nvidia-smi
    sudo apt-get update
    sudo apt-get upgrade

Hopefully, the NVIDIA driver is still working fine.
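Since the two `rm` commands above use wildcards, it can be reassuring to see exactly what they match, either before deleting or afterwards to confirm nothing is left. This is a rough read-only sketch; the function name and its parameters are made up for illustration, and the default paths are simply the ones from the commands above.

```python
import glob
import os


def find_cuda_leftovers(var_dir="/var", sources_dir="/etc/apt/sources.list.d"):
    """List paths matching the patterns used in the rm commands, without deleting."""
    patterns = [
        os.path.join(var_dir, "cuda-repo-*"),  # unpacked local repo folders
        os.path.join(sources_dir, "cuda*"),    # apt source entries for CUDA
    ]
    found = []
    for pattern in patterns:
        found.extend(sorted(glob.glob(pattern)))
    return found


for path in find_cuda_leftovers():
    print(path)
```

After the cleanup it should print nothing.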
-
To install the CUDA toolkit
I followed the official steps but changed the last command to
sudo apt-get -y install cuda-toolkit-11-7
(refer to this post Problems loading nvidia drivers after cuda toolkit installation).
Since I started over with the downloaded cuda-repo-xxx.deb file (for example, mine is cuda-repo-ubuntu2004-11-7-local_11.7.1-515.65.01-1_amd64.deb), the commands are:

    sudo dpkg -i cuda-repo-ubuntu2004-11-7-local_11.7.1-515.65.01-1_amd64.deb
    sudo cp /var/cuda-repo-ubuntu2004-11-7-local/cuda-*-keyring.gpg /usr/share/keyrings/
    sudo apt-get update
    sudo apt-get -y install cuda-toolkit-11-7

Then, edit the environment variables in the ~/.bashrc file. Choose whatever text editor you like (I used gedit ~/.bashrc), and add the following lines to it:

    export PATH=$PATH:/usr/local/cuda-11.7/bin
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.7/lib64

Then save the file and source it using

    source ~/.bashrc
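A quick way to confirm the new shell really picked up both entries is to inspect the variables from Python. This is only a sketch: it assumes the `/usr/local/cuda-11.7` prefix from the export lines above, and `path_contains` is a hypothetical helper name.

```python
import os


def path_contains(var_value, directory):
    """True if `directory` is one of the colon-separated entries in var_value."""
    entries = [entry.rstrip("/") for entry in var_value.split(":") if entry]
    return directory.rstrip("/") in entries


cuda_home = "/usr/local/cuda-11.7"  # assumed: same prefix as in the export lines
print("PATH ok:", path_contains(os.environ.get("PATH", ""), cuda_home + "/bin"))
print(
    "LD_LIBRARY_PATH ok:",
    path_contains(os.environ.get("LD_LIBRARY_PATH", ""), cuda_home + "/lib64"),
)
```

If either line prints `False` in a fresh terminal, the exports were probably not saved or ~/.bashrc was not sourced.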
Now, check the CUDA version using

    nvcc --version

This should print something like:

    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2022 NVIDIA Corporation
    Built on Wed_Jun__8_16:49:14_PDT_2022
    Cuda compilation tools, release 11.7, V11.7.99
    Build cuda_11.7.r11.7/compiler.31442593_0
-
(Additional) To install PyTorch that supports CUDA
I would like to add this part because this was where my nightmare began. I have to say the official PyTorch steps were quite unclear to me when I wanted to install a PyTorch build that supports an older version of CUDA (i.e., 11.7).
In my case, only PyTorch <= 1.13 worked with my CUDA 11.7! I finally found the right commands on the official page, Installing previous versions of PyTorch.
Since I am using conda, I used:

    # first, don't forget to use `conda activate xxx` to activate your env
    conda install pytorch==1.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia

I didn’t install torchvision and torchaudio because I saw some posts saying that they could cause other errors, so I just skipped them.
Finally, to check if your PyTorch works with CUDA, simply run:

    python -c "import torch; print(torch.cuda.is_available())"

which should return True.
Moreover, here is another “fancier” Python script to check PyTorch with CUDA:

    import torch

    print(f"PyTorch version: {torch.__version__}")  # Should print the installed PyTorch version
    print(f"CUDA available: {torch.cuda.is_available()}")  # Should print 'True'
    if torch.cuda.is_available():
        print(f"CUDA version: {torch.version.cuda}")  # Should print the CUDA version, e.g., '11.3' or '11.7'
        print(f"Device Name: {torch.cuda.get_device_name(0)}")  # Should print the name of your GPU, e.g., 'GeForce GTX 1050 Ti'
    else:
        print("CUDA is not available. Check your installation.")
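`torch.cuda.is_available()` only says that a GPU and driver were detected; it does not run anything on the device. As an optional extra step, here is a small smoke test that performs one tiny computation on the GPU. It is a sketch, not an official PyTorch diagnostic, and the function name `gpu_smoke_test` is made up.

```python
def gpu_smoke_test():
    """Run one tiny computation on the GPU and report what happened as a string."""
    try:
        import torch
    except (ImportError, OSError) as err:
        return f"could not import torch: {err}"
    if not torch.cuda.is_available():
        return "CUDA is not available"
    try:
        x = torch.ones(1024, device="cuda")  # allocate directly on the GPU
        total = x.sum().item()               # .item() copies the result to the CPU
    except RuntimeError as err:
        return f"GPU computation failed: {err}"
    return "ok" if total == 1024.0 else f"unexpected result: {total}"


print(gpu_smoke_test())
```

On a working setup this should print `ok`; any other message points at the layer that is failing (import, detection, or actual execution).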