Hi,
I added a Tesla P100 16GB to a Dell PowerEdge R730 server running Ubuntu 22.04.3 LTS Server.
uname -r
5.15.0-79-generic
uname -a
Linux atlas 5.15.0-79-generic #86-Ubuntu SMP Mon Jul 10 16:07:21 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.199.02   Driver Version: 470.199.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:03:00.0 Off |                    0 |
| N/A   45C    P0    32W / 250W |      0MiB / 16280MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
nvcc --version
Command 'nvcc' not found, but can be installed with:
sudo apt install nvidia-cuda-toolkit
When I try to install the CUDA toolkit, the installation removes driver 470.
How can I install CUDA?
Thank you!
sudo lshw -c video
*-display
description: 3D controller
product: GP100GL [Tesla P100 PCIe 16GB]
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:03:00.0
logical name: /dev/fb0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress bus_master cap_list fb
configuration: depth=32 driver=nvidia latency=0 mode=1440x900 visual=truecolor xres=1440 yres=900
resources: iomemory:3b80-3b7f iomemory:3bc0-3bbf irq:196 memory:91000000-91ffffff memory:3b800000000-3bbffffffff memory:3bc00000000-3bc01ffffff
*-display
description: VGA compatible controller
product: G200eR2
vendor: Matrox Electronics Systems Ltd.
physical id: 0
bus info: pci@0000:09:00.0
logical name: /dev/fb0
version: 01
width: 32 bits
clock: 33MHz
capabilities: pm vga_controller bus_master cap_list rom fb
configuration: depth=32 driver=mgag200 latency=64 maxlatency=32 mingnt=16 resolution=1440,900
resources: irq:19 memory:90000000-90ffffff memory:92800000-92803fff memory:92000000-927fffff memory:c0000-dffff
sudo dmesg | grep nvidia
[ 10.643212] nvidia: loading out-of-tree module taints kernel.
[ 10.643247] nvidia: module license 'NVIDIA' taints kernel.
[ 10.663439] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 10.677660] nvidia-nvlink: Nvlink Core is being initialized, major device number 510
[ 10.815162] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 470.199.02 Thu May 11 11:46:10 UTC 2023
[ 10.818325] [drm] [nvidia-drm] [GPU ID 0x00000300] Loading driver
[ 10.818351] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:03:00.0 on minor 1
[ 14.025381] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[ 14.030120] nvidia-uvm: Loaded the UVM driver, major device number 508.
[ 14.731673] audit: type=1400 audit(1692411947.088:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=1587 comm="apparmor_parser"
[ 14.731678] audit: type=1400 audit(1692411947.088:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=1587 comm="apparmor_parser"
After running the commands from the NVIDIA download page:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.2.1/local_installers/cuda-repo-ubuntu2204-12-2-local_12.2.1-535.86.10-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2204-12-2-local_12.2.1-535.86.10-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2204-12-2-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda
Here is the output:
nvidia-smi
Sat Aug 19 02:57:20 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10              Driver Version: 535.86.10    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P100-PCIE-16GB           On  | 00000000:03:00.0 Off |                    0 |
| N/A   47C    P0              28W / 250W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
nvcc --version
Command 'nvcc' not found, but can be installed with:
sudo apt install nvidia-cuda-toolkit
Solved:
I first installed a newer driver together with CUDA from:
Here are the commands I used to install both, as supplied on the download page above:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.2.1/local_installers/cuda-repo-ubuntu2204-12-2-local_12.2.1-535.86.10-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2204-12-2-local_12.2.1-535.86.10-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2204-12-2-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda
CUDA is now installed and shows up under:
cd /usr/local/cuda
but
nvcc --version
still says:
Command 'nvcc' not found, but can be installed with:
To fix this, check the current PATH:
echo $PATH
It looks like the CUDA bin directory is not in it:
/home/cem/anaconda3/bin:/home/cem/anaconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
So we will edit $PATH:
nano ~/.bashrc
Add this line to the end of the file:
export PATH=/usr/local/cuda/bin:$PATH
Save and exit (Ctrl+O, Enter, Ctrl+X).
Reload the file in the current shell:
source ~/.bashrc
Check whether it worked:
which nvcc
/usr/local/cuda/bin/nvcc
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jul_11_02:20:44_PDT_2023
Cuda compilation tools, release 12.2, V12.2.128
Build cuda_12.2.r12.2/compiler.33053471_0
Yes, it worked.
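For anyone who wants to script the same check, here is a minimal Python sketch of the PATH test above. It assumes the toolkit lives under /usr/local/cuda as on my machine; the helper names `cuda_bin_on_path` and `prepend_cuda_bin` are made up for this example and are not part of any NVIDIA tool.

```python
import os
import shutil

# Assumption: the toolkit is installed under /usr/local/cuda, as in this post.
CUDA_BIN = "/usr/local/cuda/bin"
PATH_SEP = ":"  # PATH separator on Linux


def cuda_bin_on_path(path_value, cuda_bin=CUDA_BIN):
    """Return True if cuda_bin is one of the entries of a PATH-style string."""
    entries = [entry.rstrip("/") for entry in path_value.split(PATH_SEP) if entry]
    return cuda_bin.rstrip("/") in entries


def prepend_cuda_bin(path_value, cuda_bin=CUDA_BIN):
    """Mimic `export PATH=/usr/local/cuda/bin:$PATH` without adding a duplicate."""
    if cuda_bin_on_path(path_value, cuda_bin):
        return path_value
    return cuda_bin + PATH_SEP + path_value


if __name__ == "__main__":
    current = os.environ.get("PATH", "")
    print("CUDA bin on PATH:", cuda_bin_on_path(current))
    # shutil.which returns None when nvcc cannot be found on PATH.
    print("nvcc resolves to:", shutil.which("nvcc"))
```

This only reports on the PATH of the process that runs it. The permanent fix is still the export line in ~/.bashrc.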
But PyTorch still does not see the GPU:
import torch
print(torch.__version__)
print(torch.cuda.is_available())
2.0.1
False
Uninstall torch, torchvision, and torchaudio, then reinstall them:
pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio -f https://download.pytorch.org/whl/cu122/torch_stable.html
Reboot the machine:
sudo reboot now
Check whether it is installed:
import torch
print("Torch version:", torch.__version__)
print("CUDA version used by PyTorch:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
print("Number of CUDA devices:", torch.cuda.device_count())
print("Current CUDA device:", torch.cuda.current_device())
Torch version: 2.0.1+cu117
CUDA version used by PyTorch: 11.7
CUDA available: True
Number of CUDA devices: 1
Current CUDA device: 0
Taaaa DAAAaaa !
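Note that the wheel reports 2.0.1+cu117 and CUDA 11.7 even though the system toolkit is 12.2. As far as I understand, the PyTorch pip wheels ship their own CUDA runtime libraries, so what they need from the system is a new enough driver, not the system nvcc. The "+cu117" suffix encodes that bundled version. Here is a small sketch that decodes it; the function name `cuda_from_torch_version` is made up for this example, and the "last digit is the minor version" rule is my assumption based on tags like cu117.

```python
import re


def cuda_from_torch_version(version_string):
    """Decode the CUDA version from a wheel tag, e.g. '2.0.1+cu117' -> '11.7'.

    Returns None when there is no '+cuXYZ' suffix (for example a CPU-only build).
    """
    match = re.search(r"\+cu(\d+)$", version_string)
    if match is None:
        return None
    digits = match.group(1)
    # Assumption: the last digit is the minor version, the rest is the major.
    return f"{digits[:-1]}.{digits[-1]}"


print(cuda_from_torch_version("2.0.1+cu117"))  # prints 11.7
```

In a live session you would pass `torch.__version__` instead of the hard-coded string and compare the result with `torch.version.cuda`.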
What if the CUDA version in nvidia-smi does not match nvcc --version?
nvidia-smi shows the highest CUDA version that the installed NVIDIA driver can support, not the version of the installed toolkit, so the two numbers do not have to match.
So, chilaxxxxxx!
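If you want to compare the two numbers yourself, here is a rough sketch that pulls the version out of each tool's text output. It is only a rule of thumb (driver version at least as high as the toolkit version); NVIDIA's actual compatibility rules are more detailed. The function names are made up for this example, and the sample strings are copied from the outputs earlier in this thread.

```python
import re


def driver_cuda_version(nvidia_smi_text):
    """Return (major, minor) from the 'CUDA Version: X.Y' field of nvidia-smi output."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", nvidia_smi_text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def toolkit_cuda_version(nvcc_text):
    """Return (major, minor) from the 'release X.Y' field of `nvcc --version` output."""
    match = re.search(r"release\s+(\d+)\.(\d+)", nvcc_text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def driver_covers_toolkit(driver, toolkit):
    """Simplified check: the driver's maximum CUDA version is >= the toolkit version."""
    return driver >= toolkit


# Sample lines copied from the outputs in this thread.
smi_line = "| NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2 |"
nvcc_line = "Cuda compilation tools, release 12.2, V12.2.128"

driver = driver_cuda_version(smi_line)
toolkit = toolkit_cuda_version(nvcc_line)
print(driver, toolkit, driver_covers_toolkit(driver, toolkit))
```

By the same rule of thumb, the old 470 driver (CUDA Version: 11.4) would not have covered the 12.2 toolkit.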