Ubuntu 22.04.3 LTS Server, Tesla P100, Driver Version: 470.199.02, CUDA Version: 11.4

Hi,

I added a Tesla P100 16GB to a Dell PowerEdge R730 server running Ubuntu 22.04.3 LTS Server.

uname -r
5.15.0-79-generic

uname -a
Linux atlas 5.15.0-79-generic #86-Ubuntu SMP Mon Jul 10 16:07:21 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.199.02 Driver Version: 470.199.02 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE… Off | 00000000:03:00.0 Off | 0 |
| N/A 45C P0 32W / 250W | 0MiB / 16280MiB | 1% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

nvcc --version
Command 'nvcc' not found, but can be installed with:
sudo apt install nvidia-cuda-toolkit

When I try to install the CUDA toolkit, the installation removes driver 470.

How can I install CUDA?

Thank you!

sudo lshw -c video
*-display
description: 3D controller
product: GP100GL [Tesla P100 PCIe 16GB]
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:03:00.0
logical name: /dev/fb0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress bus_master cap_list fb
configuration: depth=32 driver=nvidia latency=0 mode=1440x900 visual=truecolor xres=1440 yres=900
resources: iomemory:3b80-3b7f iomemory:3bc0-3bbf irq:196 memory:91000000-91ffffff memory:3b800000000-3bbffffffff memory:3bc00000000-3bc01ffffff
*-display
description: VGA compatible controller
product: G200eR2
vendor: Matrox Electronics Systems Ltd.
physical id: 0
bus info: pci@0000:09:00.0
logical name: /dev/fb0
version: 01
width: 32 bits
clock: 33MHz
capabilities: pm vga_controller bus_master cap_list rom fb
configuration: depth=32 driver=mgag200 latency=64 maxlatency=32 mingnt=16 resolution=1440,900
resources: irq:19 memory:90000000-90ffffff memory:92800000-92803fff memory:92000000-927fffff memory:c0000-dffff

sudo dmesg | grep nvidia
[ 10.643212] nvidia: loading out-of-tree module taints kernel.
[ 10.643247] nvidia: module license 'NVIDIA' taints kernel.
[ 10.663439] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 10.677660] nvidia-nvlink: Nvlink Core is being initialized, major device number 510
[ 10.815162] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 470.199.02 Thu May 11 11:46:10 UTC 2023
[ 10.818325] [drm] [nvidia-drm] [GPU ID 0x00000300] Loading driver
[ 10.818351] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:03:00.0 on minor 1
[ 14.025381] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[ 14.030120] nvidia-uvm: Loaded the UVM driver, major device number 508.
[ 14.731673] audit: type=1400 audit(1692411947.088:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=1587 comm="apparmor_parser"
[ 14.731678] audit: type=1400 audit(1692411947.088:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=1587 comm="apparmor_parser"

After running the commands from the NVIDIA download page:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.2.1/local_installers/cuda-repo-ubuntu2204-12-2-local_12.2.1-535.86.10-1_amd64.deb

sudo dpkg -i cuda-repo-ubuntu2204-12-2-local_12.2.1-535.86.10-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2204-12-2-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda

Here is the output:

nvidia-smi
Sat Aug 19 02:57:20 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla P100-PCIE-16GB On | 00000000:03:00.0 Off | 0 |
| N/A 47C P0 28W / 250W | 0MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+

nvcc --version
Command 'nvcc' not found, but can be installed with:
sudo apt install nvidia-cuda-toolkit

Solved:

I started over with a newer driver and CUDA version, installing both with the commands supplied on the NVIDIA download page:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.2.1/local_installers/cuda-repo-ubuntu2204-12-2-local_12.2.1-535.86.10-1_amd64.deb

sudo dpkg -i cuda-repo-ubuntu2204-12-2-local_12.2.1-535.86.10-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2204-12-2-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda


CUDA is installed and shows up under:

cd /usr/local/cuda

but nvcc is still not found:

nvcc --version
Command 'nvcc' not found, but can be installed with:

To fix this, first check $PATH:

echo $PATH

The CUDA bin directory is not in it:

/home/cem/anaconda3/bin:/home/cem/anaconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
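The PATH check above can also be scripted. This is a minimal sketch in Python, assuming the toolkit lives in /usr/local/cuda/bin as in this post; the function name cuda_bin_on_path is made up for illustration.

```python
import os

def cuda_bin_on_path(path_value, cuda_bin="/usr/local/cuda/bin"):
    """Return True if cuda_bin is one of the entries of a PATH-style string.

    Assumes the Linux separator ':' and the default CUDA location
    used in this post; adjust cuda_bin if the toolkit lives elsewhere.
    """
    entries = [os.path.normpath(p) for p in path_value.split(":") if p]
    return os.path.normpath(cuda_bin) in entries

if __name__ == "__main__":
    # Check the PATH of the current process.
    print(cuda_bin_on_path(os.environ.get("PATH", "")))
```

Run against the PATH shown above, it should report that the CUDA directory is missing.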

So we add it to $PATH. Open ~/.bashrc:

nano ~/.bashrc

Add this line at the end:

export PATH=/usr/local/cuda/bin:$PATH

Save and exit (Ctrl+O, Enter, Ctrl+X).

Reload ~/.bashrc in the current shell:

source ~/.bashrc

Check that it worked:

which nvcc
/usr/local/cuda/bin/nvcc

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jul_11_02:20:44_PDT_2023
Cuda compilation tools, release 12.2, V12.2.128
Build cuda_12.2.r12.2/compiler.33053471_0

Yes, it worked.
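If you want the same check from a script, here is a rough sketch in Python. It looks up nvcc on PATH and pulls the release number out of the nvcc --version text shown above. The function names are my own, and the regex assumes the "release X.Y" wording of that output.

```python
import re
import shutil
import subprocess

def parse_nvcc_release(text):
    """Extract the toolkit release (e.g. '12.2') from `nvcc --version` output."""
    match = re.search(r"release\s+(\d+\.\d+)", text)
    return match.group(1) if match else None

def installed_toolkit_release():
    """Return the release of the nvcc found on PATH, or None if nvcc is missing."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)

if __name__ == "__main__":
    # Prints None when nvcc is not on PATH.
    print(installed_toolkit_release())
```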

But PyTorch still cannot see the GPU:

import torch
print(torch.__version__)
print(torch.cuda.is_available())

2.0.1
False

Uninstall torch, torchvision and torchaudio, then reinstall them:

pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio -f https://download.pytorch.org/whl/cu122/torch_stable.html

Reboot the machine:

sudo reboot now


Check that it is installed:

import torch
print("Torch version:", torch.__version__)
print("CUDA version used by PyTorch:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
print("Number of CUDA devices:", torch.cuda.device_count())
print("Current CUDA device:", torch.cuda.current_device())

Torch version: 2.0.1+cu117
CUDA version used by PyTorch: 11.7
CUDA available: True
Number of CUDA devices: 1
Current CUDA device: 0
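To go one step further than the availability flags, here is a small sketch that runs an actual computation on the GPU. The function name gpu_smoke_test and the matrix size are arbitrary choices for illustration. It returns a message instead of failing when torch or a CUDA device is missing.

```python
def gpu_smoke_test(size=1024):
    """Run a small matrix multiply on the first CUDA device, if there is one."""
    try:
        import torch
    except ImportError:
        return "torch is not installed"
    if not torch.cuda.is_available():
        return "CUDA is not available to PyTorch"
    x = torch.rand(size, size, device="cuda")
    y = x @ x
    # Kernels run asynchronously; wait until the multiply has finished.
    torch.cuda.synchronize()
    name = torch.cuda.get_device_name(0)
    return f"ok: {tuple(y.shape)} result computed on {name}"

if __name__ == "__main__":
    print(gpu_smoke_test())
```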

Taaaa DAAAaaa !

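Why does the CUDA version in nvidia-smi not match nvcc --version or torch.version.cuda?

nvidia-smi shows the highest CUDA version that the installed NVIDIA driver supports, not the version of the installed toolkit. nvcc --version reports the toolkit version, and PyTorch reports the CUDA version it was built for (11.7 for the +cu117 build above). A lower toolkit or PyTorch CUDA version is fine as long as it does not exceed what the driver supports.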
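As a rule of thumb, this can be written as a simple comparison. The sketch below parses the "CUDA Version" field from the nvidia-smi header and checks that a toolkit or PyTorch CUDA version does not exceed it. The function names are hypothetical, and the check is a simplification: it ignores NVIDIA's compatibility packages and minor-version compatibility.

```python
import re

def parse_version(text):
    """Turn '12.2' into (12, 2) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))

def driver_cuda_version(nvidia_smi_header):
    """Extract the 'CUDA Version: X.Y' field from the nvidia-smi header line."""
    match = re.search(r"CUDA Version:\s*(\d+\.\d+)", nvidia_smi_header)
    return match.group(1) if match else None

def driver_supports(nvidia_smi_header, runtime_version):
    """Simplified check: runtime version must not exceed the driver's maximum."""
    driver_max = driver_cuda_version(nvidia_smi_header)
    if driver_max is None:
        return False
    return parse_version(runtime_version) <= parse_version(driver_max)

if __name__ == "__main__":
    header = "| NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2 |"
    print(driver_supports(header, "11.7"))  # PyTorch build from this post
```

With the 535 driver from this post, both the 12.2 toolkit and the 11.7 PyTorch build pass this check; with the original 470 driver (CUDA Version: 11.4), the 12.2 toolkit would not.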

So, chilaxxxxxx!