Hi,
I have a Dell machine with one A100, running Ubuntu 22.04:
Distributor ID: Ubuntu
Description: Ubuntu 22.04.3 LTS
Release: 22.04
Codename: jammy
When I install the NVIDIA drivers, they seem to be installed correctly. If I run nvidia-smi, I get the following:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  On   | 00000000:3B:00.0 Off |                   On |
| N/A   23C    P0    31W / 250W |      0MiB / 40960MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  No MIG devices found                                                       |
+-----------------------------------------------------------------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
nvidia-smi reports CUDA 11.8 here.
The kernel modules also seem to load fine:
[ 4.960933] nvidia: module license 'NVIDIA' taints kernel.
[ 5.205290] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 520.61.05 Thu Sep 29 05:30:25 UTC 2022
[ 6.683947] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 520.61.05 Thu Sep 29 05:29:37 UTC 2022
However, if I install PyTorch in an Anaconda env and run the following, CUDA is not available:
>>> torch.version.cuda
'11.8'
>>> torch.cuda.is_available()
/home/gptftw/.local/lib/python3.10/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
False
Below are the torch packages in the Anaconda env:
torch 2.1.0+cu118
torchaudio 2.1.0+cu118
torchvision 0.16.0+cu118
I have tried CUDA 11.8, 11.4, 12.1, 12.2 and 12.3, installing the matching torch version each time in Anaconda, in a venv, and in the global Python. They all give the same error.
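
For anyone trying to reproduce this, here is a minimal sketch of a script that collects the same information in one go. The function name `collect_env_report` and the dictionary keys are my own, not part of any library. It reports which Python and which torch install are actually being picked up (the warning above points at ~/.local/lib/python3.10/site-packages), and asks nvidia-smi for the MIG mode, since the output above shows MIG enabled with no MIG devices. I am assuming nvidia-smi is on the PATH and that this driver supports the `mig.mode.current` query field; if not, the script just records whatever nvidia-smi prints.

```python
import importlib.util
import shutil
import subprocess
import sys


def collect_env_report():
    """Gather facts relevant to a 'CUDA driver initialization failed' report."""
    report = {"python_executable": sys.executable}

    # Locate torch without importing it, to see which site-packages wins.
    spec = importlib.util.find_spec("torch")
    report["torch_location"] = spec.origin if spec else None

    if spec is not None:
        try:
            import torch

            report["torch_version"] = torch.__version__
            report["torch_cuda_build"] = torch.version.cuda
            report["cuda_available"] = torch.cuda.is_available()
        except Exception as exc:  # a broken install should not abort the report
            report["torch_import_error"] = repr(exc)

    # Ask the driver for GPU name, driver version and MIG mode, if possible.
    if shutil.which("nvidia-smi"):
        result = subprocess.run(
            [
                "nvidia-smi",
                "--query-gpu=name,driver_version,mig.mode.current",
                "--format=csv,noheader",
            ],
            capture_output=True,
            text=True,
            check=False,
        )
        report["nvidia_smi"] = result.stdout.strip() or result.stderr.strip()
    else:
        report["nvidia_smi"] = None

    return report


if __name__ == "__main__":
    for key, value in collect_env_report().items():
        print(f"{key}: {value}")
```

Running it once inside the Anaconda env, once in the venv, and once with the global Python should show whether all three end up importing the same torch from ~/.local.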