Dell A100 GPU issues with NVIDIA driver

Hi,
I have a Dell machine with one A100, running Ubuntu 22.04:

Distributor ID: Ubuntu
Description:    Ubuntu 22.04.3 LTS
Release:        22.04
Codename:       jammy

When I install the NVIDIA drivers they seem to be correctly installed. If I run nvidia-smi I get the following:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  On   | 00000000:3B:00.0 Off |                   On |
| N/A   23C    P0    31W / 250W |      0MiB / 40960MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  No MIG devices found                                                       |
+-----------------------------------------------------------------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

This is CUDA 11.8.
The kernel modules seem to load OK:

[    4.960933] nvidia: module license 'NVIDIA' taints kernel.
[    5.205290] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  520.61.05  Thu Sep 29 05:30:25 UTC 2022
[    6.683947] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX


However, if I install PyTorch in an Anaconda env and try to run the following:

>>> torch.version.cuda
'11.8'
>>> torch.cuda.is_available()
/home/gptftw/.local/lib/python3.10/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
False

Below are the torch dependencies in the Anaconda env:

torch                   2.1.0+cu118
torchaudio              2.1.0+cu118
torchvision             0.16.0+cu118

I have tried CUDA 11.8, 11.4, 12.1, 12.2, and 12.3, installing the matching torch version in Anaconda, in a venv, and in the global Python; all give the same error.

Hello @ivan.jacobs and welcome to the NVIDIA developer forums.

Can you check in your Anaconda env whether you have CUDA_VISIBLE_DEVICES set to the device ID of your GPU? On a single-GPU system this should be CUDA_VISIBLE_DEVICES="0".
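A quick way to check and set this from the shell (a minimal sketch; the `<unset>` placeholder is just for display):

```shell
# Show whether CUDA_VISIBLE_DEVICES is set in the current environment
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-<unset>}"

# Restrict CUDA to the first GPU for this shell session
export CUDA_VISIBLE_DEVICES=0
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
```

Note that `export` only affects the current shell session; put it in your shell profile or conda activation scripts if you want it to persist.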

If that does not help, I recommend reviewing the installation instructions on the PyTorch pages in case you missed a step required to enable GPU support. If I recall correctly, PyTorch constrains which torch package versions support which specific CUDA versions.

Hi @MarkusHoHo, thank you for the fast response.
Torch finds one device:

>>> torch.cuda.device_count()
1

However, I cannot access it:

>>> torch.cuda.get_device_name(0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/gptftw/.local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 419, in get_device_name
    return get_device_properties(device).name
  File "/home/gptftw/.local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 449, in get_device_properties
    _lazy_init()  # will define _get_device_properties
  File "/home/gptftw/.local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 298, in _lazy_init
    torch._C._cuda_init()
RuntimeError: CUDA driver initialization failed, you might not have a CUDA gpu.

For every CUDA version I tested, I installed the corresponding torch version that supports it, as per the PyTorch/CUDA compatibility matrix.
I did the same with TensorFlow, with similar results.
My gut feeling is that the driver, even though it shows as installed, fails in some way.
Are there tests or analyses I can run to verify that the driver and its dependencies are correctly installed and functioning?
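One driver-level sanity check, independent of PyTorch, is to load the CUDA driver library directly and call its `cuInit` entry point; if that fails, the problem lies below torch in the driver stack. This is just a sketch, assuming Linux with the standard `libcuda.so.1` soname, and `check_cuda_driver` is an illustrative name:

```python
import ctypes


def check_cuda_driver() -> str:
    """Load libcuda directly and try to initialize the CUDA driver."""
    try:
        libcuda = ctypes.CDLL("libcuda.so.1")
    except OSError as err:
        # Library missing or not on the loader path: driver install problem
        return f"could not load libcuda: {err}"
    # cuInit(0) returns 0 (CUDA_SUCCESS) when the driver initializes cleanly
    status = libcuda.cuInit(0)
    return f"cuInit returned {status} (0 means CUDA_SUCCESS)"


print(check_cuda_driver())
```

A nonzero `cuInit` status can be mapped to an error name in the CUDA Driver API documentation; for example, a permissions problem on `/dev/nvidia*` or a MIG configuration issue would surface here even though `nvidia-smi` works.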