Dell A100 GPU issues with NVIDIA driver

Hi,
I have a Dell machine with one A100, running Ubuntu 22.04:

Distributor ID: Ubuntu
Description:    Ubuntu 22.04.3 LTS
Release:        22.04
Codename:       jammy

When I install the NVIDIA drivers they seem to be correctly installed. If I run nvidia-smi I get the following:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  On   | 00000000:3B:00.0 Off |                   On |
| N/A   23C    P0    31W / 250W |      0MiB / 40960MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  No MIG devices found                                                       |
+-----------------------------------------------------------------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

This is CUDA 11.8.
The kernel modules seem to load OK:

[    4.960933] nvidia: module license 'NVIDIA' taints kernel.
[    5.205290] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  520.61.05  Thu Sep 29 05:30:25 UTC 2022
[    6.683947] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX


However, if I install PyTorch in an Anaconda env and try to run the following:

>>> torch.version.cuda
'11.8'
>>> torch.cuda.is_available()
/home/gptftw/.local/lib/python3.10/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
False

Below are the torch dependencies in the Anaconda env:

torch                   2.1.0+cu118
torchaudio              2.1.0+cu118
torchvision             0.16.0+cu118

I have tried CUDA 11.8, 11.4, 12.1, 12.2, and 12.3, installing the matching torch version in Anaconda, in a venv, and in the global Python; all give the same error.

Hello @ivan.jacobs and welcome to the NVIDIA developer forums.

Can you check in your Anaconda env whether you have CUDA_VISIBLE_DEVICES set to the device ID of your GPU? On a single-GPU system this should be CUDA_VISIBLE_DEVICES="0".
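A quick way to check and set this from the shell (a minimal sketch; the `<unset>` placeholder is just for display):

```shell
# Show whether CUDA_VISIBLE_DEVICES is set in the current environment
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-<unset>}"

# Restrict CUDA to the first GPU for this shell session
export CUDA_VISIBLE_DEVICES=0
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
```

Note that `export` only affects the current shell session; put it in your shell profile or conda activation scripts if you want it to persist.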

If that does not help, I recommend reviewing the installation instructions on the PyTorch pages in case you missed a step required to enable GPU support. If I recall correctly, PyTorch constrains which torch package versions support which specific CUDA versions.

Hi @MarkusHoHo, thank you for the fast response.
Torch finds one device:

>>> torch.cuda.device_count()
1

However, I cannot access it:

>>> torch.cuda.get_device_name(0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/gptftw/.local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 419, in get_device_name
    return get_device_properties(device).name
  File "/home/gptftw/.local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 449, in get_device_properties
    _lazy_init()  # will define _get_device_properties
  File "/home/gptftw/.local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 298, in _lazy_init
    torch._C._cuda_init()
RuntimeError: CUDA driver initialization failed, you might not have a CUDA gpu.

For every CUDA version I tested, I installed the corresponding torch version that supports it, as per the PyTorch/CUDA compatibility matrix.
I did the same with TensorFlow, with similar results.
My gut feeling is that the driver, even though it shows as installed, fails in some way.
Are there tests or analyses I can run to verify that the driver and its dependencies are correctly installed and functioning?
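One driver-level sanity check, independent of PyTorch, is to load the CUDA driver library directly and call its `cuInit` entry point; if that fails, the problem lies below torch in the driver stack. This is just a sketch, assuming Linux with the standard `libcuda.so.1` soname, and `check_cuda_driver` is an illustrative name:

```python
import ctypes


def check_cuda_driver() -> str:
    """Load libcuda directly and try to initialize the CUDA driver."""
    try:
        libcuda = ctypes.CDLL("libcuda.so.1")
    except OSError as err:
        # Library missing or not on the loader path: driver install problem
        return f"could not load libcuda: {err}"
    # cuInit(0) returns 0 (CUDA_SUCCESS) when the driver initializes cleanly
    status = libcuda.cuInit(0)
    return f"cuInit returned {status} (0 means CUDA_SUCCESS)"


print(check_cuda_driver())
```

A nonzero `cuInit` status can be mapped to an error name in the CUDA Driver API documentation; for example, a permissions problem on `/dev/nvidia*` or a MIG configuration issue would surface here even though `nvidia-smi` works.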