A100 GPU on GCP: "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver.", "Found no NVIDIA driver on your system."

glau-ml · October 21, 2022, 11:58am

nvidia-smi Error: NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver
There seems to be an NVIDIA driver issue on the A100 40GB VM instances that I spin up in GCP Compute Engine (with a boot disk storage container). On these instances, nvidia-smi returns:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Manually installed CUDA Driver
Therefore, I manually installed the driver by downloading the .run installer listed at https://www.nvidia.com/download/driverResults.aspx/191320/en-us/ and running it:

wget https://us.download.nvidia.com/tesla/515.65.01/NVIDIA-Linux-x86_64-515.65.01.run
sudo sh NVIDIA-Linux-x86_64-515.65.01.run

I then verified that the driver is installed by running nvidia-smi:

$ nvidia-smi
Fri Oct 21 10:03:18 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    52W / 400W |      0MiB / 40960MiB |      2%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
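
The same check can also be scripted, which makes it easy to repeat from whichever environment actually launches the Python service. The following is only a sketch: the helper names `parse_gpu_rows` and `query_gpus` are made up for illustration, and it assumes `nvidia-smi` is on the PATH of the process that runs it.

```python
import shutil
import subprocess


def parse_gpu_rows(csv_text):
    """Turn 'name, driver_version, memory.total' CSV lines into dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3:
            continue  # skip anything that is not a three-field row
        rows.append(
            {"name": parts[0], "driver_version": parts[1], "memory_total": parts[2]}
        )
    return rows


def query_gpus():
    """Return a list of GPU dicts, or None if nvidia-smi is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return None  # binary not visible from this environment
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version,memory.total",
         "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None  # e.g. "couldn't communicate with the NVIDIA driver"
    return parse_gpu_rows(result.stdout)


if __name__ == "__main__":
    print(query_gpus())
```

If this prints a GPU entry on the VM but `None` from the environment where service.py runs, that would suggest the two environments do not see the same driver installation.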

NVIDIA Driver Error: Found no NVIDIA driver on your system
However, my Python script that loads models onto CUDA still fails with RuntimeError: Found no NVIDIA driver on your system. Note that this script runs successfully on GCP P100, T4, and V100 GPUs. Please see the stack trace below:

  File "service.py", line 223, in download_models
    config['transformer']['model'][model_name] = model_name_function_mapping[model_name](model).eval().cuda()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/core/mixins/device_dtype_mixin.py", line 128, in cuda
    return super().cuda(device=device)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 688, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 601, in _apply
    param_applied = fn(param)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 688, in <lambda>
    return self._apply(lambda t: t.cuda(device))
  File "/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py", line 215, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
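
Since nvidia-smi succeeds while PyTorch reports no driver, one way to narrow things down is to check what the failing Python process itself can see. The sketch below is not a fix, only a diagnostic under assumptions: the function name `driver_visibility_report` is hypothetical, and it assumes a Linux environment where the driver's user-space library is named `libcuda.so.1` and the device nodes appear as `/dev/nvidia*`. It should be run with the same interpreter (here `/opt/conda/bin/python`, judging by the paths in the trace) and in the same environment as service.py.

```python
import ctypes
import glob
import os


def driver_visibility_report(dev_dir="/dev", lib_name="libcuda.so.1"):
    """Collect basic facts about whether this process can see the NVIDIA driver."""
    report = {}

    # Device nodes such as nvidia0 and nvidiactl are created by the kernel
    # driver; an empty list means this environment has no GPU device access.
    report["device_nodes"] = sorted(glob.glob(os.path.join(dev_dir, "nvidia*")))

    # The user-space driver library must be loadable by this process.
    try:
        ctypes.CDLL(lib_name)
        report["libcuda_loadable"] = True
    except OSError:
        report["libcuda_loadable"] = False

    # PyTorch is optional for this check; report None if it is not installed.
    try:
        import torch

        report["torch_cuda_build"] = torch.version.cuda  # None for CPU-only builds
        report["torch_cuda_available"] = torch.cuda.is_available()
    except ImportError:
        report["torch_cuda_build"] = None
        report["torch_cuda_available"] = None

    return report


if __name__ == "__main__":
    for key, value in driver_visibility_report().items():
        print(f"{key}: {value}")
```

An empty `device_nodes` list or `libcuda_loadable: False` would point at the environment running the script (for example, a container started without GPU access) rather than at the driver installation on the VM.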

Can you please advise how to solve this NVIDIA driver issue? Thank you!
