nvidia-smi Error: NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver
There seems to be an NVIDIA driver issue on the A100 40GB VM instances that I spin up in GCP Compute Engine with a boot disk storage container, since nvidia-smi returns:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
Manually installed CUDA Driver
Therefore, I manually installed a CUDA driver using the .run installer listed at https://www.nvidia.com/download/driverResults.aspx/191320/en-us/ :
wget https://us.download.nvidia.com/tesla/515.65.01/NVIDIA-Linux-x86_64-515.65.01.run
sudo sh NVIDIA-Linux-x86_64-515.65.01.run
I then verified that the CUDA driver is installed by running:
$ nvidia-smi
Fri Oct 21 10:03:18 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... Off | 00000000:00:04.0 Off | 0 |
| N/A 34C P0 52W / 400W | 0MiB / 40960MiB | 2% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
NVIDIA Driver Error: Found no NVIDIA driver on your system
However, my Python script that loads models to CUDA still fails with RuntimeError: Found no NVIDIA driver on your system. Note that this script runs successfully on GCP P100, T4, and V100 GPUs. Please see the stack trace below:
File "service.py", line 223, in download_models
config['transformer']['model'][model_name] = model_name_function_mapping[model_name](model).eval().cuda()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/core/mixins/device_dtype_mixin.py", line 128, in cuda
return super().cuda(device=device)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 688, in cuda
return self._apply(lambda t: t.cuda(device))
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
module._apply(fn)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
module._apply(fn)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
module._apply(fn)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 601, in _apply
param_applied = fn(param)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 688, in <lambda>
return self._apply(lambda t: t.cuda(device))
File "/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py", line 215, in _lazy_init
torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
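Since nvidia-smi works but PyTorch reports no driver, one possible explanation (an assumption, not confirmed by anything above) is that the Python process runs in an environment, for example a container, that cannot see the driver installed on the VM. The sketch below is a minimal standard-library check that could be run with the same interpreter as service.py. The function names are hypothetical; the paths it probes (/dev/nvidia*, /proc/driver/nvidia/version, /.dockerenv, libcuda.so.1) are the usual Linux locations and may differ on other setups.

```python
import ctypes
import glob
import os
import shutil


def can_load_libcuda(name="libcuda.so.1"):
    """Return True if the CUDA driver library can be loaded by this process."""
    try:
        ctypes.CDLL(name)
        return True
    except OSError:
        # Raised when the shared library is missing or cannot be loaded.
        return False


def driver_visibility_report(dev_glob="/dev/nvidia*",
                             proc_version="/proc/driver/nvidia/version",
                             libcuda="libcuda.so.1"):
    """Collect basic signs that the NVIDIA driver is visible to this process."""
    return {
        # Is the nvidia-smi binary reachable from this environment?
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
        # Device nodes such as /dev/nvidia0 and /dev/nvidiactl.
        "device_nodes": sorted(glob.glob(dev_glob)),
        # This file normally exists once the NVIDIA kernel module is loaded.
        "kernel_module_loaded": os.path.exists(proc_version),
        # The driver library that CUDA applications load at runtime.
        "libcuda_loadable": can_load_libcuda(libcuda),
        # Docker conventionally creates this marker file inside containers.
        "inside_docker": os.path.exists("/.dockerenv"),
    }


if __name__ == "__main__":
    for key, value in driver_visibility_report().items():
        print(f"{key}: {value}")
```

If the report shows no device nodes or libcuda cannot be loaded while nvidia-smi works in the VM's own shell, that would suggest the script's environment is isolated from the host driver, rather than the driver installation itself being broken.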
Can you please advise how to solve this NVIDIA driver issue? Thank you!