I have a GPU driver installed, and running “nvidia-smi” gives output that suggests it is working:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
|  0%   38C    P8    11W / 320W |    445MiB / 16376MiB |      9%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1363      G   /usr/lib/xorg/Xorg                252MiB |
|    0   N/A  N/A      1565      G   /usr/bin/gnome-shell               46MiB |
|    0   N/A  N/A    288347      G   ...962700789120202471,131072      113MiB |
|    0   N/A  N/A    302096      G   ...Matlab/bin/glnxa64/MATLAB        3MiB |
|    0   N/A  N/A    302398      G   ...975D29E5EB0A20ACBBF6FD973       12MiB |
+-----------------------------------------------------------------------------+
If I run “lspci -vnn | grep VGA -A 12”, I get:
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation AD103 [GeForce RTX 4080] [10de:2704] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: NVIDIA Corporation Device [10de:167a]
        Flags: bus master, fast devsel, latency 0, IRQ 149
        Memory at a1000000 (32-bit, non-prefetchable) [size=16M]
        Memory at b0000000 (64-bit, prefetchable) [size=256M]
        Memory at c0000000 (64-bit, prefetchable) [size=32M]
        I/O ports at 3000 [size=128]
        Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
        Capabilities: <access denied>
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

01:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:22bb] (rev a1)
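Since lspci only lists the kernel modules that could bind the card, I also check which ones are actually loaded with a small script (just a sketch; nvidia_uvm and nvidia_modeset are part of the driver package even though lspci does not list them, and as far as I understand nvidia_uvm is the piece CUDA compute needs):

# check which NVIDIA-related kernel modules are actually loaded
wanted = {"nvidia", "nvidia_uvm", "nvidia_drm", "nvidia_modeset", "nouveau", "nvidiafb"}
with open("/proc/modules") as f:
    loaded = {line.split()[0] for line in f}
for mod in sorted(wanted):
    print(f"{mod:15s}", "loaded" if mod in loaded else "not loaded")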
Yet I am unable to get MATLAB, or Python libraries such as TensorFlow, TensorRT, or PyTorch, to recognize my GPU. I have even tried Docker images that are supposed to expose the GPU; the containers run, but they don't use my GPU.
Here is the error I get from MATLAB when I attempt to use functions from the Parallel Computing Toolbox:
Error using gpuDevice (line 26)
No supported GPU device was found on this computer. To learn more about supported GPU devices, see
www.mathworks.com/gpudevice.
Here are some of the errors I get from Python when I attempt to use my GPU with TensorFlow, TensorRT, and PyTorch:
TensorFlow errors:
>>> import tensorflow as tf
2023-04-19 15:45:34.614095: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-04-19 15:45:34.614113: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2023-04-19 15:45:35.291067: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-04-19 15:45:35.291207: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2023-04-19 15:45:35.291214: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly
>>> tf.config.list_physical_devices('GPU')
2023-04-19 15:51:31.057174: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
2023-04-19 15:51:31.057202: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: drew-desktop
2023-04-19 15:51:31.057208: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: drew-desktop
2023-04-19 15:51:31.057263: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 525.105.17
2023-04-19 15:51:31.057281: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 525.105.17
2023-04-19 15:51:31.057287: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 525.105.17
[]
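To see whether the libraries named in those warnings are visible to the dynamic loader at all, I can run something like this (a sketch; the library names are taken straight from the warnings above, plus libcuda.so.1 from the driver itself):

import ctypes, os

# try to dlopen each library TensorFlow complained about, plus the driver library
for name in ("libcudart.so.11.0", "libnvinfer.so.7", "libnvinfer_plugin.so.7", "libcuda.so.1"):
    try:
        ctypes.CDLL(name)
        print(name, "-> found")
    except OSError as err:
        print(name, "->", err)

# the loader search path this Python process actually sees
print("LD_LIBRARY_PATH =", os.environ.get("LD_LIBRARY_PATH"))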
PyTorch errors:
>>> import torch
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/drew/.local/lib/python3.10/site-packages/torch/__init__.py", line 229, in <module>
from torch._C import * # noqa: F403
ImportError: libcudnn.so.8: cannot open shared object file: No such file or directory
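To figure out where (or whether) libcudnn.so.8 exists on this machine, I can search the locations the torch import would plausibly load it from (a sketch; the system paths are guesses for a stock Ubuntu 22.04 setup, and the site-packages paths cover pip-installed nvidia/torch wheels):

import glob, site

# candidate locations for libcudnn.so.8
patterns = []
for sp in site.getsitepackages() + [site.getusersitepackages()]:
    patterns += [sp + "/nvidia/cudnn/lib/libcudnn.so*",
                 sp + "/torch/lib/libcudnn*.so*"]
patterns += ["/usr/lib/x86_64-linux-gnu/libcudnn.so*",
             "/usr/local/cuda*/lib64/libcudnn.so*"]

hits = sorted({p for pat in patterns for p in glob.glob(pat)})
print("\n".join(hits) if hits else "no libcudnn.so.* found in the checked locations")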
TensorRT errors:
>>> import tensorrt
>>> print(tensorrt.__version__)
8.6.0
>>> assert tensorrt.Builder(tensorrt.Logger())
[04/19/2023-16:02:17] [TRT] [W] Unable to determine GPU memory usage
[04/19/2023-16:02:17] [TRT] [W] Unable to determine GPU memory usage
[04/19/2023-16:02:17] [TRT] [W] CUDA initialization failure with error: 999. Please check your CUDA installation: http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: pybind11::init(): factory function returned nullptr
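Since TensorFlow and TensorRT both fail at CUDA initialization (CUDA_ERROR_UNKNOWN / error 999), here is a check that talks to the driver API directly, with no framework in between, and also looks at the /dev/nvidia* device nodes that user-space CUDA needs (a sketch using ctypes; it assumes libcuda.so.1 from the driver package is on the loader path):

import ctypes, glob, os

# call the CUDA driver API directly
libcuda = ctypes.CDLL("libcuda.so.1")

version = ctypes.c_int(0)
print("cuDriverGetVersion:", libcuda.cuDriverGetVersion(ctypes.byref(version)), "->", version.value)

print("cuInit:", libcuda.cuInit(0))  # 0 = CUDA_SUCCESS, 999 = CUDA_ERROR_UNKNOWN

count = ctypes.c_int(0)
print("cuDeviceGetCount:", libcuda.cuDeviceGetCount(ctypes.byref(count)), "->", count.value)

# device nodes the driver exposes (nvidiactl, nvidia0, nvidia-uvm, ...)
for dev in sorted(glob.glob("/dev/nvidia*")):
    print(dev, "readable/writable:", os.access(dev, os.R_OK | os.W_OK))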
Note: I have already tried a lot of things, and I'm willing to try plenty more, because I really need to get this working. To quickly recap the more obvious ones: rebooting; setting export/library paths to the files and directories mentioned in the error/warning messages; reinstalling and rebooting; Docker; virtual environments (conda); and a few different TensorFlow and PyTorch packages that are known to work with GPU support. I am on Ubuntu 22.04.2 LTS.
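For completeness, here is the kind of environment summary I can attach for anyone answering, since the interpreter, the active conda environment, and the library paths all seem relevant (a sketch; standard library only):

import os, shutil, sys

# which interpreter and environment the imports above are actually running in
print("python          :", sys.executable)
print("conda env       :", os.environ.get("CONDA_DEFAULT_ENV"))
print("LD_LIBRARY_PATH :", os.environ.get("LD_LIBRARY_PATH"))
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("nvcc on PATH    :", shutil.which("nvcc"))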