(base) msl2@ubuntu18:~/PYTHON_ML$ nvidia-smi
Fri Feb 28 14:46:13 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.64       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1650    On   | 00000000:07:00.0 Off |                  N/A |
|  0%   36C    P8     4W /  75W |      0MiB /  3911MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
(base) msl2@ubuntu18:~/PYTHON_ML$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89
(base) msl2@ubuntu18:~/PYTHON_ML$ python3 -c 'import tensorflow as tf; print(tf.__version__)'
2.0.0
I also have Anaconda installed. If I run a simple Python script to detect the GPU, I get:
(base) msl2@ubuntu18:~/PYTHON_ML$ python3 ml_test.py
tf version 2.0.0
2020-02-28 14:51:04.100290: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX
2020-02-28 14:51:04.124396: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3093065000 Hz
2020-02-28 14:51:04.124802: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5570bf290e40 executing computations on platform Host. Devices:
2020-02-28 14:51:04.124900: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): Host, Default Version
2020-02-28 14:51:04.126074: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-02-28 14:51:04.150458: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-02-28 14:51:04.151102: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce GTX 1650 major: 7 minor: 5 memoryClockRate(GHz): 1.71
pciBusID: 0000:07:00.0
2020-02-28 14:51:04.151394: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-02-28 14:51:04.153240: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-02-28 14:51:04.154677: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-02-28 14:51:04.155064: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-02-28 14:51:04.157255: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-02-28 14:51:04.158971: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-02-28 14:51:04.162777: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-28 14:51:04.162932: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-02-28 14:51:04.163399: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-02-28 14:51:04.163773: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-02-28 14:51:04.163824: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
Traceback (most recent call last):
  File "ml_test.py", line 4, in <module>
    if tf.test.is_gpu_available():
  File "/home/msl2/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/framework/test_util.py", line 1432, in is_gpu_available
    for local_device in device_lib.list_local_devices():
  File "/home/msl2/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/client/device_lib.py", line 41, in list_local_devices
    for s in pywrap_tensorflow.list_devices(session_config=session_config)
  File "/home/msl2/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/pywrap_tensorflow_internal.py", line 2249, in list_devices
    return ListDevices()
tensorflow.python.framework.errors_impl.InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: all CUDA-capable devices are busy or unavailable
In other words, TensorFlow is able to find the GPU, but for some reason it is unavailable.
What should I check to fix this problem?
How did you install TensorFlow? It seems the pip packages aren’t built with CUDA 10.2 support yet, according to this page: Build from source | TensorFlow
If you installed with pip, that’s likely the reason. I believe you can build TensorFlow from source for CUDA 10.2 support. Alternatively, you could use NGC Containers, which are kept up to date with recent versions and released every month: TensorFlow | NVIDIA NGC
For example, the 20.02-tf2-py3 image was built with TF 2.1 and CUDA 10.2; similarly, the 20.01-tf2-py3 image was built with TF 2.0 and CUDA 10.2.
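The version mismatch suggested above can be read off the pasted output itself. Below is a minimal Python sketch, not part of the original posts, that pulls the CUDA runtime version out of TensorFlow's dso_loader log lines and compares it with the release reported by nvcc --version. The helper names cudart_versions and nvcc_release are made up for illustration; the two sample strings are copied from the logs in this thread.

```python
import re

def cudart_versions(log_text):
    """Return the sorted libcudart versions named in TensorFlow dso_loader lines."""
    pattern = re.compile(
        r"Successfully opened dynamic library libcudart\.so\.(\d+(?:\.\d+)*)"
    )
    return sorted(set(pattern.findall(log_text)))

def nvcc_release(nvcc_output):
    """Return the toolkit release from `nvcc --version` output, or None."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None

# Sample lines copied from the output posted in this thread.
tf_log = (
    "2020-02-28 14:51:04.151394: I tensorflow/stream_executor/platform/default/"
    "dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0\n"
    "2020-02-28 14:51:04.153240: I tensorflow/stream_executor/platform/default/"
    "dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0\n"
)
nvcc_out = "Cuda compilation tools, release 10.2, V10.2.89"

loaded = cudart_versions(tf_log)
toolkit = nvcc_release(nvcc_out)
print("libcudart loaded by TensorFlow:", loaded)
print("toolkit release from nvcc:", toolkit)
print("versions agree:", toolkit in loaded)
```

With the lines from this thread, the runtime loaded by TensorFlow is 10.0 while nvcc reports 10.2, which is consistent with a TensorFlow build that targets CUDA 10.0.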
I installed TensorFlow using: conda install -c anaconda tensorflow-gpu, and it seems that Anaconda’s TensorFlow also supports only CUDA 10.0. So it looks like you’re absolutely right.
Is it worth downgrading CUDA to 10.0, or might I run into a different problem?
Personally, I recommend using containers, but if you prefer to use the host environment only, then downgrading to CUDA 10.0 could be an option.
After downloading/installing CUDA 10.0, it should pretty much be as easy as pointing your PATH and LD_LIBRARY_PATH to the respective /usr/local/cuda-10.0 paths instead of /usr/local/cuda-10.2 paths as described here: Installation Guide Linux :: CUDA Toolkit Documentation
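As a rough illustration of that PATH / LD_LIBRARY_PATH switch, here is a small Python sketch that builds an environment pointing at one toolkit directory and drops entries for other /usr/local/cuda-X.Y installs. The function name cuda_env is hypothetical, and the bin and lib64 subdirectories are the ones the CUDA Linux installation guide uses for a default install; adjust them if your layout differs.

```python
import os

def cuda_env(cuda_home, base_env=None):
    """Return a copy of the environment with PATH and LD_LIBRARY_PATH
    pointing at cuda_home instead of any other /usr/local/cuda-X.Y install."""
    env = dict(os.environ if base_env is None else base_env)
    for var, subdir in (("PATH", "bin"), ("LD_LIBRARY_PATH", "lib64")):
        # Keep existing entries except those of versioned CUDA installs.
        kept = [
            p for p in env.get(var, "").split(":")
            if p and not p.startswith("/usr/local/cuda-")
        ]
        env[var] = ":".join([f"{cuda_home}/{subdir}"] + kept)
    return env

# Example with a made-up starting PATH that still points at CUDA 10.2.
env = cuda_env("/usr/local/cuda-10.0", {"PATH": "/usr/local/cuda-10.2/bin:/usr/bin"})
print(env["PATH"])
print(env["LD_LIBRARY_PATH"])
```

The returned dictionary could then be passed to something like subprocess.run(["python3", "ml_test.py"], env=env) to try the script against the 10.0 libraries without changing the shell profile.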
msl2@ubuntu18:~/Downloads$ sudo docker run --gpus all nvidia/cuda:10.0-base nvidia-smi
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
ERRO[0001] error waiting for container: context canceled
The same problem happens if I run basic TensorFlow code (I tried the most recent stable image, which I pulled with docker pull tensorflow/tensorflow):
msl2@ubuntu18:~/Downloads$ sudo docker run -it --rm tensorflow/tensorflow python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
2020-03-02 11:19:06.212580: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-03-02 11:19:06.212661: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-03-02 11:19:06.212674: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2020-03-02 11:19:06.730777: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2020-03-02 11:19:06.730802: E tensorflow/stream_executor/cuda/cuda_driver.cc:351] failed call to cuInit: UNKNOWN ERROR (303)
2020-03-02 11:19:06.730844: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] no NVIDIA GPU device is present: /dev/nvidia0 does not exist
2020-03-02 11:19:06.756702: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3109090000 Hz
2020-03-02 11:19:06.757079: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x564456264620 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-03-02 11:19:06.757112: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
tf.Tensor(43.12938, shape=(), dtype=float32)
msl2@ubuntu18:~/Downloads$
Just to make sure: I do not need to install the CUDA toolkit on the host to run code in the container. The tensorflow/tensorflow image or nvidia-container-toolkit should have everything inside, and I only need to install the NVIDIA driver on the system. Is my understanding correct?
Assuming nvidia-docker is working properly, the errors in the Tensorflow container are likely because the container was not built with GPU (CUDA) support as a base. From glancing at the bottom of this page, Docker Hub, it looks like the default tag for tensorflow/tensorflow is a CPU container only.
All of the NGC container images are built with GPU support in mind, such as TensorFlow | NVIDIA NGC, and I would recommend using these. Alternatively, tensorflow/tensorflow:latest-gpu may work. In general, I would avoid “latest” tags, as they are commonly changed/updated, so you’re not always running the same image and may not get reproducible results. I tend to stick with the newest branched-off tag, so I know exactly what I’m running.
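To make the point about “latest” tags concrete, here is a short Python sketch that classifies the image references mentioned in this thread as pinned or floating. The helpers image_tag and is_pinned are invented for this example, and treating any tag that starts with "latest" as floating is a simplifying assumption rather than a Docker rule. An image reference with no tag at all resolves to "latest" in Docker.

```python
def image_tag(image):
    """Return the tag of an image reference, or None if no tag is given."""
    last = image.rsplit("/", 1)[-1]  # ignore any registry host:port prefix
    return last.split(":", 1)[1] if ":" in last else None

def is_pinned(image):
    """Treat untagged images and latest* tags as floating (an assumption)."""
    tag = image_tag(image)
    return tag is not None and not tag.startswith("latest")

images = (
    "tensorflow/tensorflow",
    "tensorflow/tensorflow:latest-gpu",
    "nvcr.io/nvidia/tensorflow:20.02-tf2-py3",
)
for image in images:
    print(image, "->", "pinned" if is_pinned(image) else "floating")
```

A check like this only looks at the name; a versioned tag can in principle still be re-published, so it is a convenience rather than a guarantee of reproducibility.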
Thank you for your suggestions. The first half of the task is somehow done (not sure how :).
(base) msl2@ubuntu18:~$ sudo docker run --gpus all nvidia/cuda:10.2-base nvidia-smi
[sudo] password for msl2:
Mon Mar 2 14:38:42 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64       Driver Version: 440.64       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1650    Off  | 00000000:07:00.0 Off |                  N/A |
| 51%   33C    P0     1W /  75W |      0MiB /  3911MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
(base) msl2@ubuntu18:~$
I assume that this is what is supposed to happen. But I have no luck with the TensorFlow image.
(base) msl2@ubuntu18:~$ sudo docker run -it --rm nvcr.io/nvidia/tensorflow:20.02-tf2-py3 python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
================
== TensorFlow ==
================
NVIDIA Release 20.02-tf2 (build 9892252)
TensorFlow Version 2.1.0
Container image Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
Copyright 2017-2019 The TensorFlow Authors. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION. All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.
WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
Use 'nvidia-docker run' to start this container; see
https://github.com/NVIDIA/nvidia-docker/wiki/nvidia-docker .
NOTE: MOFED driver for multi-node communication was not detected.
Multi-node communication performance may be reduced.
NOTE: The SHMEM allocation limit is set to the default of 64MB. This may be
insufficient for TensorFlow. NVIDIA recommends the use of the following flags:
nvidia-docker run --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 ...
2020-03-02 14:42:21.042233: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
2020-03-02 14:42:21.811238: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.7
2020-03-02 14:42:21.812093: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.7
2020-03-02 14:42:22.456781: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-03-02 14:42:22.456811: E tensorflow/stream_executor/cuda/cuda_driver.cc:351] failed call to cuInit: UNKNOWN ERROR (303)
2020-03-02 14:42:22.456837: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] no NVIDIA GPU device is present: /dev/nvidia0 does not exist
2020-03-02 14:42:22.484636: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3109090000 Hz
2020-03-02 14:42:22.485147: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5696510 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-03-02 14:42:22.485220: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
tf.Tensor(-589.91016, shape=(), dtype=float32)
(base) msl2@ubuntu18:~$ nvidia-smi
Mon Mar 2 07:45:13 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64       Driver Version: 440.64       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1650    Off  | 00000000:07:00.0 Off |                  N/A |
| 51%   34C    P0     1W /  75W |      0MiB /  3911MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
(base) msl2@ubuntu18:~$
It says that the NVIDIA driver was not detected, but it is installed on the system and running (see the bottom of the code window).
It also says, “Could not load dynamic library ‘libcuda.so.1’; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64”
But shouldn’t CUDA already be built into the image?
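One visible difference between the two docker commands in this post is that the nvidia-smi run passes --gpus all, while the TensorFlow run does not, and the container banner itself asks for 'nvidia-docker run'. The Python sketch below only checks the two command lines copied from above for a GPU request; the helper name requests_gpus is hypothetical, and whether adding --gpus all resolves the error is an assumption that this thread does not confirm.

```python
import shlex

def requests_gpus(command_line):
    """True if the command uses the nvidia-docker wrapper or passes --gpus."""
    argv = shlex.split(command_line)
    uses_wrapper = "nvidia-docker" in argv
    has_flag = any(a == "--gpus" or a.startswith("--gpus=") for a in argv)
    return uses_wrapper or has_flag

# Both command lines are copied from the terminal output in this thread.
smi_cmd = "sudo docker run --gpus all nvidia/cuda:10.2-base nvidia-smi"
tf_cmd = (
    "sudo docker run -it --rm nvcr.io/nvidia/tensorflow:20.02-tf2-py3 "
    'python -c "import tensorflow as tf; '
    'print(tf.reduce_sum(tf.random.normal([1000, 1000])))"'
)

print("nvidia-smi run requests GPUs:", requests_gpus(smi_cmd))
print("TensorFlow run requests GPUs:", requests_gpus(tf_cmd))
```

If the TensorFlow command is the one that reports no GPU request, retrying it with the same --gpus all option used for the nvidia/cuda image would be a cheap first thing to test.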
I have much the same issue. I have a TensorRT model and another TF model, and when I run them in a TRT Docker container I get this same error. Any suggestions would be very helpful. Thanks!