all CUDA-capable devices are busy or unavailable. What is wrong?

Hello,


I have the following configuration:

(base) msl2@ubuntu18:~/PYTHON_ML$ nvidia-smi
Fri Feb 28 14:46:13 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.64       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1650    On   | 00000000:07:00.0 Off |                  N/A |
|  0%   36C    P8     4W /  75W |      0MiB /  3911MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

(base) msl2@ubuntu18:~/PYTHON_ML$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89

(base) msl2@ubuntu18:~/PYTHON_ML$ python3 -c 'import tensorflow as tf; print(tf.__version__)' 
2.0.0

I also have Anaconda installed. When I run a simple Python script to detect the GPU, I get:

(base) msl2@ubuntu18:~/PYTHON_ML$ python3 ml_test.py 
tf version 2.0.0
2020-02-28 14:51:04.100290: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX
2020-02-28 14:51:04.124396: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3093065000 Hz
2020-02-28 14:51:04.124802: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5570bf290e40 executing computations on platform Host. Devices:
2020-02-28 14:51:04.124900: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
2020-02-28 14:51:04.126074: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-02-28 14:51:04.150458: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-02-28 14:51:04.151102: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: GeForce GTX 1650 major: 7 minor: 5 memoryClockRate(GHz): 1.71
pciBusID: 0000:07:00.0
2020-02-28 14:51:04.151394: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-02-28 14:51:04.153240: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-02-28 14:51:04.154677: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-02-28 14:51:04.155064: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-02-28 14:51:04.157255: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-02-28 14:51:04.158971: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-02-28 14:51:04.162777: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-28 14:51:04.162932: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-02-28 14:51:04.163399: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-02-28 14:51:04.163773: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-02-28 14:51:04.163824: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
Traceback (most recent call last):
  File "ml_test.py", line 4, in <module>
    if tf.test.is_gpu_available():
  File "/home/msl2/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/framework/test_util.py", line 1432, in is_gpu_available
    for local_device in device_lib.list_local_devices():
  File "/home/msl2/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/client/device_lib.py", line 41, in list_local_devices
    for s in pywrap_tensorflow.list_devices(session_config=session_config)
  File "/home/msl2/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/pywrap_tensorflow_internal.py", line 2249, in list_devices
    return ListDevices()
tensorflow.python.framework.errors_impl.InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: all CUDA-capable devices are busy or unavailable

In other words, TensorFlow is able to find the GPU, but for some reason it is unavailable.
What should I check to fix this problem?

Hi,

How did you install TensorFlow? It seems the pip packages aren’t built with CUDA 10.2 support yet, according to this page: https://www.tensorflow.org/install/source#gpu

If you installed with pip, that’s likely the reason. I believe you can build TensorFlow from source for CUDA 10.2 support. Alternatively, you could use NGC Containers, which are kept up to date with recent versions and released every month: https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow

For example, the 20.02-tf2-py3 image was built with TF 2.1 and CUDA 10.2. Similarly, the 20.01-tf2-py3 image was built with TF 2.0 and CUDA 10.2.
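
The two versions being compared here can be read from the output in the first post: the TensorFlow log shows it opening libcudart.so.10.0, while nvcc reports release 10.2. Below is a minimal Python sketch for pulling both numbers out of pasted text. The function names are made up for illustration, and the regular expressions assume only the line formats shown in this thread.

```python
import re


def loaded_cudart_versions(log_text):
    """Return the set of CUDA runtime versions named in the TensorFlow log."""
    # Matches library names such as "libcudart.so.10.0" in dso_loader lines.
    return set(re.findall(r"libcudart\.so\.(\d+\.\d+)", log_text))


def nvcc_release(nvcc_output):
    """Return the "release X.Y" version from `nvcc --version` output, or None."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


# Lines copied from the first post.
tf_log = "Successfully opened dynamic library libcudart.so.10.0"
nvcc_out = "Cuda compilation tools, release 10.2, V10.2.89"

runtime_versions = loaded_cudart_versions(tf_log)
toolkit_version = nvcc_release(nvcc_out)
if toolkit_version not in runtime_versions:
    print("TensorFlow loaded CUDA runtime", sorted(runtime_versions),
          "but the installed toolkit is", toolkit_version)
```

This only reports that the versions differ. It does not prove that the difference is what makes the device unavailable.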

I installed TensorFlow using: conda install -c anaconda tensorflow-gpu, and it seems that Anaconda’s TensorFlow also supports only CUDA 10.0. So it looks like you’re absolutely right.

Is it worth downgrading CUDA to 10.0, or might I run into a different problem?

Personally, I recommend using containers, but if you prefer to use the host environment only, then downgrading to CUDA 10.0 could be an option.

After downloading/installing CUDA 10.0, it should pretty much be as easy as pointing your PATH and LD_LIBRARY_PATH to the respective /usr/local/cuda-10.0 paths instead of /usr/local/cuda-10.2 paths as described here: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#environment-setup

You’ll also likely need to download the cuDNN .tar.gz file built for CUDA 10.0, extract it, and copy its contents into the corresponding /usr/local/cuda-10.0 paths as described here: https://docs.nvidia.com/deeplearning/sdk/cudnn-install/index.html#installlinux-tar
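
As a rough illustration of the PATH and LD_LIBRARY_PATH change described above, here is a small Python sketch that builds an environment with the CUDA 10.0 directories placed first. The helper name cuda_env is hypothetical, and the bin and lib64 subdirectories are assumed to follow the usual layout from the installation guide. It prepends the new directories rather than removing any existing cuda-10.2 entries.

```python
import os


def cuda_env(cuda_home, base_env=None):
    """Return a copy of the environment with cuda_home's directories first.

    Assumes the usual toolkit layout: <cuda_home>/bin and <cuda_home>/lib64.
    """
    env = dict(os.environ if base_env is None else base_env)
    bin_dir = cuda_home + "/bin"
    lib_dir = cuda_home + "/lib64"
    # Linux separates search-path entries with ":"; skip empty old values
    # so that no trailing ":" is produced.
    env["PATH"] = ":".join(p for p in (bin_dir, env.get("PATH", "")) if p)
    env["LD_LIBRARY_PATH"] = ":".join(
        p for p in (lib_dir, env.get("LD_LIBRARY_PATH", "")) if p
    )
    return env


env = cuda_env("/usr/local/cuda-10.0", {"PATH": "/usr/bin"})
print(env["PATH"])
print(env["LD_LIBRARY_PATH"])
```

The resulting dictionary could be passed as the env argument of subprocess.run to launch a script with these settings without touching the shell profile.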

I decided to follow your recommendation and tried to use a container. I reinstalled Ubuntu 18.04 to make sure that there are no driver conflicts.

  • I installed the latest NVIDIA driver:
NVIDIA-SMI 440.64       Driver Version: 440.64       CUDA Version: 10.2
msl2@ubuntu18:~/Downloads$ sudo docker version
Client: Docker Engine - Community
 Version:           19.03.6
 API version:       1.40
 Go version:        go1.12.16
 Git commit:        369ce74a3c
 Built:             Thu Feb 13 01:27:49 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.6
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.16
  Git commit:       369ce74a3c
  Built:            Thu Feb 13 01:26:21 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.10
  GitCommit:        b34a5c8af56e510852c35414db4c1f4fa6172339
 runc:
  Version:          1.0.0-rc8+dev
  GitCommit:        3e425f80a8c931f88e6d94a8c831b9d5aa481657
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683
msl2@ubuntu18:~/Downloads$ sudo docker run hello-world

Hello from Docker!
This message shows that your installation appears to be working correctly.

msl2@ubuntu18:~/Downloads$

So it looks like Docker is working.
Next, I installed nvidia-docker from https://github.com/NVIDIA/nvidia-docker

# Add the package repositories
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

But when I run:

msl2@ubuntu18:~/Downloads$ sudo docker run --gpus all nvidia/cuda:10.0-base nvidia-smi
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
ERRO[0001] error waiting for container: context canceled

The same problem happens if I run basic TensorFlow code. (I tried to run the most recent stable image, which I pulled with docker pull tensorflow/tensorflow.)

msl2@ubuntu18:~/Downloads$ sudo docker run -it --rm tensorflow/tensorflow    python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
2020-03-02 11:19:06.212580: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-03-02 11:19:06.212661: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-03-02 11:19:06.212674: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2020-03-02 11:19:06.730777: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2020-03-02 11:19:06.730802: E tensorflow/stream_executor/cuda/cuda_driver.cc:351] failed call to cuInit: UNKNOWN ERROR (303)
2020-03-02 11:19:06.730844: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] no NVIDIA GPU device is present: /dev/nvidia0 does not exist
2020-03-02 11:19:06.756702: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3109090000 Hz
2020-03-02 11:19:06.757079: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x564456264620 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-03-02 11:19:06.757112: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
tf.Tensor(43.12938, shape=(), dtype=float32)
msl2@ubuntu18:~/Downloads$

Just to make sure: I do not need to install the CUDA toolkit to run the code in the container. The tensorflow/tensorflow image or nvidia-container-toolkit should have everything inside, and I only need to install the NVIDIA driver on the system. Is my understanding correct?

For the nvidia-docker issue, it’s probably fixed by this comment from Renaud, one of the main container devs: https://github.com/NVIDIA/nvidia-docker/issues/1034#issuecomment-520282450

Assuming nvidia-docker is working properly, the errors in the Tensorflow container are likely because the container was not built with GPU (CUDA) support as a base. From glancing at the bottom of this page, https://hub.docker.com/r/tensorflow/tensorflow/, it looks like the default tag for tensorflow/tensorflow is a CPU container only.

All of the NGC container images are built with GPU support in mind, such as https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow, and I would recommend using these. Alternatively, the tensorflow/tensorflow:latest-gpu image may work. In general, I would avoid using “latest” tags, as they are frequently changed or updated, so you’re not always running the same image and may not get reproducible results. I tend to stick with the newest branched-off tag, so I know exactly what I’m running.
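
To make the point about “latest” tags concrete, here is a small Python sketch that flags image references which are not pinned to a specific tag. The function names are hypothetical, and treating every tag that starts with "latest" (for example latest-gpu) as floating is an assumption made for this example, not a Docker rule.

```python
def image_tag(image_ref):
    """Return the tag of a Docker image reference, or None if it has none."""
    name = image_ref.split("@", 1)[0]      # drop a digest such as @sha256:...
    last_part = name.rsplit("/", 1)[-1]    # a ":" before the last "/" is a registry port
    if ":" in last_part:
        return last_part.rsplit(":", 1)[1]
    return None


def is_floating(image_ref):
    """True if the reference can silently point at a different image later."""
    if "@" in image_ref:
        return False                       # pinned by digest
    tag = image_tag(image_ref)
    # With no tag, Docker falls back to "latest".
    return tag is None or tag.startswith("latest")


for ref in ("tensorflow/tensorflow",
            "tensorflow/tensorflow:latest-gpu",
            "nvcr.io/nvidia/tensorflow:20.02-tf2-py3"):
    print(ref, "floating" if is_floating(ref) else "pinned")
```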

Thank you for your suggestions. The first half of the task is somehow done (not sure how :).

(base) msl2@ubuntu18:~$ sudo docker run --gpus all nvidia/cuda:10.2-base nvidia-smi
[sudo] password for msl2: 
Mon Mar  2 14:38:42 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64       Driver Version: 440.64       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1650    Off  | 00000000:07:00.0 Off |                  N/A |
| 51%   33C    P0     1W /  75W |      0MiB /  3911MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
(base) msl2@ubuntu18:~$

I assume that this is what is supposed to happen. But I have no luck with the TensorFlow image.

(base) msl2@ubuntu18:~$ sudo docker run -it --rm nvcr.io/nvidia/tensorflow:20.02-tf2-py3 python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
                                                                                                                                                
================
== TensorFlow ==
================

NVIDIA Release 20.02-tf2 (build 9892252)
TensorFlow Version 2.1.0

Container image Copyright (c) 2019, NVIDIA CORPORATION.  All rights reserved.
Copyright 2017-2019 The TensorFlow Authors.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.

WARNING: The NVIDIA Driver was not detected.  GPU functionality will not be available.
   Use 'nvidia-docker run' to start this container; see
   https://github.com/NVIDIA/nvidia-docker/wiki/nvidia-docker .

NOTE: MOFED driver for multi-node communication was not detected.
      Multi-node communication performance may be reduced.

NOTE: The SHMEM allocation limit is set to the default of 64MB.  This may be
   insufficient for TensorFlow.  NVIDIA recommends the use of the following flags:
   nvidia-docker run --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 ...

2020-03-02 14:42:21.042233: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
2020-03-02 14:42:21.811238: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.7
2020-03-02 14:42:21.812093: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.7
2020-03-02 14:42:22.456781: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-03-02 14:42:22.456811: E tensorflow/stream_executor/cuda/cuda_driver.cc:351] failed call to cuInit: UNKNOWN ERROR (303)
2020-03-02 14:42:22.456837: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] no NVIDIA GPU device is present: /dev/nvidia0 does not exist
2020-03-02 14:42:22.484636: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3109090000 Hz
2020-03-02 14:42:22.485147: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5696510 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-03-02 14:42:22.485220: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
tf.Tensor(-589.91016, shape=(), dtype=float32)
(base) msl2@ubuntu18:~$ nvidia-smi
Mon Mar  2 07:45:13 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64       Driver Version: 440.64       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1650    Off  | 00000000:07:00.0 Off |                  N/A |
| 51%   34C    P0     1W /  75W |      0MiB /  3911MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
(base) msl2@ubuntu18:~$

It says that the NVIDIA driver was not detected, but it is installed on the system and running (see the bottom of the code window).
It also says, “Could not load dynamic library ‘libcuda.so.1’; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64”
But shouldn’t CUDA already be built into the image?

Hi,

Looks like you omitted “--gpus all” in the TensorFlow container docker run command. This is probably why it said no driver was detected.

Let me know if that works for you.
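
For reference, a corrected invocation could be assembled along these lines. This is only a sketch in Python: the helper name gpu_run_command is made up, the shared-memory and ulimit flags are the ones recommended in the container's startup banner above, and the code builds and prints the command rather than running Docker.

```python
import shlex


def gpu_run_command(image, command, gpus="all"):
    """Build a `docker run` argument list that exposes GPUs to the container."""
    argv = [
        "docker", "run",
        "--gpus", gpus,              # the flag that was missing
        "-it", "--rm",
        "--shm-size=1g",             # flags suggested by the container banner
        "--ulimit", "memlock=-1",
        "--ulimit", "stack=67108864",
        image,
    ]
    return argv + list(command)


argv = gpu_run_command(
    "nvcr.io/nvidia/tensorflow:20.02-tf2-py3",
    ["python", "-c",
     "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"],
)
# Print a copy-pasteable shell command with proper quoting.
print(" ".join(shlex.quote(arg) for arg in argv))
```

Docker options have to come before the image name. Anything after the image name is passed to the container as its command.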

Hello,

Yes, that was the reason. Thank you for your guidance; hopefully the container will work for me.