No CUDA-capable device is detected on tao detectnet_v2 dataset_convert

Please provide the following information when requesting support.

• Hardware (RTX 2080 Ti)
• Network Type (DetectNet_v2)
• TAO Version (nvidia/tao/tao-toolkit-tf, nvidia/tao/tao-toolkit-pyt, nvidia/tao/tao-toolkit-lm)
• Training spec file (the default from detectnet_v2)
• How to reproduce the issue?

I followed the steps in the TAO Toolkit Quick Start Guide — TAO Toolkit 3.22.05 documentation

and the detectnet_v2/detectnet_v2.ipynb notebook from TAO Toolkit Computer Vision Sample Workflows | NVIDIA NGC,

using this container: TAO Toolkit for Computer Vision | NVIDIA NGC

First, I run the container with:

docker run --gpus all --privileged -it -v /var/run/docker.sock:/var/run/docker.sock --network host nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.4-py3

Second, I follow the steps in the TAO Toolkit Quick Start, download the model (PeopleNet), and get the Jupyter notebook running with:

jupyter notebook --ip 0.0.0.0 --port 8888 --allow-root

I follow the detectnet_v2 notebook without problems until I get to section 2.C:

tao detectnet_v2 dataset_convert \
                  -d $SPECS_DIR/detectnet_v2_tfrecords_kitti_trainval.txt \
                  -o $DATA_DOWNLOAD_DIR/tfrecords/kitti_trainval/kitti_trainval
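
(For reference, the -d argument points at the stock KITTI conversion spec that ships with the notebook. It looks roughly like this; the paths are the notebook defaults and may differ on your setup:)

kitti_config {
  root_directory_path: "/workspace/tao-experiments/data/training"
  image_dir_name: "image_2"
  label_dir_name: "label_2"
  image_extension: ".png"
  partition_mode: "random"
  num_partitions: 2
  val_split: 14
  num_shards: 10
}
image_directory_path: "/workspace/tao-experiments/data/training"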

It shows me the following error:

Converting Tfrecords for kitti trainval dataset
2021-12-06 20:24:46,566 [INFO] root: Registry: ['nvcr.io']
2021-12-06 20:24:46,622 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.4-py3
2021-12-06 20:24:46,690 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/root/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
Using TensorFlow backend.
Traceback (most recent call last):
  File "/usr/local/bin/detectnet_v2", line 8, in <module>
    sys.exit(main())
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/entrypoint/detectnet_v2.py", line 12, in main
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 256, in launch_job
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 47, in get_modules
  File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/export.py", line 8, in <module>
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/export/exporter.py", line 12, in <module>
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/keras_exporter.py", line 22, in <module>
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/export/_tensorrt.py", line 27, in <module>
  File "/usr/local/lib/python3.6/dist-packages/pycuda/autoinit.py", line 5, in <module>
    cuda.init()
pycuda._driver.RuntimeError: cuInit failed: no CUDA-capable device is detected
2021-12-06 20:24:51,865 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

nvidia-smi

Mon Dec  6 22:23:36 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 495.44       CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:0A:00.0  On |                  N/A |
| 35%   33C    P8    30W / 260W |   1043MiB / 11016MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

dpkg -l | grep cuda

ii  cuda-command-line-tools-11-3    11.3.1-1                            amd64        CUDA command-line tools
ii  cuda-compat-11-3                465.19.01-1                         amd64        CUDA Compatibility Platform
ii  cuda-compiler-11-3              11.3.1-1                            amd64        CUDA compiler
ii  cuda-cudart-11-3                11.3.109-1                          amd64        CUDA Runtime native Libraries
ii  cuda-cudart-dev-11-3            11.3.109-1                          amd64        CUDA Runtime native dev links, headers
ii  cuda-cuobjdump-11-3             11.3.58-1                           amd64        CUDA cuobjdump
ii  cuda-cupti-11-3                 11.3.111-1                          amd64        CUDA profiling tools runtime libs.
ii  cuda-cupti-dev-11-3             11.3.111-1                          amd64        CUDA profiling tools interface.
ii  cuda-cuxxfilt-11-3              11.3.58-1                           amd64        CUDA cuxxfilt
ii  cuda-driver-dev-11-3            11.3.109-1                          amd64        CUDA Driver native dev stub library
ii  cuda-gdb-11-3                   11.3.109-1                          amd64        CUDA-GDB
ii  cuda-libraries-11-3             11.3.1-1                            amd64        CUDA Libraries 11.3 meta-package
ii  cuda-libraries-dev-11-3         11.3.1-1                            amd64        CUDA Libraries 11.3 development meta-package
ii  cuda-memcheck-11-3              11.3.109-1                          amd64        CUDA-MEMCHECK
ii  cuda-minimal-build-11-3         11.3.1-1                            amd64        Minimal CUDA 11.3 toolkit build packages.
ii  cuda-nvcc-11-3                  11.3.109-1                          amd64        CUDA nvcc
ii  cuda-nvdisasm-11-3              11.3.58-1                           amd64        CUDA disassembler
ii  cuda-nvml-dev-11-3              11.3.58-1                           amd64        NVML native dev links, headers
ii  cuda-nvprof-11-3                11.3.111-1                          amd64        CUDA Profiler tools
ii  cuda-nvprune-11-3               11.3.58-1                           amd64        CUDA nvprune
ii  cuda-nvrtc-11-1                 11.1.74-1                           amd64        NVRTC native runtime libraries
ii  cuda-nvrtc-11-3                 11.3.109-1                          amd64        NVRTC native runtime libraries
ii  cuda-nvrtc-dev-11-1             11.1.74-1                           amd64        NVRTC native dev links, headers
ii  cuda-nvrtc-dev-11-3             11.3.109-1                          amd64        NVRTC native dev links, headers
ii  cuda-nvtx-11-3                  11.3.109-1                          amd64        NVIDIA Tools Extension
ii  cuda-sanitizer-11-3             11.3.111-1                          amd64        CUDA Sanitizer
ii  cuda-thrust-11-3                11.3.109-1                          amd64        CUDA Thrust
ii  cuda-toolkit-11-3-config-common 11.3.109-1                          all          Common config package for CUDA Toolkit 11.3.
ii  cuda-toolkit-11-config-common   11.4.108-1                          all          Common config package for CUDA Toolkit 11.
ii  cuda-toolkit-config-common      11.4.108-1                          all          Common config package for CUDA Toolkit.
hi  libcudnn8                       8.2.1.32-1+cuda11.3                 amd64        cuDNN runtime libraries
ii  libcudnn8-dev                   8.2.1.32-1+cuda11.3                 amd64        cuDNN development libraries and headers
hi  libnccl-dev                     2.9.9-1+cuda11.3                    amd64        NVIDIA Collective Communication Library (NCCL) Development Files
hi  libnccl2                        2.9.9-1+cuda11.3                    amd64        NVIDIA Collective Communication Library (NCCL) Runtime
ii  libnvinfer-bin                  8.0.1-1+cuda11.3                    amd64        TensorRT binaries
ii  libnvinfer-dev                  8.0.1-1+cuda11.3                    amd64        TensorRT development libraries and headers
ii  libnvinfer-plugin-dev           8.0.1-1+cuda11.3                    amd64        TensorRT plugin libraries
ii  libnvinfer-plugin8              8.0.1-1+cuda11.3                    amd64        TensorRT plugin libraries
ii  libnvinfer8                     8.0.1-1+cuda11.3                    amd64        TensorRT runtime libraries
ii  libnvonnxparsers-dev            8.0.1-1+cuda11.3                    amd64        TensorRT ONNX libraries
ii  libnvonnxparsers8               8.0.1-1+cuda11.3                    amd64        TensorRT ONNX libraries
ii  libnvparsers-dev                8.0.1-1+cuda11.3                    amd64        TensorRT parsers libraries
ii  libnvparsers8                   8.0.1-1+cuda11.3                    amd64        TensorRT parsers libraries

Testing cuda.init() in Python:

python
Python 3.6.9 (default, Jan 26 2021, 15:33:00) 
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pycuda
>>> import pycuda.driver as cuda
>>> cuda.init()
>>> 

nvcc -V

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0

I wonder what might be causing this error; the DeepStream container runs fine here.

Can you add “--runtime=nvidia” and retry?

For example,
$ docker run --runtime=nvidia -it --rm -v /var/run/docker.sock:/var/run/docker.sock nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.4-py3 /bin/bash


Hi @Morganh,

I tried with --runtime=nvidia, but there is no runtime called nvidia:

docker: Error response from daemon: Unknown runtime specified nvidia.

I am using Docker 20.10.10 + nvidia-container-toolkit 1.5.1-1. I read on GitHub that I should not need nvidia-docker2 (--runtime=nvidia), because the nvidia-container-toolkit is invoked when I pass --gpus.
It has worked for every other NVIDIA container so far.

I will install the nvidia-docker2 package and try again.
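
(For reference, nvidia-docker2 essentially registers the nvidia runtime in /etc/docker/daemon.json; after installing it, the file should contain something like the snippet below, followed by a Docker restart. This is the standard layout rather than anything TAO-specific.)

{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

sudo systemctl restart docker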

*Update:

Same error:

Converting Tfrecords for kitti trainval dataset
2021-12-07 18:12:40,514 [INFO] root: Registry: ['nvcr.io']
2021-12-07 18:12:40,569 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.4-py3
2021-12-07 18:12:40,688 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/root/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
Using TensorFlow backend.
Traceback (most recent call last):
  File "/usr/local/bin/detectnet_v2", line 8, in <module>
    sys.exit(main())
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/entrypoint/detectnet_v2.py", line 12, in main
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 256, in launch_job
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 47, in get_modules
  File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/export.py", line 8, in <module>
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/export/exporter.py", line 12, in <module>
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/keras_exporter.py", line 22, in <module>
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/export/_tensorrt.py", line 27, in <module>
  File "/usr/local/lib/python3.6/dist-packages/pycuda/autoinit.py", line 5, in <module>
    cuda.init()
pycuda._driver.RuntimeError: cuInit failed: no CUDA-capable device is detected
2021-12-07 18:12:45,630 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Can you run the command below and share the full log, including the command line?

$ docker run --runtime=nvidia -it --rm -v /var/run/docker.sock:/var/run/docker.sock nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.4-py3  /bin/bash

and then

#python
>>> import pycuda
>>> import pycuda.driver as cuda
>>> cuda.init()

docker run --runtime=nvidia -it --rm -v /var/run/docker.sock:/var/run/docker.sock nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.4-py3 /bin/bash

--2021-12-08 01:26:56--  https://ngc.nvidia.com/downloads/ngccli_reg_linux.zip
Resolving ngc.nvidia.com (ngc.nvidia.com)... 13.249.184.59, 13.249.184.95, 13.249.184.53, ...
Connecting to ngc.nvidia.com (ngc.nvidia.com)|13.249.184.59|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25122731 (24M) [application/zip]
Saving to: ‘/opt/ngccli/ngccli_reg_linux.zip’

ngccli_reg_linux.zip                        100%[=========================================================================================>]  23.96M  24.0MB/s    in 1.0s    

2021-12-08 01:26:57 (24.0 MB/s) - ‘/opt/ngccli/ngccli_reg_linux.zip’ saved [25122731/25122731]

Archive:  /opt/ngccli/ngccli_reg_linux.zip
  inflating: /opt/ngccli/ngc         
 extracting: /opt/ngccli/ngc.md5     
root@70165256166f:/workspace# python
Python 3.6.9 (default, Jan 26 2021, 15:33:00) 
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pycuda
>>> import pycuda.driver as cuda
>>> cuda.init()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
pycuda._driver.RuntimeError: cuInit failed: no CUDA-capable device is detected

That is quite expected. For it to work I need to use the --privileged flag (see cgroup issue with nvidia container runtime on Debian testing · Issue #1447 · NVIDIA/nvidia-docker · GitHub)
or pass the device nodes explicitly (see CgroupV2 support · Issue #111 · NVIDIA/libnvidia-container · GitHub); a rough sketch of that second workaround is after the session below.

 docker run --runtime=nvidia -it --rm --privileged -v /var/run/docker.sock:/var/run/docker.sock nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.4-py3  /bin/bash

--2021-12-08 01:32:57--  https://ngc.nvidia.com/downloads/ngccli_reg_linux.zip
Resolving ngc.nvidia.com (ngc.nvidia.com)... 13.249.184.59, 13.249.184.53, 13.249.184.51, ...
Connecting to ngc.nvidia.com (ngc.nvidia.com)|13.249.184.59|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25122731 (24M) [application/zip]
Saving to: ‘/opt/ngccli/ngccli_reg_linux.zip’

ngccli_reg_linux.zip               100%[===============================================================>]  23.96M  24.6MB/s    in 1.0s    

2021-12-08 01:32:59 (24.6 MB/s) - ‘/opt/ngccli/ngccli_reg_linux.zip’ saved [25122731/25122731]

Archive:  /opt/ngccli/ngccli_reg_linux.zip
  inflating: /opt/ngccli/ngc         
 extracting: /opt/ngccli/ngc.md5     
root@487e95a16067:/workspace# python
Python 3.6.9 (default, Jan 26 2021, 15:33:00) 
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pycuda
>>> import pycuda.driver as cuda
>>> cuda.init()
>>>
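
(For completeness, the non-privileged workaround from issue #111 is to disable cgroup handling in the NVIDIA container runtime and pass the device nodes in manually. Roughly, and the exact device list can vary per machine:)

# in /etc/nvidia-container-runtime/config.toml, under [nvidia-container-cli]:
#   no-cgroups = true

docker run --gpus all -it --rm \
    --device /dev/nvidia0 --device /dev/nvidiactl \
    --device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools \
    -v /var/run/docker.sock:/var/run/docker.sock \
    nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.4-py3 /bin/bash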

I also tried passing parameters when calling tao, e.g.

tao detectnet_v2 --runtime=nvidia --gpus=all --privileged dataset_convert 

Same error.
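
(Judging from the warning in the log, I suppose docker options for the launcher are meant to go into ~/.tao_mounts.json rather than onto the command line; something like the sketch below, with placeholder paths and IDs. I have not verified which keys besides "user" this launcher version honors.)

{
    "Mounts": [
        {
            "source": "/home/<user>/tao-experiments",
            "destination": "/workspace/tao-experiments"
        }
    ],
    "DockerOptions": {
        "user": "1000:1000"
    }
}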

Since cuda.init() succeeds when you start docker with the --privileged flag, can you run dataset_convert directly inside the docker container?

i.e.

root@487e95a16067:/workspace# detectnet_v2 dataset_convert \
                  -d $SPECS_DIR/detectnet_v2_tfrecords_kitti_trainval.txt \
                  -o $DATA_DOWNLOAD_DIR/tfrecords/kitti_trainval/kitti_trainval
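
(Note: $SPECS_DIR and $DATA_DOWNLOAD_DIR are variables set by the notebook, so they will not exist in a bare container shell; export them first to wherever your specs and data are mounted inside the container. The paths below are only examples:)

export SPECS_DIR=/workspace/examples/detectnet_v2/specs
export DATA_DOWNLOAD_DIR=/workspace/tao-experiments/data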

Yeah, that works flawlessly!

I guess inside the TAO container, I don’t need to call “tao”.

Thank you very much for your help, @Morganh. I will check the other commands.

Yes, inside the TAO container there is no need to run the launcher.

More info can be found in
https://docs.nvidia.com/tao/tao-toolkit/text/tao_launcher.html#running-the-launcher


May I know which OS you are running? Ubuntu or Debian?

Manjaro for this thread.

The computer in the office that doesn't need that flag is running Ubuntu 18, if I am not mistaken.

So, on Manjaro you can run it by adding "--privileged", right?

docker run --runtime=nvidia -it --rm --privileged -v /var/run/docker.sock:/var/run/docker.sock nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.4-py3 /bin/bash

Yeah, at some point Arch- and Debian-based Linux started getting errors like “Failed to initialize NVML: Unknown Error” (I haven’t tried other versions). It was a problem related to cgroups; adding the --privileged flag let the containers run without issues.

It seems they fixed that issue in the experimental branch: cgroup issue with nvidia container runtime on Debian testing · Issue #1447 · NVIDIA/nvidia-docker · GitHub

To run on Manjaro I use:

  • docker 20.10
  • nvidia-container-toolkit 1.5.1

docker run --gpus all -it --rm --privileged -v /var/run/docker.sock:/var/run/docker.sock nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.4-py3 /bin/bash

If you are using Docker 19.03 or newer, you don’t need the --runtime flag, only --gpus.
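
A quick way to sanity-check that the GPU is visible inside the container (same flags, just running nvidia-smi) is:

docker run --gpus all --rm --privileged nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.4-py3 nvidia-smi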

For DeepStream I use

docker run --gpus all -it --privileged --rm -v /tmp/.X11-unix:/tmp/.X11-unix -e DISPLAY=$DISPLAY -w /opt/nvidia/deepstream/deepstream-6.0 nvcr.io/nvidia/deepstream:6.0-devel
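
(With the DISPLAY pass-through I usually also have to allow local containers to reach the X server first:)

xhost +local:docker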

The packaging might change in the future, since it seems all the NVIDIA container packages are being consolidated (see Repository configuration | libnvidia-container).

Thanks for the info.
