No CUDA-capable device is detected - yolov4

Please provide the following information when requesting support.

• Hardware (T4)
• Network Type (Yolo_v4)
• TLT Version (TAO 5.0.0)
• Training spec file (if you have one, please share it here)
• How to reproduce the issue? (This is for errors. Please share the command line and the detailed log here.)

Hi, some background info on my issue:

I am trying to run NVIDIA TAO Toolkit version 5.0.0 to train a YOLOv4 model. I am running a VM on Google Cloud with an NVIDIA T4 GPU.

I followed the steps on this post: https://docs.nvidia.com/tao/tao-toolkit/text/running_in_cloud/running_tao_toolkit_on_gcp.html

I start running Jupyter from the terminal using this command:

andrewh@us-west4-t4:~$ jupyter notebook --ip 0.0.0.0 --port 8888 --allow-root --NotebookApp.token='password'

I get to step 2.3 and run the following command:

!tao model yolo_v4 dataset_convert -d $SPECS_DIR/yolo_v4_tfrecords_kitti_train.txt \
                             -o $DATA_DOWNLOAD_DIR/yolo_v4/tfrecords/train \
                             -r $USER_EXPERIMENT_DIR/

And get the following output:

2024-08-12 18:37:14,420 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2024-08-12 18:37:14,513 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
2024-08-12 18:37:14,560 [TAO Toolkit] [WARNING] nvidia_tao_cli.components.docker_handler.docker_handler 288: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/andrewh/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
2024-08-12 18:37:14,560 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
Using TensorFlow backend.
2024-08-12 18:37:17.451564: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcudart.so.12
2024-08-12 18:37:17,786 [TAO Toolkit] [WARNING] tensorflow 40: Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
2024-08-12 18:37:21,340 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable  TF_ALLOW_IOLIBS=1.
2024-08-12 18:37:21,470 [TAO Toolkit] [WARNING] tensorflow 42: TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable  TF_ALLOW_IOLIBS=1.
2024-08-12 18:37:21,489 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable  TF_ALLOW_IOLIBS=1.
2024-08-12 18:37:25,637 [TAO Toolkit] [INFO] matplotlib.font_manager 1633: generated new fontManager
2024-08-12 18:37:26,844 [TAO Toolkit] [WARNING] nvidia_tao_tf1.cv.common.export.keras_exporter 36: Failed to import TensorRT package, exporting TLT to a TensorRT engine will not be available.
Traceback (most recent call last):
  File "/usr/local/bin/yolo_v4", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v4/entrypoint/yolo_v4.py", line 12, in main
    launch_job(nvidia_tao_tf1.cv.yolo_v4.scripts, "yolo_v4", sys.argv[1:])
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/entrypoint/entrypoint.py", line 276, in launch_job
    modules = get_modules(package)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/entrypoint/entrypoint.py", line 47, in get_modules
    module = importlib.import_module(module_name)
  File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 848, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v4/scripts/export.py", line 21, in <module>
    from nvidia_tao_tf1.cv.yolo_v4.export.yolov4_exporter import YOLOv4Exporter as Exporter
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v4/export/yolov4_exporter.py", line 42, in <module>
    from nvidia_tao_tf1.cv.common.export.keras_exporter import KerasExporter as Exporter
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/export/keras_exporter.py", line 46, in <module>
    from nvidia_tao_tf1.core.export.app import get_model_input_dtype
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/core/export/app.py", line 40, in <module>
    from nvidia_tao_tf1.core.export._tensorrt import keras_to_tensorrt
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/core/export/_tensorrt.py", line 39, in <module>
    import pycuda.autoinit  # noqa pylint: disable=W0611
  File "/usr/local/lib/python3.8/dist-packages/pycuda/autoinit.py", line 5, in <module>
    cuda.init()
pycuda._driver.RuntimeError: cuInit failed: no CUDA-capable device is detected
2024-08-12 18:37:28,159 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.
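(Side note on the permissions warning near the top of that log: as far as I understand it, the launcher is asking for a "user" entry under DockerOptions in ~/.tao_mounts.json. A minimal sketch of that change, assuming the file is the standard JSON layout the launcher uses, would be:)

```python
import json

def add_user_option(mounts_path, uid, gid):
    """Add "user": "UID:GID" under DockerOptions in a .tao_mounts.json
    file, as the launcher's permissions warning suggests."""
    with open(mounts_path) as f:
        cfg = json.load(f)
    # Create DockerOptions if it is missing, then set the user mapping.
    cfg.setdefault("DockerOptions", {})["user"] = f"{uid}:{gid}"
    with open(mounts_path, "w") as f:
        json.dump(cfg, f, indent=4)
    return cfg
```

(The UID and GID would come from `id -u` and `id -g`, as the warning says.)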

nvidia-smi

Mon Aug 12 20:15:05 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.256.02   Driver Version: 470.256.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   57C    P0    28W /  70W |    514MiB / 15109MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1070      G   /usr/lib/xorg/Xorg                 67MiB |
|    0   N/A  N/A      1926      G   /usr/lib/xorg/Xorg                131MiB |
|    0   N/A  N/A      2053      G   /usr/bin/gnome-shell               27MiB |
|    0   N/A  N/A      2456      C   /usr/NX/bin/nxnode.bin            132MiB |
|    0   N/A  N/A      4758      G   /usr/lib/firefox/firefox          141MiB |
+-----------------------------------------------------------------------------+

dpkg -l | grep cuda

ii  libcudart10.1:amd64                        10.1.243-3                           amd64        NVIDIA CUDA Runtime Library
ii  nvidia-cuda-dev                            10.1.243-3                           amd64        NVIDIA CUDA development files
ii  nvidia-cuda-doc                            10.1.243-3                           all          NVIDIA CUDA and OpenCL documentation
ii  nvidia-cuda-gdb                            10.1.243-3                           amd64        NVIDIA CUDA Debugger (GDB)
ii  nvidia-cuda-toolkit                        10.1.243-3                           amd64        NVIDIA CUDA development toolkit

I’ve read the forum post here with a similar issue: No CUDA-capable device is detected on tao detectnet_v2 dataset convert - #4 by NilsAI

But I am unsure whether it applies, since I think I am running TAO in a different way than the author of that post.

Any advice on how to proceed with this issue would be much appreciated. I apologize in advance: I am very new to using Linux, so some things that may be obvious or simple may not be for me. If any more info is needed, please let me know. I am running Ubuntu 20.04.6, 64-bit.

Thanks,
Andrew

Hi @ahaselhan
Could you open a terminal in the VM and run the following?
andrewh@us-west4-t4:~$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5 /bin/bash
Then, run python.

#python
>>> import pycuda
>>> import pycuda.driver as cuda
>>> cuda.init()
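(For reference, pycuda's cuda.init() is a thin wrapper over the CUDA driver API's cuInit(). If you want to reproduce the same check without pycuda, a minimal ctypes sketch looks like this — the soname libcuda.so.1 is an assumption about a standard driver install:)

```python
import ctypes

def probe_cuda_driver():
    """Try to load the CUDA driver library and call cuInit(0),
    mirroring what pycuda.driver.init() does under the hood."""
    try:
        libcuda = ctypes.CDLL("libcuda.so.1")
    except OSError:
        return "driver library not found"
    rc = libcuda.cuInit(0)  # 0 = no flags; returns 0 (CUDA_SUCCESS) on success
    if rc != 0:
        return f"cuInit failed with error code {rc}"
    count = ctypes.c_int(0)
    libcuda.cuDeviceGetCount(ctypes.byref(count))
    return f"cuInit OK, {count.value} device(s) visible"

print(probe_cuda_driver())
```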

Thanks for your reply. Here’s the terminal output:

andrewh@us-west4-t4:~$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5 /bin/bash

=======================
=== TAO Toolkit TF1 ===
=======================

NVIDIA Release 5.0.0-TF1 (build 52693369)
TAO Toolkit Version 5.0.0

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the TAO Toolkit End User License Agreement.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/tao-toolkit-software-license-agreement
ERROR: No supported GPU(s) detected to run this container

Failed to detect NVIDIA driver version.

NOTE: The SHMEM allocation limit is set to the default of 64MB.  This may be
   insufficient for TAO Toolkit.  NVIDIA recommends the use of the following flags:
   docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...

root@e866e8f8957c:/workspace# python
Python 3.8.10 (default, Nov 14 2022, 12:59:47) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pycuda
>>> import pycuda.driver as cuda
>>> cuda.init()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
pycuda._driver.RuntimeError: cuInit failed: no CUDA-capable device is detected
>>> 

It seems that no GPU is found.
Can you reboot the VM and retry?
Also, can you try another Docker image?
$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/pytorch:22.03-py3

Same issue as before:

andrewh@us-west4-t4:~$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/pytorch:22.03-py3

=============
== PyTorch ==
=============

NVIDIA Release 22.03 (build 33569136)
PyTorch Version 1.12.0a0+2c916ef

Container image Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Copyright (c) 2014-2022 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies    (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU                      (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006      Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015      Google Inc.
Copyright (c) 2015      Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
ERROR: No supported GPU(s) detected to run this container

Failed to detect NVIDIA driver version.

NOTE: The SHMEM allocation limit is set to the default of 64MB.  This may be
   insufficient for PyTorch.  NVIDIA recommends the use of the following flags:
   docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...

The same issue happens in nvcr.io/nvidia/pytorch:22.03-py3, so it is not related to the TAO docker image.
As mentioned above, please try to reboot and retry.
Also, please try upgrading the NVIDIA driver and retry.

Uninstall:
andrewh@us-west4-t4:~$ sudo apt purge nvidia-driver-470
andrewh@us-west4-t4:~$ sudo apt autoremove
andrewh@us-west4-t4:~$ sudo apt autoclean

Install:
andrewh@us-west4-t4:~$ sudo apt install nvidia-driver-525
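(After reinstalling and rebooting, a quick host-side sanity check can be sketched in Python; the tool names here are assumptions about a standard setup with the driver and Docker on PATH:)

```python
import shutil
import subprocess

def check_host_gpu_stack():
    """Check which pieces of the host GPU stack are on PATH,
    and list GPUs if nvidia-smi is available."""
    findings = {tool: shutil.which(tool) is not None
                for tool in ("nvidia-smi", "docker")}
    if findings["nvidia-smi"]:
        # `nvidia-smi -L` prints one line per visible GPU.
        out = subprocess.run(["nvidia-smi", "-L"],
                             capture_output=True, text=True)
        findings["gpus"] = out.stdout.strip().splitlines()
    return findings

if __name__ == "__main__":
    print(check_host_gpu_stack())
```

(If nvidia-smi lists the T4 on the host but the container still fails, the problem is likely in the container runtime layer rather than the driver itself.)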

Thanks for the reply. I followed your instructions and rebooted. Ran into the same issue as before:

andrewh@us-west4-t4:~$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5 /bin/bash

=======================
=== TAO Toolkit TF1 ===
=======================

NVIDIA Release 5.0.0-TF1 (build 52693369)
TAO Toolkit Version 5.0.0

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the TAO Toolkit End User License Agreement.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/tao-toolkit-software-license-agreement
ERROR: No supported GPU(s) detected to run this container

Failed to detect NVIDIA driver version.

NOTE: The SHMEM allocation limit is set to the default of 64MB.  This may be
   insufficient for TAO Toolkit.  NVIDIA recommends the use of the following flags:
   docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...

root@1ac4ac61cb52:/workspace# python
Python 3.8.10 (default, Nov 14 2022, 12:59:47) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pycuda
>>> import pycuda.driver as cuda
>>> cuda.init()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
pycuda._driver.RuntimeError: cuInit failed: no CUDA-capable device is detected
>>> 
root@1ac4ac61cb52:/workspace# exit
(launcher) andrewh@us-west4-t4:~$ 
(launcher) andrewh@us-west4-t4:~$ 
(launcher) andrewh@us-west4-t4:~$ nvidia-smi
Thu Aug 15 04:21:19 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   75C    P0              31W /  70W |    439MiB / 15360MiB |      5%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1074      G   /usr/lib/xorg/Xorg                           59MiB |
|    0   N/A  N/A      1880      G   /usr/lib/xorg/Xorg                          123MiB |
|    0   N/A  N/A      2007      G   /usr/bin/gnome-shell                         89MiB |
|    0   N/A  N/A      2434      C   /usr/NX/bin/nxnode.bin                      152MiB |
+---------------------------------------------------------------------------------------+

Two options:

  1. Is it possible to create a new instance and retry?
  2. For the existing instance, please try to follow the "Install GPU drivers" guide in the Google Cloud Compute Engine documentation and retry.

Thank you for taking the time to help with my requests @Morganh. I ended up starting over with a new instance, and everything is working correctly now. The only step I can distinctly remember doing differently is using the command:

sudo apt-get -y install nvidia-driver-535

as opposed to

sudo apt-get -y install nvidia-driver-460

which is what was specified in the documentation.

However, there were most likely other changes that I just cannot remember.

Thanks again for your help!

Andrew

Thanks for the info. Glad to know it is working now.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.