No CUDA-capable device is detected - yolov4

Please provide the following information when requesting support.

• Hardware (T4)
• Network Type (Yolo_v4)
• TLT Version (TAO 5.0.0)
• Training spec file (if you have one, please share it here)
• How to reproduce the issue? (This is for errors. Please share the command line and the detailed log here.)

Hi, some background info on my issue:

I am trying to run NVIDIA TAO Toolkit version 5.0.0 to train a YOLOv4 model. I am running a VM on Google Cloud with an NVIDIA T4 GPU.

I followed the steps on this post: https://docs.nvidia.com/tao/tao-toolkit/text/running_in_cloud/running_tao_toolkit_on_gcp.html

I start running Jupyter from the terminal using this command:

andrewh@us-west4-t4:~$ jupyter notebook --ip 0.0.0.0 --port 8888 --allow-root --NotebookApp.token='password'

I get to step 2.3 and run the following command:

!tao model yolo_v4 dataset_convert -d $SPECS_DIR/yolo_v4_tfrecords_kitti_train.txt \
                             -o $DATA_DOWNLOAD_DIR/yolo_v4/tfrecords/train \
                             -r $USER_EXPERIMENT_DIR/

And get the following output:

2024-08-12 18:37:14,420 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2024-08-12 18:37:14,513 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
2024-08-12 18:37:14,560 [TAO Toolkit] [WARNING] nvidia_tao_cli.components.docker_handler.docker_handler 288: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/andrewh/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
2024-08-12 18:37:14,560 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
Using TensorFlow backend.
2024-08-12 18:37:17.451564: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcudart.so.12
2024-08-12 18:37:17,786 [TAO Toolkit] [WARNING] tensorflow 40: Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
2024-08-12 18:37:21,340 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable  TF_ALLOW_IOLIBS=1.
2024-08-12 18:37:21,470 [TAO Toolkit] [WARNING] tensorflow 42: TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable  TF_ALLOW_IOLIBS=1.
2024-08-12 18:37:21,489 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable  TF_ALLOW_IOLIBS=1.
2024-08-12 18:37:25,637 [TAO Toolkit] [INFO] matplotlib.font_manager 1633: generated new fontManager
2024-08-12 18:37:26,844 [TAO Toolkit] [WARNING] nvidia_tao_tf1.cv.common.export.keras_exporter 36: Failed to import TensorRT package, exporting TLT to a TensorRT engine will not be available.
Traceback (most recent call last):
  File "/usr/local/bin/yolo_v4", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v4/entrypoint/yolo_v4.py", line 12, in main
    launch_job(nvidia_tao_tf1.cv.yolo_v4.scripts, "yolo_v4", sys.argv[1:])
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/entrypoint/entrypoint.py", line 276, in launch_job
    modules = get_modules(package)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/entrypoint/entrypoint.py", line 47, in get_modules
    module = importlib.import_module(module_name)
  File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 848, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v4/scripts/export.py", line 21, in <module>
    from nvidia_tao_tf1.cv.yolo_v4.export.yolov4_exporter import YOLOv4Exporter as Exporter
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v4/export/yolov4_exporter.py", line 42, in <module>
    from nvidia_tao_tf1.cv.common.export.keras_exporter import KerasExporter as Exporter
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/export/keras_exporter.py", line 46, in <module>
    from nvidia_tao_tf1.core.export.app import get_model_input_dtype
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/core/export/app.py", line 40, in <module>
    from nvidia_tao_tf1.core.export._tensorrt import keras_to_tensorrt
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/core/export/_tensorrt.py", line 39, in <module>
    import pycuda.autoinit  # noqa pylint: disable=W0611
  File "/usr/local/lib/python3.8/dist-packages/pycuda/autoinit.py", line 5, in <module>
    cuda.init()
pycuda._driver.RuntimeError: cuInit failed: no CUDA-capable device is detected
2024-08-12 18:37:28,159 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.
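(Side note on the permissions warning near the top of that log: as far as I understand it, the launcher is asking for a "user" entry under DockerOptions in ~/.tao_mounts.json. A minimal sketch of that change, assuming the file is the standard JSON layout the launcher uses, would be:)

```python
import json

def add_user_option(mounts_path, uid, gid):
    """Add "user": "UID:GID" under DockerOptions in a .tao_mounts.json
    file, as the launcher's permissions warning suggests."""
    with open(mounts_path) as f:
        cfg = json.load(f)
    # Create DockerOptions if it is missing, then set the user mapping.
    cfg.setdefault("DockerOptions", {})["user"] = f"{uid}:{gid}"
    with open(mounts_path, "w") as f:
        json.dump(cfg, f, indent=4)
    return cfg
```

(The UID and GID would come from `id -u` and `id -g`, as the warning says.)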

nvidia-smi

Mon Aug 12 20:15:05 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.256.02   Driver Version: 470.256.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   57C    P0    28W /  70W |    514MiB / 15109MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1070      G   /usr/lib/xorg/Xorg                 67MiB |
|    0   N/A  N/A      1926      G   /usr/lib/xorg/Xorg                131MiB |
|    0   N/A  N/A      2053      G   /usr/bin/gnome-shell               27MiB |
|    0   N/A  N/A      2456      C   /usr/NX/bin/nxnode.bin            132MiB |
|    0   N/A  N/A      4758      G   /usr/lib/firefox/firefox          141MiB |
+-----------------------------------------------------------------------------+

dpkg -l | grep cuda

ii  libcudart10.1:amd64                        10.1.243-3                           amd64        NVIDIA CUDA Runtime Library
ii  nvidia-cuda-dev                            10.1.243-3                           amd64        NVIDIA CUDA development files
ii  nvidia-cuda-doc                            10.1.243-3                           all          NVIDIA CUDA and OpenCL documentation
ii  nvidia-cuda-gdb                            10.1.243-3                           amd64        NVIDIA CUDA Debugger (GDB)
ii  nvidia-cuda-toolkit                        10.1.243-3                           amd64        NVIDIA CUDA development toolkit

I’ve read the forum post here with a similar issue: No CUDA-capable device is detected on tao detectnet_v2 dataset convert - #4 by NilsAI

But I am unsure whether it applies, since I think I am running TAO in a different way than the author of that post.

Any advice on how to proceed with this issue would be much appreciated. I apologize in advance: I am very new to using Linux, so some things that may be obvious or simple may not be for me. If any more info is needed, please let me know. I am running Ubuntu 20.04.6, 64-bit.

Thanks,
Andrew

Hi @ahaselhan
Could you open a terminal in the VM and run the following?
andrewh@us-west4-t4:~$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5 /bin/bash
Then, run python.

#python
>>> import pycuda
>>> import pycuda.driver as cuda
>>> cuda.init()
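(For reference, pycuda's cuda.init() is a thin wrapper over the CUDA driver API's cuInit(). If you want to reproduce the same check without pycuda, a minimal ctypes sketch looks like this — the soname libcuda.so.1 is an assumption about a standard driver install:)

```python
import ctypes

def probe_cuda_driver():
    """Try to load the CUDA driver library and call cuInit(0),
    mirroring what pycuda.driver.init() does under the hood."""
    try:
        libcuda = ctypes.CDLL("libcuda.so.1")
    except OSError:
        return "driver library not found"
    rc = libcuda.cuInit(0)  # 0 = no flags; returns 0 (CUDA_SUCCESS) on success
    if rc != 0:
        return f"cuInit failed with error code {rc}"
    count = ctypes.c_int(0)
    libcuda.cuDeviceGetCount(ctypes.byref(count))
    return f"cuInit OK, {count.value} device(s) visible"

print(probe_cuda_driver())
```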

Thanks for your reply. Here’s the terminal output:

andrewh@us-west4-t4:~$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5 /bin/bash

=======================
=== TAO Toolkit TF1 ===
=======================

NVIDIA Release 5.0.0-TF1 (build 52693369)
TAO Toolkit Version 5.0.0

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the TAO Toolkit End User License Agreement.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/tao-toolkit-software-license-agreement
ERROR: No supported GPU(s) detected to run this container

Failed to detect NVIDIA driver version.

NOTE: The SHMEM allocation limit is set to the default of 64MB.  This may be
   insufficient for TAO Toolkit.  NVIDIA recommends the use of the following flags:
   docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...

root@e866e8f8957c:/workspace# python
Python 3.8.10 (default, Nov 14 2022, 12:59:47) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pycuda
>>> import pycuda.driver as cuda
>>> cuda.init()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
pycuda._driver.RuntimeError: cuInit failed: no CUDA-capable device is detected
>>> 

It seems that no GPU is found.
Can you reboot the VM and retry?
Also, can you try another Docker image?
$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/pytorch:22.03-py3

Same issue as before:

andrewh@us-west4-t4:~$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/pytorch:22.03-py3

=============
== PyTorch ==
=============

NVIDIA Release 22.03 (build 33569136)
PyTorch Version 1.12.0a0+2c916ef

Container image Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Copyright (c) 2014-2022 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies    (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU                      (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006      Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015      Google Inc.
Copyright (c) 2015      Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
ERROR: No supported GPU(s) detected to run this container

Failed to detect NVIDIA driver version.

NOTE: The SHMEM allocation limit is set to the default of 64MB.  This may be
   insufficient for PyTorch.  NVIDIA recommends the use of the following flags:
   docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...

The same issue happens in nvcr.io/nvidia/pytorch:22.03-py3, so it is not related to the TAO docker image.
As mentioned above, please try to reboot and retry.
Also, please try upgrading the NVIDIA driver and retry.

Uninstall:
andrewh@us-west4-t4:~$ sudo apt purge nvidia-driver-470
andrewh@us-west4-t4:~$ sudo apt autoremove
andrewh@us-west4-t4:~$ sudo apt autoclean

Install:
andrewh@us-west4-t4:~$ sudo apt install nvidia-driver-525
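(After reinstalling and rebooting, a quick host-side sanity check can be sketched in Python; the tool names here are assumptions about a standard setup with the driver and Docker on PATH:)

```python
import shutil
import subprocess

def check_host_gpu_stack():
    """Check which pieces of the host GPU stack are on PATH,
    and list GPUs if nvidia-smi is available."""
    findings = {tool: shutil.which(tool) is not None
                for tool in ("nvidia-smi", "docker")}
    if findings["nvidia-smi"]:
        # `nvidia-smi -L` prints one line per visible GPU.
        out = subprocess.run(["nvidia-smi", "-L"],
                             capture_output=True, text=True)
        findings["gpus"] = out.stdout.strip().splitlines()
    return findings

if __name__ == "__main__":
    print(check_host_gpu_stack())
```

(If nvidia-smi lists the T4 on the host but the container still fails, the problem is likely in the container runtime layer rather than the driver itself.)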

Thanks for the reply. I followed your instructions and rebooted. Ran into the same issue as before:

andrewh@us-west4-t4:~$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5 /bin/bash

=======================
=== TAO Toolkit TF1 ===
=======================

NVIDIA Release 5.0.0-TF1 (build 52693369)
TAO Toolkit Version 5.0.0

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the TAO Toolkit End User License Agreement.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/tao-toolkit-software-license-agreement
ERROR: No supported GPU(s) detected to run this container

Failed to detect NVIDIA driver version.

NOTE: The SHMEM allocation limit is set to the default of 64MB.  This may be
   insufficient for TAO Toolkit.  NVIDIA recommends the use of the following flags:
   docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...

root@1ac4ac61cb52:/workspace# python
Python 3.8.10 (default, Nov 14 2022, 12:59:47) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pycuda
>>> import pycuda.driver as cuda
>>> cuda.init()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
pycuda._driver.RuntimeError: cuInit failed: no CUDA-capable device is detected
>>> 
root@1ac4ac61cb52:/workspace# exit
(launcher) andrewh@us-west4-t4:~$ 
(launcher) andrewh@us-west4-t4:~$ 
(launcher) andrewh@us-west4-t4:~$ nvidia-smi
Thu Aug 15 04:21:19 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   75C    P0              31W /  70W |    439MiB / 15360MiB |      5%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1074      G   /usr/lib/xorg/Xorg                           59MiB |
|    0   N/A  N/A      1880      G   /usr/lib/xorg/Xorg                          123MiB |
|    0   N/A  N/A      2007      G   /usr/bin/gnome-shell                         89MiB |
|    0   N/A  N/A      2434      C   /usr/NX/bin/nxnode.bin                      152MiB |
+---------------------------------------------------------------------------------------+

Two options:

  1. Is it possible to create a new instance and retry?
  2. For the existing instance, please try to follow the "Install GPU drivers" guide in the Google Cloud Compute Engine documentation and retry.

Thank you for taking the time to help with my requests @Morganh. I ended up starting over with a new instance, and everything is working correctly now. The only step I can distinctly remember doing differently is using the command:

sudo apt-get -y install nvidia-driver-535

as opposed to

sudo apt-get -y install nvidia-driver-460

which is what was specified in the documentation.

However, there were most likely other changes that I just cannot remember.

Thanks again for your help!

Andrew

Thanks for the info. Glad to know it is working now.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.