TAO v3.21.08 - pycuda._driver.LogicError: cuInit failed: system not yet initialized

Please provide the following information when requesting support.

• Hardware
NVIDIA A100-SXM4-40GB
• Network Type
YOLOv4
• TAO Version

tao info
Configuration of the TAO Toolkit Instance
dockers: ['nvidia/tao/tao-toolkit-tf', 'nvidia/tao/tao-toolkit-pyt', 'nvidia/tao/tao-toolkit-lm']
format_version: 1.0
toolkit_version: 3.21.08
published_date: 08/17/2021

Hi there,

I am trying to train YOLOv4 on an AWS P4 instance created from the NVIDIA Deep Learning Base AMI 2024.03.4-676eed8d-dcf5-4784-87d7-0de463205c17.
I expected everything to run smoothly, but that is not the case.

When trying to start training with tao yolo_v4 train, I get the following error:

tao yolo_v4 train
2024-04-23 06:08:09,250 [INFO] root: Registry: ['nvcr.io']
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-9z6ezlfr because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Using TensorFlow backend.
Traceback (most recent call last):
  File "/usr/local/bin/yolo_v4", line 8, in <module>
    sys.exit(main())
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/entrypoint/yolo_v4.py", line 12, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 256, in launch_job
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 47, in get_modules
  File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/export.py", line 8, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/export/yolov4_exporter.py", line 31, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/keras_exporter.py", line 22, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/export/_tensorrt.py", line 27, in <module>
  File "/usr/local/lib/python3.6/dist-packages/pycuda/autoinit.py", line 5, in <module>
    cuda.init()
pycuda._driver.LogicError: cuInit failed: system not yet initialized
2024-04-23 06:08:13,150 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

I've followed another topic, pycuda-driver-logicerror-cuinit-failed-system-not-yet-initialized/, and tried to run pycuda from inside the container:

docker run --gpus all --entrypoint ""  -it -v /home/ubuntu/tao/:/workspace/tao-experiments nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.08-py3 /bin/bash

But I am getting the same error:

root@fae8148ba2ce:/workspace# python
Python 3.6.9 (default, Jan 26 2021, 15:33:00) 
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pycuda
>>> import pycuda.driver as cuda
>>> cuda.init()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
pycuda._driver.LogicError: cuInit failed: system not yet initialized

I've also installed nvidia-modprobe:
sudo apt-get install nvidia-modprobe
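
(nvidia-modprobe is supposed to load the kernel module and create the /dev/nvidia* device nodes that cuInit needs. As a sanity check of my own, not something from the other topic, those nodes can be listed from Python on the host:)

>>> import glob
>>> glob.glob('/dev/nvidia*')   # expect /dev/nvidiactl, /dev/nvidia0, ... when the driver is loaded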

I ran TAO two weeks ago on another P4 EC2 instance and had no issue, so I am not sure what is going on. To install, I followed these instructions:

sudo apt install software-properties-common
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt update
sudo apt install python3.7

export VIRTUALENVWRAPPER_PYTHON=/usr/bin/python3.7
export VIRTUALENVWRAPPER_VIRTUALENV=/home/ubuntu/.local/bin/virtualenv
export WORKON_HOME=$HOME/.virtualenvs
source /home/ubuntu/.local/bin/virtualenvwrapper.sh

mkvirtualenv tao-v3.21.08
(tao-v3.21.08) pip install nvidia-pyindex
(tao-v3.21.08) pip install nvidia-tao==0.1.19

(tao-v3.21.08) python --version
Python 3.7.17
(tao-v3.21.08)  tao info
Configuration of the TAO Toolkit Instance
dockers: ['nvidia/tao/tao-toolkit-tf', 'nvidia/tao/tao-toolkit-pyt', 'nvidia/tao/tao-toolkit-lm']
format_version: 1.0
toolkit_version: 3.21.08
published_date: 08/17/2021

NVIDIA/CUDA Info:

root@3cbe58ae05b8:/workspace# nvidia-smi
Tue Apr 23 06:40:59 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:10:1C.0 Off |                    0 |
| N/A   33C    P0              44W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          On  | 00000000:10:1D.0 Off |                    0 |
| N/A   30C    P0              42W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-40GB          On  | 00000000:20:1C.0 Off |                    0 |
| N/A   31C    P0              44W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-40GB          On  | 00000000:20:1D.0 Off |                    0 |
| N/A   29C    P0              42W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM4-40GB          On  | 00000000:90:1C.0 Off |                    0 |
| N/A   32C    P0              44W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM4-40GB          On  | 00000000:90:1D.0 Off |                    0 |
| N/A   30C    P0              45W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM4-40GB          On  | 00000000:A0:1C.0 Off |                    0 |
| N/A   33C    P0              44W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM4-40GB          On  | 00000000:A0:1D.0 Off |                    0 |
| N/A   30C    P0              42W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
root@3cbe58ae05b8:/workspace# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
root@3cbe58ae05b8:/workspace# dpkg -l |grep cuda
ii  cuda-command-line-tools-11-1  11.1.1-1                            amd64        CUDA command-line tools
ii  cuda-compat-11-1              455.45.01-1                         amd64        CUDA Compatibility Platform
ii  cuda-compiler-11-1            11.1.1-1                            amd64        CUDA compiler
ii  cuda-cudart-11-1              11.1.74-1                           amd64        CUDA Runtime native Libraries
ii  cuda-cudart-dev-11-1          11.1.74-1                           amd64        CUDA Runtime native dev links, headers
ii  cuda-cuobjdump-11-1           11.1.74-1                           amd64        CUDA cuobjdump
ii  cuda-cupti-11-1               11.1.105-1                          amd64        CUDA profiling tools runtime libs.
ii  cuda-cupti-dev-11-1           11.1.105-1                          amd64        CUDA profiling tools interface.
ii  cuda-driver-dev-11-1          11.1.74-1                           amd64        CUDA Driver native dev stub library
ii  cuda-gdb-11-1                 11.1.105-1                          amd64        CUDA-GDB
ii  cuda-libraries-11-1           11.1.1-1                            amd64        CUDA Libraries 11.1 meta-package
ii  cuda-libraries-dev-11-1       11.1.1-1                            amd64        CUDA Libraries 11.1 development meta-package
ii  cuda-memcheck-11-1            11.1.105-1                          amd64        CUDA-MEMCHECK
ii  cuda-minimal-build-11-1       11.1.1-1                            amd64        Minimal CUDA 11.1 toolkit build packages.
ii  cuda-nvcc-11-1                11.1.105-1                          amd64        CUDA nvcc
ii  cuda-nvdisasm-11-1            11.1.74-1                           amd64        CUDA disassembler
ii  cuda-nvml-dev-11-1            11.1.74-1                           amd64        NVML native dev links, headers
ii  cuda-nvprof-11-1              11.1.105-1                          amd64        CUDA Profiler tools
ii  cuda-nvprune-11-1             11.1.74-1                           amd64        CUDA nvprune
ii  cuda-nvrtc-11-1               11.1.105-1                          amd64        NVRTC native runtime libraries
ii  cuda-nvrtc-dev-11-1           11.1.105-1                          amd64        NVRTC native dev links, headers
ii  cuda-nvtx-11-1                11.1.74-1                           amd64        NVIDIA Tools Extension
ii  cuda-sanitizer-11-1           11.1.105-1                          amd64        CUDA Sanitizer
hi  libcudnn8                     8.1.1.33-1+cuda11.2                 amd64        cuDNN runtime libraries
ii  libcudnn8-dev                 8.1.1.33-1+cuda11.2                 amd64        cuDNN development libraries and headers
hi  libnccl-dev                   2.7.8-1+cuda11.1                    amd64        NVIDIA Collectives Communication Library (NCCL) Development Files
hi  libnccl2                      2.7.8-1+cuda11.1                    amd64        NVIDIA Collectives Communication Library (NCCL) Runtime
ii  libnvinfer-dev                7.2.3-1+cuda11.1                    amd64        TensorRT development libraries and headers
ii  libnvinfer-plugin-dev         7.2.3-1+cuda11.1                    amd64        TensorRT plugin libraries
ii  libnvinfer-plugin7            7.2.3-1+cuda11.1                    amd64        TensorRT plugin libraries
ii  libnvinfer7                   7.2.3-1+cuda11.1                    amd64        TensorRT runtime libraries
ii  libnvonnxparsers-dev          7.2.3-1+cuda11.1                    amd64        TensorRT ONNX libraries
ii  libnvonnxparsers7             7.2.3-1+cuda11.1                    amd64        TensorRT ONNX libraries
ii  libnvparsers-dev              7.2.3-1+cuda11.1                    amd64        TensorRT parsers libraries
ii  libnvparsers7                 7.2.3-1+cuda11.1                    amd64        TensorRT parsers libraries

Any idea? Thanks for the help.

I've actually fixed the issue by installing the CUDA drivers:
sudo apt-get install cuda-drivers-535
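
For anyone hitting the same thing, the same pycuda check from inside the container can be used to verify the fix (a minimal sketch; the device count and name just reflect the nvidia-smi output above):

>>> import pycuda.driver as cuda
>>> cuda.init()            # no longer raises LogicError after installing cuda-drivers-535
>>> cuda.Device.count()    # 8 on this p4 instance
>>> cuda.Device(0).name()  # 'NVIDIA A100-SXM4-40GB'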

Aren't they supposed to be installed already when using the NVIDIA Deep Learning AMI?

Thanks for the info. Usually this kind of error can be fixed by reinstalling the NVIDIA driver:

sudo apt purge nvidia-driver-535
sudo apt autoremove
sudo apt autoclean
sudo apt install nvidia-driver-535

Isn't the goal of using the NVIDIA AMI to avoid having to do those steps, though?

The AMI should have installed the driver already. You can launch a new instance and double-check with:
$ python

>>> import pycuda
>>> import pycuda.driver as cuda
>>> cuda.init()

I've tried twice to launch the EC2 instance with the NVIDIA AMI, and got the same CUDA init error both times.
Was the AMI updated in the past two weeks? Does it need to be fixed?

So, just to confirm: there is no issue on the P4 EC2 instance itself, but the error always occurs with the NVIDIA AMI, right? Could you compare the info below from both setups? Thanks a lot!
$ nvidia-smi
$ dpkg -l |grep cuda

I am not sure if there is a problem with the NVIDIA AMI.
Below is exactly what I did:

  1. Two weeks ago: I launched an EC2 P4 instance from the NVIDIA AMI, installed TAO, and ran a training without any problem.
  2. Yesterday:
    a. I launched an EC2 P4 instance from the NVIDIA AMI, installed TAO, and got the CUDA init error.
    b. I launched another EC2 P4 instance from the NVIDIA AMI, installed TAO, and got the CUDA init error.
    c. Then, on that second EC2 instance, I installed the CUDA drivers, which fixed the CUDA init error, and ran a training.

Thanks for the info. Indeed, it is not certain that the problem comes from the NVIDIA AMI.
Next time, you can run the commands below directly to check:
$ nvidia-smi
$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5 /bin/bash
Then, inside the docker container, run:

$ python
>>> import pycuda
>>> import pycuda.driver as cuda
>>> cuda.init()

Hi Morgan,
Yes, that's what I did: running the container and testing cuda.init() in Python.
But I was getting the same initialisation error.

Do you know if the AMI has been updated recently? Is it possible to know the last release date?

For more info about the AMI, you can refer to NGC on AWS Virtual Machines - NVIDIA Docs.
