"Unable to find image 'nvidia/cuda" While Installing TAO Toolkit

• Hardware: RTX 4060 (Laptop)
• CUDA Version: 12.2
• Nvidia Driver: 535.72
• OS: Windows 11 (22H2)
• WSL Version: 2
• Linux on WSL2: Ubuntu 20.04.6 LTS
• Python Version: 3.8.10
• Network: HeartRateNet
• TAO Version: I couldn’t install it yet. (I want to use TAO 3.0, or 4.0 if that is not possible.)
• Training spec file: I don’t have that yet.

• How to reproduce the issue ?

Hello,

First of all, this topic is continuation of a past topic

I am currently trying to follow the TAO Toolkit Quick Start Guide to train HeartRateNet on WSL2. When I enter the command:

sudo docker run --rm --gpus all nvidia/cuda:12.2 base nvidia-smi

I get the following output:

Unable to find image 'nvidia/cuda:12.2' locally
docker: Error response from daemon: manifest for nvidia/cuda:12.2 not found: manifest unknown: manifest unknown.
See 'docker run --help'.

Could you offer your assistance please?

Kindest regards.

The available 12.2 tags are listed on the nvidia/cuda - Docker Image | Docker Hub page, in the “LATEST CUDA 12.2” section.
You can use docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu20.04 nvidia-smi to verify your setup.
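
For reference, a minimal check could look like this (a sketch, assuming Docker and the NVIDIA Container Toolkit are already installed in your WSL2 Ubuntu; the tag is one of those listed on the Docker Hub page):

# pull the tagged image explicitly, then confirm the GPU is visible inside the container
docker pull nvidia/cuda:12.2.0-base-ubuntu20.04
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu20.04 nvidia-smi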

Thank you for your kind advice.

Using docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu20.04 nvidia-smi works.
But I came across another issue: according to the TAO Toolkit Quick Start Guide, the Python version should be >=3.6.9 and <3.7, while the Python version that comes with WSL2 is 3.8.10. Do I need to downgrade to 3.6.9, or will it work without issues on 3.8.10? If I need to downgrade, could you offer your guidance on how to do this please?

Kindest regards.

The video is a bit old. Please use conda to create a Python 3.7 or Python 3.8 environment.
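
For example, a minimal sketch, assuming Miniconda or Anaconda is already installed in your WSL2 Ubuntu (the environment name "launcher" is arbitrary; nvidia-tao is the launcher package named in the Quick Start Guide):

# create and activate an isolated Python 3.8 environment
conda create -n launcher python=3.8
conda activate launcher
# install the TAO launcher and Jupyter into the environment
pip install nvidia-tao jupyter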

Thank you for guidance,

I have managed to create a Python 3.8 environment and even got the notebook to run up to the !tao heartratenet dataset_convert --experiment_spec_file $DATAIO_SPEC command. But after running the command:

!tao heartratenet dataset_convert --experiment_spec_file $DATAIO_SPEC

I got this output:

HTTP request sent, awaiting response... 
2023-10-19 12:18:01,029 [INFO] root: Registry: ['nvcr.io']
2023-10-19 12:18:01,072 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5
2023-10-19 12:18:01,084 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/mericgeren/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
302 Found
Location: https://raw.githubusercontent.com/opencv/opencv/master/data/haarcascades/haarcascade_frontalface_default.xml [following]
--2023-10-19 12:18:00--  https://raw.githubusercontent.com/opencv/opencv/master/data/haarcascades/haarcascade_frontalface_default.xml
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 930127 (908K) [text/plain]
Saving to: ‘/home/mericgeren/the_tao_workspace/cv_samples_v1.4.1/heartratenet/heartratenet/data/haarcascade_frontalface_default.xml.12’

     0K .......... .......... .......... .......... ..........  5%  749K 1s
    50K .......... .......... .......... .......... .......... 11% 1.02M 1s
   100K .......... .......... .......... .......... .......... 16% 3.30M 1s
   150K .......... .......... .......... .......... .......... 22% 4.27M 0s
   200K .......... .......... .......... .......... .......... 27% 1.63M 0s
   250K .......... .......... .......... .......... .......... 33% 5.68M 0s
   300K .......... .......... .......... .......... .......... 38% 6.96M 0s
   350K .......... .......... .......... .......... .......... 44% 7.00M 0s
   400K .......... .......... .......... .......... .......... 49% 6.38M 0s
   450K .......... .......... .......... .......... .......... 55% 2.24M 0s
   500K .......... .......... .......... .......... .......... 60% 6.56M 0s
   550K .......... .......... .......... .......... .......... 66% 10.2M 0s
   600K .......... .......... .......... .......... .......... 71% 11.2M 0s
   650K .......... .......... .......... .......... .......... 77% 8.61M 0s
   700K .......... .......... .......... .......... .......... 82% 28.0M 0s
   750K .......... .......... .......... .......... .......... 88% 11.1M 0s
   800K .......... .......... .......... .......... .......... 93% 12.5M 0s
   850K .......... .......... .......... .......... .......... 99% 21.2M 0s
   900K ........                                              100%  189M=0.3s

2023-10-19 12:18:01 (3.45 MB/s) - ‘/home/mericgeren/the_tao_workspace/cv_samples_v1.4.1/heartratenet/heartratenet/data/haarcascade_frontalface_default.xml.12’ saved [930127/930127]

Using TensorFlow backend.
2023-10-19 09:18:03.192036: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
/usr/local/lib/python3.6/dist-packages/driveix/heartratenet/scripts/dataset_convert.py:62: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
[image2 @ 0x7200e00] Could find no file with path '/workspace/tao-experiments/heartratenet/data/cohface_processed/1/0/images/%04d.bmp' and index in the range 0-4
Traceback (most recent call last):
  File "</usr/local/lib/python3.6/dist-packages/driveix/heartratenet/scripts/dataset_convert.py>", line 3, in <module>
  File "<frozen driveix.heartratenet.scripts.dataset_convert>", line 70, in <module>
  File "<frozen driveix.heartratenet.scripts.dataset_convert>", line 64, in main
  File "<frozen driveix.heartratenet.dataio.generate_dataset>", line 121, in preprocess_subjects
  File "<frozen driveix.heartratenet.dataio.generate_dataset>", line 135, in preprocess_subject
  File "<frozen driveix.heartratenet.dataio.generate_dataset>", line 297, in load_subject_files
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 685, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 457, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 895, in __init__
    self._make_engine(self.engine)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 1147, in _make_engine
    self._engine = klass(self.f, **self.options)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 2293, in __init__
    memory_map=self.memory_map,
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/common.py", line 402, in _get_handle
    f = open(path_or_buf, mode, errors="replace", newline="")
FileNotFoundError: [Errno 2] No such file or directory: '/workspace/tao-experiments/heartratenet/data/cohface_processed/1/0/ground_truth.csv'
Execution status: FAIL
2023-10-19 12:18:11,377 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Later, after running the command:

!tao heartratenet train -e $TRAIN_SPEC \
                        -k $KEY \
                        -r $USER_EXPERIMENT_DIR/model

I got this output:

2023-10-19 12:18:12,721 [INFO] root: Registry: ['nvcr.io']
2023-10-19 12:18:12,764 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5
2023-10-19 12:18:12,772 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/mericgeren/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
Using TensorFlow backend.
2023-10-19 09:18:14.225673: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
Using TensorFlow backend.
/usr/local/lib/python3.6/dist-packages/driveix/heartratenet/scripts/train.py:102: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
/workspace/tao-experiments/heartratenet/model
2023-10-19 09:18:18,625 [INFO] iva.common.logging.logging: Log file already exists at /workspace/tao-experiments/heartratenet/model/status.json
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:153: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

2023-10-19 09:18:18,636 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:153: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

Traceback (most recent call last):
  File "</usr/local/lib/python3.6/dist-packages/driveix/heartratenet/scripts/train.py>", line 3, in <module>
  File "<frozen driveix.heartratenet.scripts.train>", line 158, in <module>
  File "<frozen driveix.heartratenet.scripts.train>", line 138, in main
  File "<frozen driveix.heartratenet.trainers.heartratenet_trainer>", line 101, in build
  File "<frozen driveix.heartratenet.dataloader.heartratenet_dataloader>", line 69, in __call__
  File "<frozen driveix.heartratenet.dataloader.heartratenet_dataloader>", line 237, in _get_tfrecords_iterator
  File "<frozen moduluspy.modulus.modulusobject.modulusobject>", line 432, in wrapper
  File "<frozen moduluspy.modulus.processors.tfrecords_iterator>", line 91, in __init__
ValueError: 'shuffle' is True while 'shuffle_buffer_size' is 0.
Execution status: FAIL
2023-10-19 12:18:22,264 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Here is the output of !tao info:

Configuration of the TAO Toolkit Instance
dockers: ['nvidia/tao/tao-toolkit']
format_version: 2.0
toolkit_version: 4.0.1
published_date: 03/06/2023

Could you offer your advice please?

Kindest regards.

P.S. I haven’t been able to obtain and download COHFACE yet, so I wanted to try it without COHFACE. Also, may I ask for your guidance on where to put the COHFACE dataset please?

Please double check the ~/.tao_mounts.json file.
This file maps local paths to paths inside the docker container.
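
For example, a mount entry like the following (a sketch with a hypothetical host path) means that anything under /home/<user>/tao-experiments on the host shows up under /workspace/tao-experiments inside the container, so $DATAIO_SPEC has to use the /workspace/... form:

{
    "Mounts": [
        {
            "source": "/home/<user>/tao-experiments",
            "destination": "/workspace/tao-experiments"
        }
    ]
}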

The path $DATAIO_SPEC should be a path inside the docker.
You can check it with
$ tao heartratenet run ls $DATAIO_SPEC

You can also open a terminal to debug inside the docker.
$ tao heartratenet run /bin/bash
then check the file.

Thank you for your guidance.

I have checked the ~/.tao_mounts.json file, and it seems to map local paths to docker paths:

{
    "Mounts": [
        {
            "source": "/home/mericgeren/the_tao_workspace/cv_samples_v1.4.1/heartratenet",
            "destination": "/workspace/tao-experiments"
        },
        {
            "source": "/home/mericgeren/cv_samples_v1.4.1/heartratenet/specs",
            "destination": "/workspace/tao-experiments/heartratenet/specs"
        }
    ]
}
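
I also noticed the warning about Docker running commands as root. If I add the suggested DockerOptions section, I assume it would look roughly like this (with 1000:1000 standing in for my id -u and id -g output), though I haven’t applied it yet:

    "DockerOptions": {
        "user": "1000:1000"
    }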

As for $DATAIO_SPEC, its value looks like a path inside the docker:

/workspace/tao-experiments/heartratenet/specs/heartratenet_data_generation.yaml

But when I try to run tao heartratenet run /bin/bash in the notebook, it gets stuck there. My guess is that it hands input over to another process that I can’t reach from the terminal or notebook, and it waits for input. Could you kindly offer your assistance on how I can open a terminal to debug inside the docker while running from the notebook please?

Kindest regards.

Not in the notebook. Please open a terminal instead.

Thank you for your kind advice.

I have run the command from a terminal and checked the /workspace/tao-experiments/heartratenet/specs directory. I found two files under the directory:

heartratenet_data_generation.yaml
heartratenet_tlt_pretrain.yaml

Kindest regards.

OK, so double check tao heartratenet dataset_convert again.
From the above log, it seems the dataset files are not available.
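
For example, you can check whether the paths from the error message actually exist inside the container (a sketch using the path from your log):

# should list an images/ folder and ground_truth.csv if the dataset is mounted correctly
tao heartratenet run ls /workspace/tao-experiments/heartratenet/data/cohface_processed/1/0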

Thank you so much for all the advice and guidance,

It seems the error is caused by trying to test the notebook without access to the dataset. May I ask for your guidance on creating a custom dataset for training the HeartRateNet model please?

Kindest regards.

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks

Please refer to Data Annotation Format - NVIDIA Docs
and HeartRateNet | NVIDIA NGC.
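
From the paths in your dataset_convert log, the converter appears to expect a per-subject, per-session layout roughly like the one below. This is only inferred from the error messages, so please treat the documentation linked above as authoritative:

data/cohface_processed/
    1/                           # subject id
        0/                       # session id
            images/
                0000.bmp         # per-frame images named %04d.bmp
                0001.bmp
                ...
            ground_truth.csv     # ground-truth signal read by dataset_convert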

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.