Can't run the provided TAO toolkit sample code

Hi, I found and tried the following sample code on the page: TAO toolkit Getting Started

However, I wasn't able to start training: the error message said “No module named tensorflow”.

I checked whether tensorflow was installed by running !pip list, and it was there.
However, after running the following code from part 2.3, tensorflow “disappeared”: it no longer showed up in the !pip list output. Some other packages, such as torch, were also missing.

import os
if os.environ["GOOGLE_COLAB"] == "1":
    os.environ["bash_script"] = "setup_env.sh"
else:
    os.environ["bash_script"] = "setup_env_desktop.sh"

!sed -i "s|PATH_TO_COLAB_NOTEBOOKS|$COLAB_NOTEBOOKS_PATH|g" $COLAB_NOTEBOOKS_PATH/tensorflow/$bash_script

!sh $COLAB_NOTEBOOKS_PATH/tensorflow/$bash_script

Do you have any idea? Thanks.
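One plausible explanation, assuming setup_env.sh relinks `/usr/bin/python3` to Python 3.6 (the copy of the script posted later in this thread runs `ln -sf /usr/bin/python3.6 /usr/bin/python3`): after the cell runs, `!pip list` and later shell commands may use a different interpreter from the one tensorflow was installed into, so its packages seem to vanish. The minimal sketch below reports the notebook kernel's interpreter and the real path that `python3` and `pip` resolve to on PATH; `resolve_interpreter` is a hypothetical helper name.

```python
import os
import shutil
import sys
from typing import Optional


def resolve_interpreter(name: str = "python3", path: Optional[str] = None) -> Optional[str]:
    """Return the real (symlink-resolved) path of `name` as found on PATH, or None."""
    found = shutil.which(name, path=path)
    if found is None:
        return None
    # Follow symlinks such as /usr/bin/python3 -> /usr/bin/python3.6
    return os.path.realpath(found)


if __name__ == "__main__":
    # The interpreter running this notebook cell (where `import tensorflow` is attempted)
    print("kernel interpreter:", os.path.realpath(sys.executable))
    # The executables that shell commands such as `!python3 ...` or `!pip list` would start
    print("python3 on PATH:   ", resolve_interpreter("python3"))
    print("pip on PATH:       ", resolve_interpreter("pip"))
```

If the kernel interpreter and the `python3` on PATH differ after the setup cell, packages installed for one will not be visible to the other.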

This has been a known issue since last week. Please refer to the workaround mentioned in Running tao toolkit in google colab - #17 by Morganh. Thanks.


TAO Toolkit 5.0 was released yesterday.
Today I tried NVIDIA's sample code again, and many of the installation steps failed:

sudo: unable to execute /usr/bin/add-apt-repository: No such file or directory
Hit:1 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:2 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:4 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:5 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:6 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:7 https://ppa.launchpadcontent.net/c2d4u.team/c2d4u4.0+/ubuntu jammy InRelease
Hit:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:9 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:10 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Reading package lists... Done
W: https://cloud.r-project.org/bin/linux/ubuntu/jammy-cran40/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Note, selecting 'libcasa-python3-6' for regex 'python3.6'
Note, selecting 'libpython3.6-stdlib' for regex 'python3.6'
Note, selecting 'python3.6-2to3' for regex 'python3.6'
libcasa-python3-6 is already the newest version (3.4.0-2build1).
0 upgraded, 0 newly installed, 0 to remove and 16 not upgraded.
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
python3-pip is already the newest version (22.0.2+dfsg-1ubuntu0.3).
0 upgraded, 0 newly installed, 0 to remove and 16 not upgraded.
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
E: Unable to locate package python3.6-distutils
E: Couldn't find any package by glob 'python3.6-distutils'
E: Couldn't find any package by regex 'python3.6-distutils'
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
E: Unable to locate package python3.6-dev
E: Couldn't find any package by glob 'python3.6-dev'
E: Couldn't find any package by regex 'python3.6-dev'
rm: cannot remove '/usr/bin/python': No such file or directory
/content/drive/MyDrive/nvidia-tao/tensorflow/setup_env.sh: 16: python3.6: not found
/content/drive/MyDrive/nvidia-tao/tensorflow/setup_env.sh: 17: python3.6: not found
/content/drive/MyDrive/nvidia-tao/tensorflow/setup_env.sh: 18: python3.6: not found
/content/drive/MyDrive/nvidia-tao/tensorflow/setup_env.sh: 21: python3.6: not found
/content/drive/MyDrive/nvidia-tao/tensorflow/setup_env.sh: 22: python3.6: not found
--2023-07-21 05:20:49--  https://github.com/Kitware/CMake/releases/download/v3.14.4/cmake-3.14.4-Linux-x86_64.sh
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/537699/fc11d880-7650-11e9-969f-7442127f007a?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20230721%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230721T052050Z&X-Amz-Expires=300&X-Amz-Signature=3a9860d6ded5f4e74aea8ed0a5081542a1fa9d6322064572ec50749fc8bd6dce&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=537699&response-content-disposition=attachment%3B%20filename%3Dcmake-3.14.4-Linux-x86_64.sh&response-content-type=application%2Foctet-stream [following]
--2023-07-21 05:20:50--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/537699/fc11d880-7650-11e9-969f-7442127f007a?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20230721%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230721T052050Z&X-Amz-Expires=300&X-Amz-Signature=3a9860d6ded5f4e74aea8ed0a5081542a1fa9d6322064572ec50749fc8bd6dce&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=537699&response-content-disposition=attachment%3B%20filename%3Dcmake-3.14.4-Linux-x86_64.sh&response-content-type=application%2Foctet-stream
Resolving objects.githubusercontent.com (objects.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.109.133, ...
Connecting to objects.githubusercontent.com (objects.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 37196929 (35M) [application/octet-stream]
Saving to: ‘cmake-3.14.4-Linux-x86_64.sh’

cmake-3.14.4-Linux- 100%[===================>]  35.47M   196MB/s    in 0.2s    

2023-07-21 05:20:50 (196 MB/s) - ‘cmake-3.14.4-Linux-x86_64.sh’ saved [37196929/37196929]

CMake Installer Version: 3.14.4, Copyright (c) Kitware
This is a self-extracting archive.
The archive will be extracted to: /usr/local

Using target directory: /usr/local
Extracting, please wait...

Unpacking finished successfully
/content/drive/MyDrive/nvidia-tao/tensorflow/setup_env.sh: 32: python3.6: not found
/content/drive/MyDrive/nvidia-tao/tensorflow/setup_env.sh: 33: python3.6: not found
/content/drive/MyDrive/nvidia-tao/tensorflow/setup_env.sh: 34: python3.6: not found
/content/drive/MyDrive/nvidia-tao/tensorflow/setup_env.sh: 37: python3.6: not found

I'm not sure whether this is related to the latest update.

Please wait for the official announcement. The Colab notebooks at https://github.com/NVIDIA-AI-IOT/nvidia-tao/tree/main/tensorflow have not been updated to the new version yet.

Traceback (most recent call last):
  File "/usr/bin/add-apt-repository", line 363, in <module>
    addaptrepo = AddAptRepository()
  File "/usr/bin/add-apt-repository", line 41, in __init__
    self.distro.get_sources(self.sourceslist)
  File "/usr/local/lib/python3.10/dist-packages/aptsources/distro.py", line 91, in get_sources
    raise NoDistroTemplateException(
aptsources.distro.NoDistroTemplateException: Error: could not find a distribution template for Ubuntu/jammy

I started getting this error today; yesterday the notebook ran, and I made no changes.

Please double-check, or restart the Colab notebook.
There has been no change to the Colab notebooks at https://github.com/NVIDIA-AI-IOT/nvidia-tao/tree/main/tensorflow

Can you add a cell with the following command to check whether the error occurs?
! sudo add-apt-repository ppa:deadsnakes/ppa

I can reproduce it as well and need to check further. It is not related to TAO.

Well, I'm not sure whether this issue only affects sample code that uses TensorFlow.
I couldn't run either the Multi-class Image Classification or the Multi-task Image Classification sample, both of which use TensorFlow, while I could run ActionRecognitionNet, which uses PyTorch.

The TAO team is working on this Colab issue. We will update you if there is any news. Thanks.


The issue is caused by Colab's OS update to Ubuntu 22.04: add-apt-repository fails in Ubuntu 22.04 but works fine with 'fallback runtime version' · Issue #3867 · googlecolab/colabtools · GitHub
Please use the workaround below:
Click Tools -> Command Palette -> Use fallback runtime version.

After that, the following command works:
!sudo add-apt-repository ppa:deadsnakes/ppa -y
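To confirm which Ubuntu release a runtime is on before running the setup script, a cell like the one below can read the standard `/etc/os-release` file. This is a minimal sketch; the 22.04 check is an assumption drawn from this thread (the linked colabtools issue, and the earlier log where the python3.6 packages could not be found on jammy), not an official TAO check, and the function names are hypothetical.

```python
from typing import Dict


def parse_os_release(text: str) -> Dict[str, str]:
    """Parse KEY=value lines from /etc/os-release into a dict, dropping surrounding quotes."""
    info: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key] = value.strip().strip("\"'")
    return info


def is_ubuntu_2204(info: Dict[str, str]) -> bool:
    # Assumption from this thread: 22.04 (jammy) is the release where the setup script breaks
    return info.get("ID") == "ubuntu" and info.get("VERSION_ID") == "22.04"


if __name__ == "__main__":
    try:
        with open("/etc/os-release") as f:
            release = parse_os_release(f.read())
    except FileNotFoundError:
        release = {}
    print("Detected:", release.get("PRETTY_NAME", "unknown"))
    if is_ubuntu_2204(release):
        print("Ubuntu 22.04 runtime: consider Tools -> Command Palette -> Use fallback runtime version")
```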


That worked, though it seems I had to install a few more modules to make it run correctly.

Would you be kind enough to tell me what other requirements need to be installed? I seem to be having real trouble setting up the proper environment.

Thank you kindly

@silentjcr Could you share the steps for “a few more modules” you mentioned above?
@castej10 Could you save the .ipynb as an .html file and upload it here?

Hello @Morganh, I realized I wasn't using the most up-to-date GitHub repository, so I deleted the old copy from my Google Drive and re-downloaded it. That fixed many issues, but I still get some errors, such as:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires requests~=2.21.0, but you have requests 2.27.1 which is incompatible.
google-colab 1.0.0 requires six~=1.12.0, but you have six 1.15.0 which is incompatible.

and

Installing collected packages: zipp, typing-extensions, six, ipython-genutils, decorator, traitlets, setuptools, pyrsistent, importlib-metadata, attrs, wcwidth, tornado, pyzmq, python-dateutil, pyparsing, pycparser, ptyprocess, parso, nest-asyncio, jupyter-core, jsonschema, entrypoints, webencodings, pygments, prompt-toolkit, pickleshare, pexpect, packaging, nbformat, MarkupSafe, jupyter-client, jedi, cffi, backcall, async-generator, testpath, pandocfilters, nbclient, mistune, jupyterlab-pygments, jinja2, ipython, defusedxml, dataclasses, bleach, argon2-cffi-bindings, terminado, Send2Trash, prometheus-client, numpy, nbconvert, ipykernel, argon2-cffi, urllib3, smmap, protobuf, notebook, jmespath, h5py, docutils, widgetsnbextension, termcolor, scipy, qtpy, PyYAML, platformdirs, Pillow, orderedmultidict, onnx, kiwisolver, keras-preprocessing, keras-applications, jupyterlab-widgets, idna, gitdb, cycler, chardet, certifi, botocore, uritemplate, tifffile, threadpoolctl, shortuuid, setproctitle, sentry-sdk, s3transfer, requests, qtconsole, PyWavelets, pytz, pytools, pyjwt, psutil, promise, pathtools, pathlib2, onnxconverter-common, networkx, matplotlib, mako, llvmlite, keras, jupyter-console, joblib, ipywidgets, imageio, GitPython, future, furl, flatbuffers, fire, docker-pycreds, cython, Click, appdirs, xmltodict, wandb, uplink, uff, tqdm, toposort, tf2onnx, tabulate, simplejson, shapely, semver, seaborn, scikit-learn, scikit-image, retrying, requests-toolbelt, recordclass, pycuda, pycocotools-fix, pyarrow, prettytable, posix-ipc, pandas, opencv-python, onnxruntime, onnx-graphsurgeon, nvidia-ml-py, numba, mpi4py, keras2onnx, keras-metrics, jupyter, grpcio, graphsurgeon, DLLogger, cryptography, clearml, boto3, argparse, argcomplete, addict
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
nvidia-tensorflow 1.15.4+nv20.10 requires numpy<1.19.0,>=1.16.0, but you have numpy 1.19.4 which is incompatible.
nvidia-tao 4.0.0 requires idna==2.10, but you have idna 2.7 which is incompatible.
nvidia-tao 4.0.0 requires six==1.15.0, but you have six 1.13.0 which is incompatible.
nvidia-tao 4.0.0 requires tabulate==0.8.7, but you have tabulate 0.7.5 which is incompatible.
nvidia-tao 4.0.0 requires urllib3>=1.26.5, but you have urllib3 1.24.3 which is incompatible.
google-colab 1.0.0 requires ipykernel~=4.6.0, but you have ipykernel 5.5.6 which is incompatible.
google-colab 1.0.0 requires ipython~=5.5.0, but you have ipython 7.16.3 which is incompatible.
google-colab 1.0.0 requires notebook~=5.2.0, but you have notebook 6.4.10 which is incompatible.
google-colab 1.0.0 requires pandas~=0.24.0, but you have pandas 0.25.3 which is incompatible.
google-colab 1.0.0 requires requests~=2.21.0, but you have requests 2.20.1 which is incompatible.
google-colab 1.0.0 requires six~=1.12.0, but you have six 1.13.0 which is incompatible.
google-colab 1.0.0 requires tornado~=4.5.0, but you have tornado 6.1 which is incompatible.
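These conflicts are easier to read when grouped by the dependency they constrain. For example, `six` is pinned to `==1.15.0` by nvidia-tao 4.0.0 and to `~=1.12.0` by google-colab 1.0.0, and no single version of six satisfies both pins. The minimal sketch below parses pip's conflict lines with a regular expression; it assumes the exact wording shown in this log, which may differ across pip versions, and `parse_conflicts` and `by_dependency` are hypothetical names.

```python
import re
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Conflict(NamedTuple):
    package: str          # package declaring the requirement, e.g. google-colab
    package_version: str
    requirement: str      # e.g. six~=1.12.0
    dependency: str       # installed package that violates it, e.g. six
    installed: str        # version actually installed


# Matches lines like:
# "google-colab 1.0.0 requires six~=1.12.0, but you have six 1.15.0 which is incompatible."
_CONFLICT_RE = re.compile(
    r"(?P<package>\S+) (?P<package_version>\S+) requires (?P<requirement>.+?), "
    r"but you have (?P<dependency>\S+) (?P<installed>\S+) which is incompatible\.$"
)


def parse_conflicts(log: str) -> List[Conflict]:
    """Extract pip dependency-conflict lines from a pip install log."""
    conflicts = []
    for line in log.splitlines():
        match = _CONFLICT_RE.match(line.strip())
        if match:
            conflicts.append(Conflict(**match.groupdict()))
    return conflicts


def by_dependency(conflicts: List[Conflict]) -> Dict[str, List[str]]:
    """Group conflicting requirements by the dependency they constrain."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for c in conflicts:
        grouped[c.dependency].append(f"{c.package} needs {c.requirement}")
    return dict(grouped)
```

Passing the second log above through `by_dependency(parse_conflicts(log))` would, for instance, list both the nvidia-tao and the google-colab pins under `six`.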

Here is the .html file:

ssd.html (838.3 KB)

Sorry for the late reply. I wasn't able to view the content while I was off work.

I modified setup_env.sh by removing most of the ‘python3.6’ prefixes in front of the “pip install” commands.

#!/bin/sh

# Install Python 3.6 as the default version
sudo add-apt-repository ppa:deadsnakes/ppa -y
sudo apt-get update
sudo apt-get install python3.6 -y
apt install python3-pip -y
apt-get install python3.6-distutils -y
apt-get install python3.6-dev -y

# Set Python 3.6 as the default version
rm -f /usr/bin/python
ln -sf /usr/bin/python3.6 /usr/bin/python3
ln -sf /usr/bin/python3.6 /usr/local/bin/python

pip install --upgrade pip
pip install google-colab
pip install nvidia-pyindex
pip install cython==0.27.3

# Install Tensorflow
pip install https://developer.download.nvidia.com/compute/redist/nvidia-horovod/nvidia_horovod-0.20.0+nv20.10-cp36-cp36m-linux_x86_64.whl
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist nvidia-tensorflow==1.15.4+nv20.10

# Install Cmake
cd /tmp
wget https://github.com/Kitware/CMake/releases/download/v3.14.4/cmake-3.14.4-Linux-x86_64.sh
chmod +x cmake-3.14.4-Linux-x86_64.sh
./cmake-3.14.4-Linux-x86_64.sh --prefix=/usr/local --exclude-subdir --skip-license
rm ./cmake-3.14.4-Linux-x86_64.sh

# Install dependencies
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist nvidia-eff==0.5.3
pip install nvidia-tao==4.0.0
pip install --ignore-installed PyYAML -r /content/drive/MyDrive/nvidia-tao/tensorflow/requirements-pip.txt -f https://download.pytorch.org/whl/torch_stable.html --extra-index-url https://developer.download.nvidia.com/compute/redist

# Install code related wheels
pip install nvidia-tao-tf1==4.0.0.657.dev0

Aside from that, I also added a cell with the following commands, since I still couldn't run tao correctly without them. I could probably have added them to setup_env.sh instead.

!pip install opencv-python==4.4.0.44
!pip install numba
!pip install clearml
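After installing extra packages like these, a quick import check can show whether anything is still missing. The minimal sketch below uses `importlib.util.find_spec`, which locates a top-level module without importing it; the module list (import names, not pip names) is an assumption based on the packages mentioned in this thread, and `missing_modules` is a hypothetical helper. It only checks the interpreter running the cell, which may not be the one that `tao` itself launches.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that the current interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names for the packages discussed above (assumed mapping):
# opencv-python -> cv2, numba -> numba, clearml -> clearml, nvidia-tensorflow -> tensorflow
REQUIRED = ["cv2", "numba", "clearml", "tensorflow"]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    print("missing modules:", ", ".join(missing) if missing else "none")
```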

Thanks @silentjcr !
@castej10 Could you please check whether the above approach works on your side? Thanks.

Well, now it seems to be PyTorch's turn…
I tried to train my model using the ActionRecognitionNet sample code, and tao train didn't complete successfully.

Train RGB only model with PTM
[NeMo W 2023-07-31 10:07:18 optimizers:55] Apex was not found. Using the lamb or fused_adam optimizer will error out.
[NeMo W 2023-07-31 10:07:23 optimizers:55] Apex was not found. Using the lamb or fused_adam optimizer will error out.
[NeMo W 2023-07-31 10:07:23 nemo_logging:349] /usr/local/lib/python3.7/dist-packages/nvidia_tao_pytorch/cv/action_recognition/scripts/train.py:81: UserWarning: 
    'train_rgb_3d_finetune.yaml' is validated against ConfigStore schema with the same name.
    This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
    See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
    
Created a temporary directory at /tmp/tmp57si20o_
Writing /tmp/tmp57si20o_/_remote_module_non_scriptable.py
loading trained weights from /content/results/pretrained/actionrecognitionnet_vtrainable_v1.0/resnet18_3d_rgb_hmdb5_32.tlt
Error executing job with overrides: ['output_dir=/content/results/rgb_3d_ptm', 'encryption_key=nvidia_tao', 'model_config.rgb_pretrained_model_path=/content/results/pretrained/actionrecognitionnet_vtrainable_v1.0/resnet18_3d_rgb_hmdb5_32.tlt', 'model_config.rgb_pretrained_num_classes=2']
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/utils.py", line 371, in <lambda>
    overrides=args.overrides,
  File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/hydra.py", line 110, in run
    _ = ret.return_value
  File "/usr/local/lib/python3.7/dist-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/usr/local/lib/python3.7/dist-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "<frozen cv.action_recognition.scripts.train>", line 77, in main
  File "<frozen cv.action_recognition.scripts.train>", line 28, in run_experiment
  File "<frozen cv.action_recognition.model.pl_ar_model>", line 33, in __init__
  File "<frozen cv.action_recognition.model.pl_ar_model>", line 39, in _build_model
  File "<frozen cv.action_recognition.model.build_nn_model>", line 82, in build_ar_model
  File "<frozen cv.action_recognition.model.ar_model>", line 105, in get_basemodel3d
  File "<frozen cv.action_recognition.model.resnet3d>", line 366, in resnet3d
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1672, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for ResNet3d:
	size mismatch for fc_cls.weight: copying a param with shape torch.Size([5, 512]) from checkpoint, the shape in current model is torch.Size([2, 512]).
	size mismatch for fc_cls.bias: copying a param with shape torch.Size([5]) from checkpoint, the shape in current model is torch.Size([2]).

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "</usr/local/lib/python3.7/dist-packages/nvidia_tao_pytorch/cv/action_recognition/scripts/train.py>", line 3, in <module>
  File "<frozen cv.action_recognition.scripts.train>", line 81, in <module>
  File "<frozen cv.super_resolution.scripts.configs.hydra_runner>", line 103, in wrapper
  File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/utils.py", line 368, in _run_hydra
    lambda: hydra.run(
  File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/utils.py", line 251, in run_and_report
    assert mdl is not None
AssertionError
Telemetry data couldn't be sent, but the command ran successfully.
[Error]: 'str' object has no attribute 'decode'
Execution status: FAIL
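The two `size mismatch` lines say the pretrained checkpoint's classification head `fc_cls` has 5 outputs, while the model built from the config has 2; the overrides pass `model_config.rgb_pretrained_num_classes=2`, and the checkpoint file name ends in `hmdb5_32`. That suggests, though nothing in this thread confirms it, that `rgb_pretrained_num_classes` is meant to match the class count of the pretrained checkpoint rather than that of the new dataset. The sketch below mimics the shape comparison with plain tuples so it runs without PyTorch; `shape_mismatches` and `classifier_head_shapes` are hypothetical helpers, not TAO or PyTorch APIs.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def shape_mismatches(checkpoint: Dict[str, Shape], model: Dict[str, Shape]) -> List[str]:
    """List parameters present in both dicts whose shapes differ."""
    errors = []
    for name, ckpt_shape in checkpoint.items():
        model_shape = model.get(name)
        if model_shape is not None and model_shape != ckpt_shape:
            errors.append(
                f"size mismatch for {name}: checkpoint {list(ckpt_shape)}, "
                f"current model {list(model_shape)}"
            )
    return errors


def classifier_head_shapes(num_classes: int, feature_dim: int = 512) -> Dict[str, Shape]:
    # Hypothetical helper: shapes of a linear head named fc_cls (weight is [classes, features])
    return {"fc_cls.weight": (num_classes, feature_dim), "fc_cls.bias": (num_classes,)}


if __name__ == "__main__":
    checkpoint = classifier_head_shapes(num_classes=5)  # what the .tlt file holds, per the log
    for msg in shape_mismatches(checkpoint, classifier_head_shapes(num_classes=2)):
        print(msg)
    print(shape_mismatches(checkpoint, classifier_head_shapes(num_classes=5)))  # []
```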

Is it expected that the Colab notebook installs the nvidia-tao 4.0.0 Python package and not 5.0.0? Or is that still being worked on? So it's not yet possible to run TAO 5.0.0 on Colab?