NVIDIA-TAO-Deploy pycocotools-fix issue - Python Wheels TAO implementation porting from Colab to AWS SageMaker Studio

Please provide the following information when requesting support.

• Hardware (T4 on AWS SageMaker Studio)
• Network Type (LPR/LPDNet)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

I want to trial pre-trained LPD/LPR models as a first step in evaluating NVIDIA software for our client’s problems. However, I have run into some issues. I am trying to set TAO Toolkit up with Python wheels in an AWS SageMaker Studio environment, but hit problems, specifically when installing nvidia-tao-deploy==4.0.0.1. I am trying AWS because, per other forum posts, there is still no fix to the Google Colab issue regarding the Ubuntu version.

To Reproduce:

Resources I used

  1. TAO Toolkit Starter Guide - under the “Python Wheels” option
    TAO Toolkit Quick Start Guide - NVIDIA Docs

  2. The Python Wheels option leads to documentation about running TAO Toolkit on Google Colab:

    1. Running TAO Toolkit on Google Colab - NVIDIA Docs
    2. It is assumed that if it works on Colab, it should work on AWS SageMaker Studio.
  3. Pre-trained Model for Inference Only

    1. Out of all the notebooks found on the above page, we focus on this one as it pertains to inference only, with no training:
      a. Running TAO Toolkit on Google Colab - NVIDIA Docs
  4. Colab Notebook for Inference Only

    1. Google Colab
    2. This is the notebook that NVIDIA provided to set up TAO Toolkit using Python wheels on Google Colab.
    3. This is the one copied into the AWS SageMaker Studio environment.
  5. TAO Dependencies
    Release Notes - NVIDIA Docs

    1. My understanding is that we need TAO 4.0.1
      a. Why?
      i. In the setup_env_colab.sh file, all the pip install commands are run with “python3.8” (the original post included a screenshot here; see the sketch after this list)
      ii. And also, the tao-deploy version is 4……
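
For illustration only, a sketch of what those pinned lines in setup_env_colab.sh look like (representative, not a verbatim copy of NVIDIA’s script):

$ python3.8 -m pip install nvidia-pyindex
$ python3.8 -m pip install nvidia-tao-deploy==4.0.0.1

Because every install goes through the python3.8 interpreter explicitly, the host environment needs a working Python 3.8 - which is what drives the conda setup below.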

1. AWS Environment

  • In AWS SageMaker Studio, there is a limited selection of images available.

Option 1: ml.g4dn.xlarge
TensorFlow 2.12.0 Python 3.10 GPU Optimized. The AWS Deep Learning Containers for TensorFlow 2.12.0 with CUDA 11.8 include containers for training on GPU, optimized for performance and scale on AWS. For more information, see Release Notes for Deep Learning Containers. Ubuntu version: 20.04.

Notes:
- The image has Python 3.10, but the required Python version is 3.8. To cater for this, we will create a conda environment with Python 3.8.

In this environment, I ran through the TAO Deploy notebook - Google Colab. I copied exactly what is required, including manually downloading and untarring TensorRT.

However, when it came to installing dependencies, I ran into issues, probably due to the image having Python 3.10 while the dependencies and TAO Toolkit require 3.8. I then created a conda environment with Python 3.8 to try to remedy this.

2. Create Conda Environment
Launched a terminal in the current SageMaker image (the original post included a screenshot of the launcher button here):

Install miniconda:
https://docs.conda.io/projects/miniconda/en/latest/index.html#quick-command-line-install

mkdir -p ~/miniconda3

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh
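
A quick check (my addition, not part of the linked instructions) that the install succeeded:

~/miniconda3/bin/conda --version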

Initialise newly-installed Miniconda:
- ~/miniconda3/bin/conda init bash

Switch to bash:
- bash

Create and activate conda env:
- conda create -n py38_ry_test python=3.8
- conda activate py38_ry_test
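
A quick sanity check (my addition, not in the original notebook) that the right interpreter is now active:
- python --version   (should report Python 3.8.x)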

Install dependencies:
- pip install --upgrade pip
- pip install cython
- pip install nvidia-ml-py
- pip install nvidia-pyindex
- pip install --upgrade setuptools
- conda deactivate

After deactivating conda, I tried installing pycuda. THIS WORKED ONE TIME last week (pip stated “requirement already satisfied”) but not this time:
- pip install pycuda==2020.1 ***********FAIL

Re-activating the environment did not help either:
- conda activate py38_ry_test
- pip install pycuda==2020.1 ***********FAIL

(One plausible explanation, not confirmed here: pycuda ships as a source distribution and needs CUDA build headers at install time, which this image may not expose.)

3. Trying another image with CUDA 11.8
I then tried another image, this one having pytorch 2.0.0 with cuda 11.8:

PyTorch 2.0.0 Python 3.10 GPU Optimized The AWS Deep Learning Containers for PyTorch 2.0.0 with CUDA 11.8 include containers for training on GPU, optimized for performance and scale on AWS. For more information, see Release Notes for Deep Learning Containers. pytorch-2.0.0-gpu-py310 Python 3 Python 3.10

This fixed it - I could install pycuda==2020.1.

I then installed TensorRT like so:
pip install tensorrt==8.5.1.7

This worked.
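
As a further sanity check (my addition, not part of the original steps), both installs can be verified from Python:

python -c "import pycuda; print(pycuda.VERSION_TEXT)"
python -c "import tensorrt; print(tensorrt.__version__)"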

4. TAO Deploy == 4.0.0.1 Issue
However, I am now met with another issue, the one I am currently stuck on…

pip install nvidia-tao-deploy==4.0.0.1**** FAIL

I will spare you my attempts to debug this… I have tried everything I could find online relating to Cython, pycocotools, etc., but I just can’t get this to work.

Can someone please provide guidance on how to fix this?

May I confirm that in your latest triage you are using the image below from Available Amazon SageMaker Images - Amazon SageMaker, right?

Also, could you please upload the full log as a txt file when you run pip install nvidia-tao-deploy==4.0.0.1? Thanks a lot.

Yes, that image is the one used.

Uploaded txt file below:
pycocotools_error_tao.txt (21.7 KB)

Thanks. Could you please check whether it is Ubuntu 20.04?


Yep, it is!

For the above error, could you run the command below? Refer to pycocotools/_mask.c:547:21: fatal error: maskApi.h: No such file or directory · Issue #141 · cocodataset/cocoapi · GitHub
$ pip install Cython

Cython is already installed. I’ve tried all the solutions in that post.

Please run the following commands and retry.

   $  pip install Cython==0.29.36
   $  sudo apt install libopenmpi-dev
   $  pip install mpi4py
   $  pip install nvidia-tao-deploy==4.0.0.1
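
A note on why this likely works (my reading, not stated in the thread): pycocotools source builds are known to fail under Cython 3.x, so pinning Cython to a 0.29.x release lets its C extension compile, while libopenmpi-dev supplies the MPI headers mpi4py needs to build. The pinned versions can be confirmed afterwards with:

   $  pip show Cython mpi4py nvidia-tao-deploy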

BTW, I mimicked the AWS container in the way below and could reproduce the “maskApi.c” errors and fix them with the above commands.


$ docker run --runtime=nvidia -it nvidia/cuda:11.8.0-devel-ubuntu20.04 /bin/bash
$ apt update
$ apt-get install sudo
$ apt install build-essential libbz2-dev libdb-dev libreadline-dev libffi-dev libgdbm-dev liblzma-dev libncursesw5-dev libsqlite3-dev libssl-dev zlib1g-dev uuid-dev tk-dev wget liblapack-dev graphviz fonts-humor-sans git -y
$ export VER=3.10.10
$ wget "https://www.python.org/ftp/python/$VER/Python-$VER.tgz" && tar -xzvf Python-$VER.tgz && cd Python-$VER && ./configure --enable-optimizations --with-lto && make && make install
$ cp /Python-3.10.10/python /usr/local/bin/
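# Note (my reading, not from the original post): "make install" already places
# python3.10 under /usr/local/bin; this cp additionally exposes a plain "python" binary.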
$ pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
$ mkdir -p ~/miniconda3
$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
$ bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
$ ~/miniconda3/bin/conda init bash
$ sudo cp /root/miniconda3/bin/conda /usr/bin/conda
$ conda create -n launcher python=3.8
$ conda activate launcher
$ pip install --upgrade pip
$ pip install cython
$ pip install nvidia-ml-py
$ pip install nvidia-pyindex
$ pip install --upgrade setuptools
$ pip install pycuda==2020.1
$ pip install https://files.pythonhosted.org/packages/d1/c2/c14dd8884a5bc05ca07331b3d78a92812eb19e25a625a0b59af8b609a93f/nvidia_eff_tao_encryption-0.1.7-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
$ pip install https://files.pythonhosted.org/packages/cf/ec/47f770919111bcd7047e463389e7f763afbc6ae7b96cbd4be974342a5bb1/nvidia_eff-0.6.2-py38-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
$ pip install cffi
$ pip install nvidia-tao-deploy==4.0.0.1
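
If the reproduction succeeds, a minimal import check (the module name is my assumption, based on the wheel name) is:

$ python -c "import nvidia_tao_deploy; print('ok')"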

Thank you! This did it for me. I did as suggested, and Cython==0.29.36 worked.

I then added my conda environment as a Jupyter kernel using:
pip install ipykernel
python -m ipykernel install --user --name env_name --display-name "name of your choosing"

Then I rebooted the notebook and selected the conda env as the kernel.
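
To confirm the kernel registered (a standard Jupyter check, not from the original post):

jupyter kernelspec list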

I pip installed everything again, including the packages listed above by @Morganh, making sure to install this NumPy version (presumably because newer NumPy releases removed deprecated aliases that TAO-era code still relies on):
!pip install numpy==1.23.4

A few more tweaks to the boilerplate code and I got outputs!

Hoping to trial the LPDNet + vehicle models soon. Thanks so much.
