Please provide the following information when requesting support.
• Hardware (T4 on AWS Sagemaker Studio)
• Network Type (LPR/LPDNet - )
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)
I want to trial the pre-trained LPD/LPR models as a first step in trialling NVIDIA software for our client's problems, but I have run into some issues. I am trying to set up TAO Toolkit with Python wheels in an AWS SageMaker Studio environment, and the setup fails specifically when installing nvidia-tao-deploy==4.0.0.1. I am trying AWS because, per other forum posts, there is still no fix for the Google Colab issue regarding the Ubuntu version.
To Reproduce:
Resources I used:
- TAO Toolkit Starter Guide, under the "Python Wheels" option: TAO Toolkit Quick Start Guide - NVIDIA Docs
  - The "Python Wheels" option leads to the documentation for running TAO Toolkit on Google Colab: Running TAO Toolkit on Google Colab - NVIDIA Docs
  - My assumption is that if it works on Colab, it should also work on AWS SageMaker Studio.
- Pre-trained model for inference only: Running TAO Toolkit on Google Colab - NVIDIA Docs
  - Of all the notebooks listed on that page, we focus on this one because it pertains to inference only, with no training.
- Colab notebook for inference only: Google Colab
  - This is the notebook NVIDIA provides to set up TAO Toolkit with Python wheels on Google Colab.
  - This is the one I copied into the AWS SageMaker Studio environment.
- TAO dependencies: Release Notes - NVIDIA Docs
  - My understanding is that we need TAO 4.0.1. Why?
    - In the setup_env_colab.sh file, all the pip install commands are prefixed with "python3.8" (rough sketch below).
    - Also, the tao-deploy version pinned there is 4…
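For illustration only, the pinned install lines I mean are roughly of this form (paraphrased from memory, not a verbatim copy of setup_env_colab.sh), together with a quick check of whether the interpreter they expect even exists on a given image:
# Rough form of the installs in setup_env_colab.sh (assumption, not the exact contents):
python3.8 -m pip install <tao-dependency>==<pinned-version>
# Quick sanity check on a fresh image: is python3.8 present at all?
python3.8 --version || echo "python3.8 not found on this image"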
1. AWS Environment
- In AWS SageMaker Studio, there is a limited selection of images available.
Option 1: ml.g4dn.xlarge instance with the "TensorFlow 2.12.0 Python 3.10 GPU Optimized" image (AWS Deep Learning Container for TensorFlow 2.12.0 with CUDA 11.8, Ubuntu 20.04).
Notes:
- The image has Python 3.10, but the required Python version is 3.8. To cater for this, I created a conda environment with Python 3.8 (step 2 below).
- I ran through the TAO Deploy notebook (Google Colab), copying exactly what is required, including manually downloading and untarring TensorRT.
- However, when it came to installing the dependencies I ran into issues, probably because the image is Python 3.10 while the dependencies and TAO Toolkit require 3.8; the Python 3.8 conda environment was my attempt to remedy this. A few quick checks on the base image are sketched below.
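For reference, a few quick checks that confirm what the base image provides (expected values taken from the image description above; this assumes nvcc is on the image's PATH):
python3 --version                 # Python 3.10 on this image
nvcc --version | grep release     # CUDA 11.8 per the image description
grep VERSION_ID /etc/os-release   # Ubuntu 20.04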
2. Create Conda Environment
Launched a terminal in the current SageMaker image via the launcher button in the notebook interface.
Install miniconda:
https://docs.conda.io/projects/miniconda/en/latest/index.html#quick-command-line-install
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh
Initialise newly-installed Miniconda:
- ~/miniconda3/bin/conda init bash
Switch to bash:
- bash
Create and activate conda env:
- conda create -n py38_ry_test python=3.8
- conda activate py38_ry_test
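A quick check, before installing anything, that the activated environment really resolves to Python 3.8 rather than the image's 3.10:
python --version   # expect Python 3.8.x inside the activated env
which python       # should point into ~/miniconda3/envs/py38_ry_test
which pip          # should be the env's pip, not the base image's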
Install dependencies:
- pip install --upgrade pip
- pip install cython
- pip install nvidia-ml-py
- pip install nvidia-pyindex
- pip install --upgrade setuptools
Deactivate conda and install pycuda (this worked one time last week, pip stated "requirement already satisfied", but not this time):
- conda deactivate
- pip install pycuda==2020.1   ← FAIL
Reactivate the environment and try again:
- conda activate py38_ry_test
- pip install pycuda==2020.1   ← FAIL
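One thing I have seen suggested for pycuda source-build failures (my assumption, not something the TAO or notebook docs call out) is making sure nvcc and the CUDA libraries are visible inside the conda environment before running pip; I have not confirmed this is the actual cause on the TensorFlow image, since switching images (step 3) sidestepped it:
# Assumption: pycuda fails to build because nvcc is not visible inside the conda env.
export CUDA_ROOT=/usr/local/cuda-11.8            # guess; adjust to wherever CUDA lives on this image
export PATH=$CUDA_ROOT/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_ROOT/lib64:$LD_LIBRARY_PATH
pip install pycuda==2020.1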
3. Trying another image with CUDA 11.8
I then tried another image, this one with PyTorch 2.0.0 and CUDA 11.8:
Option 2: "PyTorch 2.0.0 Python 3.10 GPU Optimized" image (pytorch-2.0.0-gpu-py310, AWS Deep Learning Container for PyTorch 2.0.0 with CUDA 11.8, Python 3.10).
This fixed it: I could install pycuda==2020.1.
I then installed TensorRT like so:
pip install tensorrt==8.5.1.7
This worked.
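As a quick smoke test (nothing TAO-specific, just confirming both packages import and can see the GPU):
python -c "import tensorrt as trt; print(trt.__version__)"                            # expect 8.5.1.7
python -c "import pycuda.driver as cuda; cuda.init(); print(cuda.Device(0).name())"   # expect Tesla T4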
4. nvidia-tao-deploy==4.0.0.1 Issue
However, I am now met with another issue, the one I am currently stuck on:
pip install nvidia-tao-deploy==4.0.0.1   ← FAIL
I will spare you the details of my debugging attempts… I have tried everything I could find online relating to cython, pycocotools, etc., but I just can't get this to work.
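I can re-run the failing install and attach the full output; this is roughly what I would run to capture it (the log file name is just my own choice):
pip install nvidia-tao-deploy==4.0.0.1 --verbose 2>&1 | tee tao_deploy_install.log
pip --version && python --version   # environment details to include alongside the log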
Can someone please provide guidance on how to fix this?