Mismatch in python environment on AWS EC2 image

Hello,

I’m having one hell of a time trying to get TAO Toolkit working on an AWS EC2 image following the instructions at:

https://docs.nvidia.com/tao/tao-toolkit/text/running_in_cloud/running_tao_toolkit_on_aws.html

I’m at steps 5 and 6:

5. Start an EC2 Virtual Machine Instance. For running TAO Toolkit , use the NVIDIA Deep Learning Amazon Machine Instance (AMI). To use this AMI, select the **AWS Marketplace** and search for the **NVIDIA Deep Learning AMI**.

**Note** The Amazon EC2 P3 and G4 instances are optimized for the NVIDIA Volta/Turing GPUs.

6. Select one of the Amazon EC2 P3 and G4 instance types according to your P3 and G4 instance types.

I have to select one from the following available list:

so my guess is to use:

Q : is this guess correct?

If I copy that AMI across to my own account and run it, it has a version of Python installed (Python 3.8.10) that is not supported by the TAO Toolkit (python >=3.6.9<3.7).
Q: Is this correct? What do I need to do to get this working?
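
For anyone wanting to check this on their own instance, here is a minimal sketch that compares an interpreter version against the documented range (>=3.6.9, <3.7). The function name `python_version_supported` is made up for illustration; only the version range comes from the Software Requirements table.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds only for versions in the range >=3.6.9,<3.7
# listed in the TAO Toolkit software requirements.
python_version_supported() {
    local major minor patch
    major=$(printf '%s\n' "$1" | cut -d. -f1)
    minor=$(printf '%s\n' "$1" | cut -d. -f2)
    # Drop any non-numeric suffix from the patch field (for example "0rc1").
    patch=$(printf '%s\n' "$1" | cut -d. -f3 | sed 's/[^0-9].*//')
    [ "$major" = 3 ] && [ "$minor" = 6 ] && [ "${patch:-0}" -ge 9 ]
}

# Report on the system interpreter, if there is one.
if command -v python3 >/dev/null 2>&1; then
    current=$(python3 -c 'import platform; print(platform.python_version())')
    if python_version_supported "$current"; then
        echo "python $current is in the supported range"
    else
        echo "python $current is outside >=3.6.9,<3.7"
    fi
fi
```

With the Python 3.8.10 that ships on the AMI, the check fails, which matches the mismatch described above.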

Small note: step 2 says "Once you have logged in, select **Compute** under EC2", but there is no Compute option in EC2 (anymore?).

Q: If I do use that AMI, which of the prerequisites are already installed and which are not?

Q: Is there a single docker image available that I can run on that AMI so that all prerequisites are already installed?

Refer to TAO Toolkit Quick Start Guide - NVIDIA Docs
Once you have installed miniconda , create a new environment by setting the Python version to 3.6.

Is there no single end-to-end list of steps describing how to get a working TAO Toolkit?
I tried pausing the video, but some of the commands that were entered were only on screen for a few frames. Even then I tried to follow them and failed.
Can the exact steps be added to a script, or at least a list of the commands, please?

Even the TAO Toolkit Quick Start Guide - NVIDIA Docs first has a list of Software Requirements:

| Software | Version | Comment |
|---|---|---|
| Ubuntu LTS | 20.04 | |
| python | >=3.6.9<3.7 | Not needed if you use TAO toolkit API |
| docker-ce | >19.03.5 | Not needed if you use TAO toolkit API |
| docker-API | 1.40 | Not needed if you use TAO toolkit API |
| nvidia-container-toolkit | >1.3.0-1 | Not needed if you use TAO toolkit API |
| nvidia-container-runtime | 3.4.0-1 | Not needed if you use TAO toolkit API |
| nvidia-docker2 | 2.5.0-1 | Not needed if you use TAO toolkit API |
| nvidia-driver | >520 | Not needed if you use TAO toolkit API |
| python-pip | >21.06 | Not needed if you use TAO toolkit API |

I don’t see miniconda in the Software Requirements.
Is this a conditional requirement if Python is not the right version?

Do I need to manually find where to install these from, or should I ignore the Software Requirements table and follow the “Getting Started” instructions

wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/tao/tao-getting-started/versions/5.0.0/zip -O getting_started_v5.0.0.zip
unzip -u getting_started_v5.0.0.zip  -d ./getting_started_v5.0.0 && rm -rf getting_started_v5.0.0.zip && cd ./getting_started_v5.0.0

and then run the setup/quickstart_launcher.sh in that ./getting_started_v5.0.0 directory ?

or both ? And how does the setup/quickstart_launcher.sh know which of the

|–> quickstart_api_bare_metal
|–> quickstart_api_aws_eks
|–> quickstart_api_azure_aks
|–> quickstart_api_gcp_gke

to run ?

Yes, it is a conditional requirement.

For installing miniconda, please refer to TAO Toolkit Quick Start Guide - NVIDIA Docs

NVIDIA recommends setting up a python environment using miniconda. The following instructions show how to setup a python conda environment.

  1. Follow the instructions in this link to set up a conda environment using a miniconda.
  2. Once you have installed miniconda, create a new environment by setting the Python version to 3.6.
    conda create -n launcher python=3.6
  3. Activate the conda environment that you have just created.
    conda activate launcher
  4. Once you have activated your conda environment, the command prompt should show the name of your conda environment.
    (launcher) py-3.6.9 desktop:
  5. When you are done with your session, you may deactivate your conda environment using the deactivate command:
    conda deactivate
  6. You may re-instantiate this created conda environment using the following command.
    conda activate launcher
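
The steps above are written for an interactive shell. When they run from a script (for example EC2 user data), `conda activate` usually needs conda's shell hook sourced first. Here is a rough sketch of that. The function name `setup_launcher_env` and the `CONDA_ROOT` variable are illustrative, and the default path assumes miniconda was installed under `$HOME/miniconda3`.

```shell
#!/usr/bin/env bash
# Create and activate the "launcher" environment from a non-interactive script.
# CONDA_ROOT is an assumption: point it at your own miniconda install.
setup_launcher_env() {
    local conda_sh="${CONDA_ROOT:-$HOME/miniconda3}/etc/profile.d/conda.sh"
    if [ ! -r "$conda_sh" ]; then
        echo "conda.sh not found at $conda_sh" >&2
        return 1
    fi
    # Scripts do not load ~/.bashrc, so source conda's shell hook explicitly;
    # without it "conda activate" is not available.
    . "$conda_sh" &&
        conda create -y -n launcher python=3.6 &&
        conda activate launcher
}

# Usage, once miniconda is installed:
#   setup_launcher_env && python --version
```

The `-y` flag answers the confirmation prompt, which is convenient when nobody is at the terminal.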

Please check the Software Requirements. Things to check:

  • OS version (Ubuntu 18.04 or 20.04).
  • Whether docker is already available (usually yes on a user’s machine).
  • nvidia-driver version. For TAO 5.0, you can install 525 (sudo apt install nvidia-driver-525).
  • Whether nvidia-docker2 is available. If not, please run sudo apt-get install nvidia-docker2 and sudo systemctl restart docker.service.
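
The checks in this list can be run in one go with a small preflight script. This is only a sketch: the helper names (`os_supported`, `have_command`, `preflight`) are invented here, and it assumes an Ubuntu host where `/etc/os-release` and `dpkg` exist.

```shell
#!/usr/bin/env bash
# Hypothetical helper: accepts the Ubuntu releases named in the list above.
os_supported() {
    case "$1" in
        18.04|20.04) return 0 ;;
        *) return 1 ;;
    esac
}

# Prints "ok" or "missing" for a command name.
have_command() {
    if command -v "$1" >/dev/null 2>&1; then echo ok; else echo missing; fi
}

preflight() {
    local version_id="unknown"
    # /etc/os-release defines VERSION_ID on Ubuntu (for example 20.04).
    if [ -r /etc/os-release ]; then
        version_id=$(. /etc/os-release && echo "${VERSION_ID:-unknown}")
    fi
    if os_supported "$version_id"; then
        echo "os: $version_id ok"
    else
        echo "os: $version_id is not 18.04 or 20.04"
    fi
    echo "docker: $(have_command docker)"
    echo "nvidia-smi: $(have_command nvidia-smi)"
    # nvidia-docker2 is a Debian package, so ask dpkg rather than PATH.
    if dpkg -s nvidia-docker2 >/dev/null 2>&1; then
        echo "nvidia-docker2: ok"
    else
        echo "nvidia-docker2: missing (sudo apt-get install nvidia-docker2)"
    fi
}

preflight
```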

Other tip: New computer install GPU Docker error - #6 by david9xqqb

setup/quickstart_launcher.sh installs the TAO launcher only.
After that you can run tao info, tao ssd, etc.

Thank you, I appreciate your help a lot !!

Would it be easier for everyone to have all these instructions in one location in the documentation, so people can follow them step by step and don’t have to ask?

I’m a bit confused about whether to install miniconda or not. It’s not in the requirements, but I guess I should, because this particular NVIDIA AMI has a newer version of Python installed?

So the software requirement for the nvidia-driver is 525 if installing TAO 5.0. Are there more of these version dependencies? And is it correct that the particular AMI that I found in the AWS Marketplace has the wrong version of nvidia-driver installed?

How does everyone else get TAO Toolkit 5.0 running? I have a feeling I’m missing something (might be some brain cells on my part).

Please refer to TAO Toolkit Quick Start Guide - NVIDIA Docs. It provides the guideline. It also mentions that a Python environment using miniconda is recommended for getting the required Python version (>=3.6.9).

For TAO 5.0, the driver version reported by nvidia-smi is expected to be >520. You can check with $ nvidia-smi. In practice, some networks also work without any issue when it is lower than 520; >520 is suggested to make sure every network works.
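
A scripted version of that check might look like the sketch below. The helper name `driver_new_enough` is made up, and comparing only the major version number is an assumption about how ">520" should be read.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when the driver's major version is above 520.
# Only the part before the first dot is compared (an assumption).
driver_new_enough() {
    local major="${1%%.*}"
    [ "$major" -gt 520 ] 2>/dev/null
}

if command -v nvidia-smi >/dev/null 2>&1; then
    # --query-gpu prints just the version string (for example 525.85.12);
    # head keeps the first GPU on multi-GPU instances.
    version=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
    if driver_new_enough "$version"; then
        echo "driver $version satisfies >520"
    else
        echo "driver $version is not above 520; some networks may still work"
    fi
else
    echo "nvidia-smi not found; is the NVIDIA driver installed?"
fi
```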

After following the instructions I’ve been able to update my cloudformation.yaml file, and it looks like the tao CLI is now running:

Description: 'MTData EC2 Training Instance CloudFormation Template'
Parameters:
  KeyName:
    Description: Name of an existing EC2 KeyPair to enable SSH access to the instance
    Type: AWS::EC2::KeyPair::KeyName
    ConstraintDescription: must be the name of an existing EC2 KeyPair.
  JupyterToken:
    Description: Token value used to access the JupyterLab server through the web browser.
    Type: String

Resources:
  EC2Instance:
    Type: AWS::EC2::Instance
    Properties: 
      ImageId: ami-06xxxxxxxxxxxxxx
      KeyName: !Ref 'KeyName'
      InstanceType: g4dn.xlarge 
      SubnetId: subnet-b8811abc
      SecurityGroupIds:
        - sg-009abcd12Afb441bc
      Tags:
        - Key: Name
          Value: a-training-vm
        - Key: Owner
          Value: Tom
        - Key: Environment
          Value: NONPROD
        - Key: Project
          Value: Retraining
        - Key: Customer
          Value: ACustomer
      UserData:
        Fn::Base64: 
          !Sub |
            #!/bin/bash -x 
            {
              echo "Following the Launcher CLI installation instructions at https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_quick_start_guide.html#running-tao-toolkit" 
              echo '=== install nvidia docker ===' 
              export distribution=$(. /etc/os-release;echo $ID$VERSION_ID) 
              env > /var/log/my_env.log
              echo '=== checking for network ===' 
              retry=0
              max_retries=5
              while ! ping -c 1 -W 1 8.8.8.8; do
                  sleep 10
                  echo "waiting for network"
                  ((retry++))
                  if [ $retry -ge $max_retries ]; then
                      echo "Network not available, exiting."
                      exit 1
                  fi
              done
              echo '=== install nvidia-docker2 ===' 
              sudo mkdir -p /usr/share/keyrings
              curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | gpg --dearmor | sudo tee /usr/share/keyrings/nvidia-docker-archive-keyring.gpg > /var/log/my_curl.log 2>&1 || echo "Failed here" >> /var/log/my_debug.log
              echo "deb [signed-by=/usr/share/keyrings/nvidia-docker-archive-keyring.gpg] https://nvidia.github.io/nvidia-docker/$distribution/ nvidia-docker main" | sudo tee /etc/apt/sources.list.d/nvidia-docker.list > /dev/null
              sudo apt-get update 
              sudo apt-get install -y nvidia-docker2  
              echo '=== setup docker password to login to nvcr.io ==='
              export DOCKER_PASSWORD=[this is the real docker password] 
              export DOCKER_USERNAME="\$oauthtoken" 
              export D_TOKEN=$(echo -n "$DOCKER_USERNAME:$DOCKER_PASSWORD" | base64 -w 0) 
              mkdir -p ~/.docker/ && echo "{ \"auths\": { \"nvcr.io\": { \"auth\":\"$D_TOKEN\" } } }" > ~/.docker/config.json 
              echo "{ \"default-runtime\": \"nvidia\", \"runtimes\": { \"nvidia\": { \"path\": \"nvidia-container-runtime\", \"args\": [] } } }" | sudo tee /etc/docker/daemon.json > /dev/null 
              sudo usermod -aG docker root
              sudo usermod -aG docker ubuntu
              sudo systemctl restart docker 
              docker login nvcr.io 
              echo "a quick docker GPU test"
              sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi 
              echo '=== install miniconda ==='
              export HOME=/root
              mkdir -p ~/miniconda3 
              wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh 
              bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3 
              rm -rf ~/miniconda3/miniconda.sh 
              export PATH=/root/miniconda3/bin:$PATH
              echo '=== setup python 3.6 environment and activate ==='
              conda create -n launcher python=3.6 
              conda init bash
              source /root/miniconda3/etc/profile.d/conda.sh
              conda activate launcher 
              echo '=== install TAO toolkit ==='
              wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/tao/tao-getting-started/versions/5.0.0/zip -O ~/getting_started_v5.0.0.zip 
              unzip -u ~/getting_started_v5.0.0.zip  -d ~/getting_started_v5.0.0 && rm -rf ~/getting_started_v5.0.0.zip 
              echo '=== install TAO Launcher ===' 
              cd ~/getting_started_v5.0.0/ 
              chmod +x ~/getting_started_v5.0.0/setup/quickstart_launcher.sh 
              sudo usermod -aG docker root
              sudo usermod -aG docker ubuntu
              yes | ~/getting_started_v5.0.0/setup/quickstart_launcher.sh --install 
              sudo -H sh -c "yes | ~/getting_started_v5.0.0/setup/quickstart_launcher.sh --install"
              whoami
              sudo tao --help
              sudo which tao
              tao --help
              which tao
              echo "the end"
            } > /var/log/user-data-output.log 2>&1

  
Outputs:
  PublicDnsName:
    Value: !GetAtt [EC2Instance, PublicDnsName]
    Description: Public DNS name

The nvidia-driver version is already at version 525:

# nvidia-smi
Wed Sep 27 02:43:24 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+

So I think this is what I was after with this question.
I hope the cloudformation.yaml will help someone (just update the docker password and use the correct AMI).
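
For anyone reusing the template, here is a hedged sketch of launching it with the AWS CLI. It assumes the CLI is installed and configured, and that the template is saved as cloudformation.yaml in the current directory. The stack name, key pair name and token are placeholders; `KeyName` and `JupyterToken` are the two parameters the template declares, and neither has a default, so both must be supplied.

```shell
#!/usr/bin/env bash
# Create or update the stack from the template.
# Arguments: stack name, EC2 key pair name, JupyterLab token (all placeholders).
deploy_training_stack() {
    local stack="$1" key_name="$2" jupyter_token="$3"
    aws cloudformation deploy \
        --template-file cloudformation.yaml \
        --stack-name "$stack" \
        --parameter-overrides "KeyName=$key_name" "JupyterToken=$jupyter_token"
}

# Read back the PublicDnsName output declared at the end of the template.
stack_public_dns() {
    aws cloudformation describe-stacks \
        --stack-name "$1" \
        --query "Stacks[0].Outputs[?OutputKey=='PublicDnsName'].OutputValue" \
        --output text
}

# Usage:
#   deploy_training_stack tao-training my-keypair my-token
#   ssh ubuntu@"$(stack_public_dns tao-training)"
```

The user data script logs to /var/log/user-data-output.log, so that file is the first place to look if tao is not available after the instance boots.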

Thank you @Morganh !!
