After following the instructions, I've been able to update my cloudformation.yaml file, and it looks like the TAO CLI is now running:
Description: 'MTData EC2 Training Instance CloudFormation Template'
Parameters:
  KeyName:
    Description: Name of an existing EC2 KeyPair to enable SSH access to the instance
    Type: AWS::EC2::KeyPair::KeyName
    ConstraintDescription: must be the name of an existing EC2 KeyPair.
  JupyterToken:
    Description: Token value used to access the JupyterLab server through the web browser.
    Type: String
Resources:
  EC2Instance:
    Type: AWS::EC2::Instance
    Properties:
      ImageId: ami-06xxxxxxxxxxxxxx
      KeyName: !Ref 'KeyName'
      InstanceType: g4dn.xlarge
      SubnetId: subnet-b8811abc
      SecurityGroupIds:
        - sg-009abcd12Afb441bc
      Tags:
        - Key: Name
          Value: a-training-vm
        - Key: Owner
          Value: Tom
        - Key: Environment
          Value: NONPROD
        - Key: Project
          Value: Retraining
        - Key: Customer
          Value: ACustomer
      UserData:
        Fn::Base64:
          !Sub |
            #!/bin/bash -x
            {
              echo "Following the Launcher CLI installation instructions at https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_quick_start_guide.html#running-tao-toolkit"
              echo '=== install nvidia docker ==='
              export distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
              env > /var/log/my_env.log
              echo '=== checking for network ==='
              # give up after max_retries failed pings
              retry=0
              max_retries=5
              while ! ping -c 1 -W 1 8.8.8.8; do
                sleep 10
                echo "waiting for network"
                ((retry++))
                if [ $retry -ge $max_retries ]; then
                  echo "Network not available, exiting."
                  exit 1
                fi
              done
              echo '=== install nvidia-docker2 ==='
              sudo mkdir -p /usr/share/keyrings
              curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | gpg --dearmor | sudo tee /usr/share/keyrings/nvidia-docker-archive-keyring.gpg > /var/log/my_curl.log 2>&1 || echo "Failed here" >> /var/log/my_debug.log
              echo "deb [signed-by=/usr/share/keyrings/nvidia-docker-archive-keyring.gpg] https://nvidia.github.io/nvidia-docker/$distribution/ nvidia-docker main" | sudo tee /etc/apt/sources.list.d/nvidia-docker.list > /dev/null
              sudo apt-get update
              sudo apt-get install -y nvidia-docker2
              echo '=== setup docker password to login to nvcr.io ==='
              # replace the placeholder below with your NGC API key
              export DOCKER_PASSWORD=[this is the real docker password]
              export DOCKER_USERNAME="\$oauthtoken"
              export D_TOKEN=$(echo -n "$DOCKER_USERNAME:$DOCKER_PASSWORD" | base64 -w 0)
              mkdir -p ~/.docker/ && echo "{ \"auths\": { \"nvcr.io\": { \"auth\":\"$D_TOKEN\" } } }" > ~/.docker/config.json
              echo "{ \"default-runtime\": \"nvidia\", \"runtimes\": { \"nvidia\": { \"path\": \"nvidia-container-runtime\", \"args\": [] } } }" | sudo tee /etc/docker/daemon.json > /dev/null
              sudo usermod -aG docker root
              sudo usermod -aG docker ubuntu
              sudo systemctl restart docker
              docker login nvcr.io
              echo "a quick docker test that fails...."
              sudo docker run --rm --gpus all nvidia/cuda1.0.3-base nvidia-smi
              echo '=== install miniconda ==='
              export HOME=/root
              mkdir -p ~/miniconda3
              wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
              bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
              rm -rf ~/miniconda3/miniconda.sh
              export PATH=/root/miniconda3/bin:$PATH
              echo '=== setup python 3.6 environment and activate ==='
              # -y answers the confirmation prompt, since nobody is at the terminal
              conda create -y -n launcher python=3.6
              conda init bash
              source /root/miniconda3/etc/profile.d/conda.sh
              conda activate launcher
              echo '=== install TAO toolkit ==='
              wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/tao/tao-getting-started/versions/5.0.0/zip -O ~/getting_started_v5.0.0.zip
              unzip -u ~/getting_started_v5.0.0.zip -d ~/getting_started_v5.0.0 && rm -rf ~/getting_started_v5.0.0.zip
              echo '=== install TAO Launcher ==='
              cd ~/getting_started_v5.0.0/
              chmod +x ~/getting_started_v5.0.0/setup/quickstart_launcher.sh
              sudo usermod -aG docker root
              sudo usermod -aG docker ubuntu
              yes | ~/getting_started_v5.0.0/setup/quickstart_launcher.sh --install
              sudo -H sh -c "yes | ~/getting_started_v5.0.0/setup/quickstart_launcher.sh --install"
              whoami
              sudo tao --help
              sudo which tao
              tao --help
              which tao
              echo "the end"
            } > /var/log/user-data-output.log 2>&1
Outputs:
  PublicDnsName:
    Value: !GetAtt [EC2Instance, PublicDnsName]
    Description: Public DNS name
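In case it is useful, the network check in the UserData script could be pulled into a small reusable helper. This is only a sketch: the function name retry_until_ok and its argument order are my own invention, not part of the TAO instructions. It deliberately avoids the ${name} form of variable expansion, because Fn::Sub would try to substitute that when the script sits inside the template.

```shell
#!/bin/bash
# Hypothetical helper (not from the TAO docs): run a command until it
# succeeds, making at most max_attempts attempts and sleeping delay
# seconds between them.
# Usage: retry_until_ok <max_attempts> <delay_seconds> <command> [args...]
retry_until_ok() {
  local max_attempts=$1
  local delay=$2
  shift 2
  local attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed, retrying in $delay seconds" >&2
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Example mirroring the template's network check (left commented out so
# that sourcing this file has no side effects):
# retry_until_ok 5 10 ping -c 1 -W 1 8.8.8.8 || exit 1
```

The same helper could wrap other steps that depend on the network, such as the apt-get update call, but I have not tried that on the instance.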
The NVIDIA driver is already at version 525:
# nvidia-smi
Wed Sep 27 02:43:24 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
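If you want the script to check the driver version itself rather than reading the nvidia-smi banner by eye, something like the following might work. It is a sketch under assumptions: the helper names driver_major and driver_at_least are made up, and the threshold of 525 in the usage comment is taken from the output above purely as an illustration, not as a documented TAO requirement.

```shell
#!/bin/bash
# Hypothetical helpers (names are my own) for checking the driver version.

# Print the major part of a version string such as 525.85.12 (prints 525).
driver_major() {
  printf '%s\n' "$1" | cut -d. -f1
}

# Succeed if the major version of $1 is at least the integer $2.
driver_at_least() {
  local have=$1
  local want=$2
  [ "$(driver_major "$have")" -ge "$want" ]
}

# On the instance the version can be read without parsing the banner
# (left commented out because it needs a GPU and the driver):
# version=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
# driver_at_least "$version" 525 && echo "driver OK: $version"
```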
So I think this is what I was after with this question.
I hope the cloudformation.yaml will help someone (just update the Docker password and use the correct AMI).
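For anyone swapping in their own Docker password: the nvcr.io login in the template boils down to base64-encoding "$oauthtoken:<password>" and writing it into ~/.docker/config.json. A minimal sketch of that step as a function is below. The function name make_nvcr_auth_json and the variable NGC_API_KEY are assumptions for illustration. It uses base64 piped through tr instead of base64 -w 0 so that it does not depend on the GNU-only flag.

```shell
#!/bin/bash
# Hypothetical helper (name is my own): print a Docker config.json body
# that authenticates against nvcr.io with the given password.
make_nvcr_auth_json() {
  local password=$1
  # nvcr.io expects this literal user name, dollar sign included
  local username='$oauthtoken'
  local token
  # strip newlines so the token stays on one line, like base64 -w 0
  token=$(printf '%s:%s' "$username" "$password" | base64 | tr -d '\n')
  printf '{ "auths": { "nvcr.io": { "auth": "%s" } } }\n' "$token"
}

# Usage (NGC_API_KEY is an assumed variable name; left commented out so
# that sourcing this file does not touch ~/.docker):
# mkdir -p ~/.docker && make_nvcr_auth_json "$NGC_API_KEY" > ~/.docker/config.json
```

Reading the password from a variable or a CloudFormation parameter, instead of pasting it into the template, also keeps it out of anything you share.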
Thank you @Morganh !!