Please provide the following information when requesting support.
• Hardware (T4/V100/Xavier/Nano/etc): GeForce RTX 3090
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc): Classification
• TLT Version (Please run "tlt info --verbose" and share "docker_tag" here): TAO 4.0
• Training spec file (if you have one, please share it here)
• How to reproduce the issue? (This is for errors. Please share the command line and the detailed log here.)
Hi All,
I would like to deploy TAO 4.0 and use the new AutoML capabilities in standalone mode, without a master/worker node structure. I know the Jupyter notebooks provided with the Getting Started guide explain the bare-metal deployment, but it still looks like I am forced to deploy the whole solution with Ansible and Kubernetes. Is there any other way to deploy the AutoML services and make them work with the TAO Toolkit, without Ansible and Kubernetes?
Then don’t you think there should be a modified script for local deployments, one that doesn’t wipe out the existing NVIDIA driver, Docker volumes, and so on?
I know I could avoid this sort of thing by going all the way down to the YAML files, testing the Ansible commands individually, and commenting out specific parts, but that would still be painful for beginners.
However, I’m stuck at the “Waiting for the Cluster to become available” step, and it is not clear how to troubleshoot whatever is blocking the deployment. I also created a new Ubuntu user and added it to the sudoers file so that the TAO API deployment process would not ask for a sudo password. What is the right way to debug/troubleshoot this issue?
Usually you can grep the playbooks to find where it gets stuck.
For “Waiting for the Cluster to become available”, running the command below locates it.
$ grep -r "Waiting for the Cluster to become available"
cnc/cnc-validation.yaml: - name: Waiting for the Cluster to become available
Then, check the file cnc/cnc-validation.yaml.
- name: Waiting for the Cluster to become available
  args:
    executable: /bin/bash
  shell: |
    state=$(kubectl get pods -n nvidia-gpu-operator | egrep -v 'Running|Completed|NAME' | wc -l)
    while [ $state != 0 ]
    do
      sleep 10
      state=$(kubectl get pods -n nvidia-gpu-operator | egrep -v 'Running|Completed|NAME' | wc -l)
    done
  register: status
  when: "cnc_version > 4.1 and ansible_architecture == 'x86_64'"
From this you can see which command the playbook is running while it waits.
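You can reproduce that wait condition by hand to see which pods are blocking. A minimal sketch of the same counting logic, with the kubectl output stubbed as sample text (the pod names here are made up; on the cluster, pipe the real kubectl get pods -n nvidia-gpu-operator output instead):

```shell
# Sample output standing in for: kubectl get pods -n nvidia-gpu-operator
# (pod names below are illustrative, not from a real cluster)
sample_output='NAME                          READY   STATUS             RESTARTS   AGE
gpu-operator-5f8b9c-xyz       1/1     Running            0          5m
nvidia-driver-daemonset-abc   0/1     CrashLoopBackOff   4          5m'

# Same filter as the playbook: drop the header plus healthy pods,
# then count what remains. A non-zero count keeps the loop waiting.
not_ready=$(printf '%s\n' "$sample_output" | egrep -v 'Running|Completed|NAME' | wc -l)
echo "pods not ready: $not_ready"
```

Any pod counted here (CrashLoopBackOff, Pending, ImagePullBackOff, etc.) is what keeps the playbook stuck; running kubectl describe on that pod usually shows why.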
Also, you can set up a single-node deployment; listing only the master is enough. See the “hosts” file for details.
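For reference, a single-node inventory can point everything at one machine. A minimal sketch, assuming the quickstart’s “hosts” file uses [master] and [nodes] groups (the IP and connection variables are placeholders; match them to whatever your existing hosts file already uses):

```
[master]
192.0.2.10 ansible_ssh_user=ubuntu ansible_ssh_pass=<password>

[nodes]
```

With [nodes] left empty, only the master machine is provisioned.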
I just followed the blog post to set up the TAO API on two machines (one master and one node), and the installation works well.
Could you check your “hosts” file?
The workstation had been off since your answer. I just powered it on to test your instructions and ran bash setup.sh install inside the quickstart_api_bare_metal directory. The process skipped almost every step it performed in my last trial, but got stuck at another step and threw an error this time. Please find the output below:
I suspected that it had removed my NVIDIA drivers but could not set them up again. So I re-installed my GPU drivers using the .run file, re-ran bash setup.sh install, and the result is the same.
Then I ran $ grep -r "Installing the GPU Operator on NVIDIA Cloud Native Core 6.1" to see where the error is thrown, and found that the Kubernetes cluster is not available. Please find the logs below:
$ grep -r "Installing the GPU Operator on NVIDIA Cloud Native Core 6.1"
cnc/cnc-docker.yaml: - name: Installing the GPU Operator on NVIDIA Cloud Native Core 6.1
cnc/cnc-x86-install.yaml: - name: Installing the GPU Operator on NVIDIA Cloud Native Core 6.1
Then I checked the cnc/cnc-docker.yaml file, found the commands there, and entered them manually.
This error was not seen last time. So, could you double-check and use the steps below to restore the environment?
Please re-install the GPU drivers as follows. First, check the installed driver version:
$ nvidia-smi
If it is the 510 driver, you can run the commands below. If it is 470 or another branch, just replace 510 with that number.
sudo apt purge nvidia-driver-510
sudo apt autoremove
sudo apt autoclean
sudo apt install nvidia-driver-510
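The purge/re-install sequence above can be scripted so the branch number is detected instead of typed by hand. A sketch; the nvidia-smi query is left as a comment and the version stubbed, since the exact value depends on the machine, and the apt commands are echoed rather than executed so they can be reviewed first:

```shell
# On the real machine, query the driver version directly:
#   version=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader)
# Stubbed here for illustration:
version="510.85.02"
branch=${version%%.*}    # keep only the major branch, e.g. 510

# Print the re-install sequence with the detected branch filled in.
echo "sudo apt purge nvidia-driver-${branch}"
echo "sudo apt autoremove"
echo "sudo apt autoclean"
echo "sudo apt install nvidia-driver-${branch}"
```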
There has been no update from you for a while, so we assume this is no longer an issue.
Hence we are closing this topic. If you need further support, please open a new one.
Thanks