How to deploy TAO 4.0 (with AutoML support) without Kubernetes?

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc): GeForce RTX 3090
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc): Classification
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here): TAO 4.0
• Training spec file(If have, please share here)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

Hi All,

I would like to deploy TAO 4.0 and use the new AutoML capabilities in standalone mode, without a master-slave node structure. I know that the Jupyter notebooks provided with the Getting Started Guide explain the bare-metal deployment, but it still looks like I am forced to deploy the whole solution using Ansible and Kubernetes. Is there any other way to deploy the AutoML services and make them work with the TAO Toolkit without Ansible and Kubernetes?

This is not supported, because AutoML is a service built on top of the TAO REST API.
More info can be found in AutoML — TAO Toolkit 4.0 documentation

Then don’t you think there should be a modified script for local deployments that doesn’t wipe out the existing NVIDIA driver, Docker volumes, etc.?

I know I can avoid this sort of thing if I go all the way down to the YAML files, test the Ansible commands individually, and comment out specific parts, but that would still be a pain for beginners.

You can refer to the steps in https://developer.nvidia.com/blog/training-like-an-ai-pro-using-tao-automl/
Please note that the “one-click deploy” tar file only becomes available after you run:
$ ngc registry resource download-version "nvidia/tao/tao-getting-started:4.0.0"

Yeah, a good manual for installing on a local host would help. I lost a few working days trying to reconfigure my computer again…

Edit: 3 working days lost trying to configure everything manually, always with the same result.

I tried to test the “one-click-deploy” file by following the steps from the tutorial: https://developer.nvidia.com/blog/training-like-an-ai-pro-using-tao-automl/

However, I’m stuck at the “Waiting for the Cluster to become available” step, and it is not clear how I can troubleshoot the problem that is blocking the deployment. I also created a new Ubuntu user and added it to the sudoers file so that the TAO API deployment process wouldn’t need to ask for a sudo password. What is the right way to debug this issue?

Usually you can grep for the log message to find where the process gets stuck.
For “Waiting for the Cluster to become available”, the command below shows which file the message comes from.
$ grep -r "Waiting for the Cluster to become available"
cnc/cnc-validation.yaml: - name: Waiting for the Cluster to become available

Then, check the file cnc/cnc-validation.yaml.

- name: Waiting for the Cluster to become available
  args:
    executable: /bin/bash
  shell: |
    state=$(kubectl get pods -n nvidia-gpu-operator | egrep -v 'Running|Completed|NAME' | wc -l)
    while [ $state != 0 ]
      do
        sleep 10
        state=$(kubectl get pods -n nvidia-gpu-operator | egrep -v 'Running|Completed|NAME' | wc -l)
      done
  register: status
  when: "cnc_version > 4.1 and ansible_architecture == 'x86_64'"
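
The task above keeps looping for as long as any pod in the nvidia-gpu-operator namespace is neither Running nor Completed, which is why the installer can appear to hang at this step. As a rough sketch, the same check can be run by hand with a timeout so that it reports the stuck pods instead of waiting silently. The function names and the 600-second default below are illustrative assumptions, not part of the TAO quickstart scripts.

```shell
#!/usr/bin/env bash
# Sketch only: count_not_ready and wait_for_gpu_operator are hypothetical helper
# names, not part of the TAO quickstart scripts.

# Read `kubectl get pods` output on stdin and count the pods that are neither
# Running nor Completed (same filter as the task in cnc/cnc-validation.yaml).
count_not_ready() {
  awk 'NF && !/Running|Completed|NAME/ { n++ } END { print n + 0 }'
}

# Poll the namespace every 10 seconds until all pods are ready or the timeout
# (in seconds, default 600) expires. Like the original task, an empty
# namespace counts as "ready".
wait_for_gpu_operator() {
  local timeout="${1:-600}" waited=0 pods pending
  while :; do
    if ! pods=$(kubectl get pods -n nvidia-gpu-operator); then
      echo "kubectl failed; is the cluster up?" >&2
      return 2
    fi
    pending=$(printf '%s\n' "$pods" | count_not_ready)
    [ "$pending" -eq 0 ] && return 0
    if [ "$waited" -ge "$timeout" ]; then
      echo "Still $pending pod(s) not ready after ${timeout}s:" >&2
      printf '%s\n' "$pods" >&2
      return 1
    fi
    sleep 10
    waited=$((waited + 10))
  done
}

# Usage (run manually in a terminal on the master node):
#   wait_for_gpu_operator 300
```

If the function returns 1, the pods it prints are the ones to inspect further.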

There you can see which command is being run.

You can also set up a single-node deployment; listing only the master is enough. See the “hosts” file for details.

I just followed the blog to set up the TAO API on two machines (one master and one node), and the installation worked well.
Could you check .hosts file?

Can you open a new terminal to run
$ kubectl get pods

And then check some pods via
$ kubectl describe pods <pod name>
$ kubectl logs <pod name>
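
When several pods are unhealthy, the describe and logs steps above can be repeated for each of them in one pass. The following is a minimal sketch under that assumption; the helper names are hypothetical, and the 50-line log tail is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Sketch only: not_ready_pods and inspect_not_ready_pods are hypothetical
# helper names.

# Read `kubectl get pods -A --no-headers` output on stdin
# (columns: NAMESPACE NAME READY STATUS ...) and print "namespace name"
# for every pod whose STATUS is neither Running nor Completed.
not_ready_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1, $2 }'
}

# Describe each unhealthy pod and show the last 50 log lines of its containers.
inspect_not_ready_pods() {
  kubectl get pods -A --no-headers | not_ready_pods |
    while read -r ns pod; do
      echo "=== $ns/$pod ==="
      kubectl describe pod "$pod" -n "$ns"
      kubectl logs "$pod" -n "$ns" --all-containers --tail=50
    done
}

# Usage (run manually in a terminal on the master node):
#   inspect_not_ready_pods
```

The Events section at the end of each describe output is usually the quickest place to see why a pod is not starting.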

The workstation has been off since your answer. I just powered it on to test your instructions and ran the bash setup.sh install command inside the quickstart_api_bare_metal directory. The process skipped almost every step it performed in my last trial, but this time it got stuck at a different step and threw an error. Please find the output below:

I suspected that it had removed my NVIDIA drivers and could not set them up again. So I re-installed my GPU drivers using the .run file and re-ran the bash setup.sh install command, but the result was the same.

Then I ran $ grep -r "Installing the GPU Operator on NVIDIA Cloud Native Core 6.1" to see where the error is thrown and found out that the Kubernetes cluster is not available. Please find the logs below:

$ grep -r "Installing the GPU Operator on NVIDIA Cloud Native Core 6.1"
cnc/cnc-docker.yaml:   - name: Installing the GPU Operator on NVIDIA Cloud Native Core 6.1
cnc/cnc-x86-install.yaml:   - name: Installing the GPU Operator on NVIDIA Cloud Native Core 6.1

Then I checked the cnc/cnc-docker.yaml file, found the following commands, and ran them manually:

...
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia --force-update
helm repo update
helm install --version 1.10.1 --create-namespace --namespace nvidia-gpu-operator --devel nvidia/gpu-operator --set driver.enabled=false,toolkit.enabled=false --wait --generate-name
...

I believe bringing up the Kubernetes cluster was supposed to be a task of the automation process. How can I resolve this?

Thanks

This error was not seen last time. Could you double check and use the steps below to restore the environment?
Please re-install the GPU drivers as follows.

  1. Run $ nvidia-smi to check the installed driver version.
  2. If it is the 510 driver, run the commands below. If it is 470 or another version, replace 510 with that version.
    sudo apt purge nvidia-driver-510
    sudo apt autoremove
    sudo apt autoclean
    sudo apt install nvidia-driver-510
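
To avoid editing the version number by hand, the major version can be derived from what nvidia-smi reports. The sketch below only prints the four commands from the steps above for review; it does not run them. The function names are hypothetical, and it assumes the driver was installed from the nvidia-driver-<major> apt packages.

```shell
#!/usr/bin/env bash
# Sketch only: driver_major and print_reinstall_commands are hypothetical
# helper names. Assumes the driver comes from the nvidia-driver-<major>
# apt packages.

# Extract the major version from a full driver version, e.g. 510.47.03 gives 510.
driver_major() {
  printf '%s\n' "${1%%.*}"
}

# Print (not run) the reinstall commands for the given full driver version.
print_reinstall_commands() {
  local major
  major=$(driver_major "$1")
  printf 'sudo apt purge nvidia-driver-%s\n' "$major"
  printf 'sudo apt autoremove\n'
  printf 'sudo apt autoclean\n'
  printf 'sudo apt install nvidia-driver-%s\n' "$major"
}

# Usage (requires a working nvidia-smi):
#   version=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | sed -n 1p)
#   print_reinstall_commands "$version"
```

Printing the commands first makes it easy to confirm the version before anything is purged.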

There has been no update from you for a while, so we assume this is no longer an issue.
Hence we are closing this topic. If you need further support, please open a new one.
Thanks

Please run the workaround below when the issue happens.
$ kubectl delete crd clusterpolicies.nvidia.com