TAO bare-metal installation fails

I want to install the TAO 5.5.0 API with quickstart_api_bare_metal. However, I get the following error when I run setup.sh.

fatal: [127.0.0.1]: FAILED! => {"changed": true, "cmd": "helm upgrade --install --reset-values --cleanup-on-fail --create-namespace --namespace default --atomic --wait tao-api https://helm.ngc.nvidia.com/nvidia/tao/charts/tao-api-5.5.0 --values /tmp/tao-toolkit-api-helm-values.yml --username='$oauthtoken' --password=<my-key>", "delta": "0:00:00.789794", "end": "2024-11-12 18:47:09.835907", "msg": "non-zero return code", "rc": 1, "start": "2024-11-12 18:47:09.046113", "stderr": "Error: failed to fetch https://helm.ngc.nvidia.com/nvidia/tao/charts/tao-api-5.5.0 : 500 Internal Server Error", "stderr_lines": ["Error: failed to fetch https://helm.ngc.nvidia.com/nvidia/tao/charts/tao-api-5.5.0 : 500 Internal Server Error"], "stdout": "Release \"tao-api\" does not exist. Installing it now.", "stdout_lines": ["Release \"tao-api\" does not exist. Installing it now."]}

My OS is Ubuntu 22.04, and the GPU is a 2080 Ti.
K8s version is 1.27.6

There is a known issue on the page Setup - NVIDIA Docs.

We have already fixed it, but the page has not been refreshed yet.

Please refer to the steps below.

Deployment Steps

  1. Download the necessary software by cloning the tao_tutorials repository and checking out the v5.5.0 tag.
git clone https://github.com/NVIDIA/tao_tutorials.git

cd tao_tutorials

git checkout v5.5.0

  2. Change current directory:
cd setup/quickstart_api_bare_metal
  3. Set up proxy and custom CA certificates.

     • If applicable, make sure your deployment machine has Internet access.

     • Make sure the following environment variables are properly set:

       • HTTP_PROXY, HTTPS_PROXY
       • http_proxy, https_proxy
       • NO_PROXY
     • If you are using a custom CA SSL certificate, copy the certificate bundle locally:
cp <path>/<certificate bundle file>.crt ./my-cert.crt
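
As a rough sketch, the proxy variables listed above could be checked before running setup.sh with a small shell function like the one below. The function name check_proxy_vars is made up for illustration, and the lowercase https_proxy is assumed to be the intended counterpart of HTTPS_PROXY.

```shell
# Hypothetical helper: print the name of every proxy variable that is
# unset or empty, and return non-zero if at least one is missing.
check_proxy_vars() {
    missing=0
    for name in HTTP_PROXY HTTPS_PROXY http_proxy https_proxy NO_PROXY; do
        # Look up the value of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name"
            missing=1
        fi
    done
    return "$missing"
}
```

Calling check_proxy_vars in the shell that will run setup.sh prints nothing when all five variables are set. If your machine has direct Internet access and needs no proxy, missing names are expected and can be ignored.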

  4. The remote node users must have sudo privileges. Execute the following on each node (the example assumes a user named ubuntu):

echo "ubuntu ALL=(ALL) NOPASSWD:ALL" | sudo tee -a /etc/sudoers
  5. Add content to your inventory.
vi hosts
     • Use either a password (ansible_ssh_pass) or an SSH private key file (ansible_ssh_private_key_file) for credentials.
     • The following is an example with user/password credentials.
[master]
127.0.0.2 ansible_ssh_user='ubuntu' ansible_ssh_pass='password' ansible_ssh_extra_args='-o StrictHostKeyChecking=no'
[nodes]
127.0.0.2 ansible_ssh_user='ubuntu' ansible_ssh_pass='password' ansible_ssh_extra_args='-o StrictHostKeyChecking=no'
     • The following is an example with an SSH key. You can generate a local SSH key using ssh-keygen, then copy your public key to the remote node(s) using ssh-copy-id.
[master]
1.1.1.1 ansible_ssh_user='ubuntu' ansible_ssh_private_key_file='/home/user/.ssh/id_rsa'
[nodes]
1.1.1.2 ansible_ssh_user='ubuntu' ansible_ssh_private_key_file='/home/user/.ssh/id_rsa'

     • Use the following command to validate the SSH credentials for the remote node(s). A proper response would be "root".
ssh ubuntu@127.0.0.2 'sudo whoami'
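
The SSH key setup and the credential check described above could be scripted roughly as follows. This is only a sketch: the helper names setup_ssh_key, inventory_hosts, and validate_hosts are hypothetical, and it assumes the inventory uses the simple layout shown in the examples (one host per line, with the address as the first field).

```shell
# Hypothetical helper: create an RSA key pair at the given path if it does
# not exist yet, then copy the public key to each user@host target given.
setup_ssh_key() {
    key="$1"
    shift
    # -N "" sets an empty passphrase, -q suppresses the progress output.
    [ -f "$key" ] || ssh-keygen -q -t rsa -N "" -f "$key" || return 1
    for target in "$@"; do
        ssh-copy-id -i "$key.pub" "$target" || return 1
    done
}

# Hypothetical helper: print the unique host addresses in an inventory file,
# skipping blank lines, [group] headers, and # comments.
inventory_hosts() {
    awk 'NF { c = substr($1, 1, 1); if (c != "[" && c != "#") print $1 }' "$1" | sort -u
}

# Hypothetical helper: run "sudo whoami" on every host in the inventory.
# $1 = inventory file, $2 = SSH user. Each host should answer "root".
validate_hosts() {
    for host in $(inventory_hosts "$1"); do
        printf '%s: ' "$host"
        ssh "$2@$host" 'sudo whoami' </dev/null || echo "FAILED"
    done
}
```

For example, setup_ssh_key ~/.ssh/id_rsa ubuntu@1.1.1.1 ubuntu@1.1.1.2 followed by validate_hosts hosts ubuntu would cover the key-based inventory shown above. ssh-copy-id will still prompt for each node's password once.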
  6. Set your deployment parameters, such as chart version, NGC credentials, etc.
vi deploy.yml

Below is an example.

ngc_api_key: YzZtczM5amdtdDcwNjk...
ngc_email: johndoe@mycorp.com
chart: https://helm.ngc.nvidia.com/nvidia/tao/charts/tao-toolkit-api-5.5.0.tgz
chart_values: ./tao-toolkit-api-helm-values.yml
cluster_name: tao-api-demo
  7. Optionally, you can add any values that you would like to override while installing the API chart. This is an uncommon use case.
vi tao-toolkit-api-helm-values.yml
  8. Proceed with deployment.
bash setup.sh install

If you need to completely remove the installed Kubernetes services from your machine, you can use the following command.

bash setup.sh uninstall

Thanks for your help. It did make some progress. However, I ran into another error.

fatal: [127.0.0.1]: FAILED! => {"changed": true, "cmd": "helm upgrade --install --reset-values --cleanup-on-fail --create-namespace --namespace default --atomic --wait tao-api https://helm.ngc.nvidia.com/nvidia/tao/charts/tao-toolkit-api-5.5.0.tgz --values /tmp/tao-toolkit-api-helm-values.yml --username='$oauthtoken' --password=<my-token>", "delta": "0:05:03.056649", "end": "2024-11-14 09:02:57.118460", "msg": "non-zero return code", "rc": 1, "start": "2024-11-14 08:57:54.061811", "stderr": "W1114 08:57:56.750101 1852114 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead\nW1114 08:57:56.750126 1852114 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead\nW1114 08:57:56.750185 1852114 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead\nW1114 08:57:56.750195 1852114 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead\nW1114 08:57:56.750226 1852114 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead\nW1114 08:57:56.750238 1852114 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead\nW1114 08:57:56.750322 1852114 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead\nW1114 08:57:56.750330 1852114 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead\nW1114 08:57:56.750444 1852114 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead\nW1114 08:57:56.750527 1852114 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead\nError: release tao-api 
failed, and has been uninstalled due to atomic being set: context deadline exceeded", "stderr_lines": ["W1114 08:57:56.750101 1852114 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead", "W1114 08:57:56.750126 1852114 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead", "W1114 08:57:56.750185 1852114 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead", "W1114 08:57:56.750195 1852114 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead", "W1114 08:57:56.750226 1852114 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead", "W1114 08:57:56.750238 1852114 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead", "W1114 08:57:56.750322 1852114 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead", "W1114 08:57:56.750330 1852114 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead", "W1114 08:57:56.750444 1852114 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead", "W1114 08:57:56.750527 1852114 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead", "Error: release tao-api failed, and has been uninstalled due to atomic being set: context deadline exceeded"], "stdout": "Release \"tao-api\" does not exist. Installing it now.", "stdout_lines": ["Release \"tao-api\" does not exist. Installing it now."]}
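
Most of the output above is repeated deprecation warnings. The actual failure is the final "context deadline exceeded" line, and the reported delta of about five minutes is consistent with Helm's default --timeout of 5m0s expiring while --wait was still waiting for the release to become ready. As a sketch, the error lines can be pulled out of a saved copy of the Ansible output like this (helm_errors is a hypothetical helper name, and the input is assumed to be the fatal: line saved to a file):

```shell
# Hypothetical helper: print each distinct "Error: ..." message found in a
# saved copy of the Ansible output, dropping the repeated warnings.
helm_errors() {
    # -o prints only the matching part. The match stops at the first double
    # quote or backslash, which is where the JSON-escaped message ends.
    grep -o 'Error: [^"\\]*' "$1" | sort -u
}
```

Running helm_errors on the saved output should leave only the line that explains why the release was rolled back. To see why the pods never became ready, it may help to watch kubectl get pods in the release namespace while the install is still waiting.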

Here is my deploy.yml

name: 'AI-Training-PC'
spec:
  cns:
    enable_mig: no
    mig_profile: all-disabled
    mig_strategy: single
    gpu_driver_version: "535.161.08"
    # will override existing drivers if present
    install_driver: false
  tao:
    ngc_api_key: <my-key>
    ngc_email: <my-email>
    chart: https://helm.ngc.nvidia.com/nvidia/tao/charts/tao-toolkit-api-5.5.0.tgz
    chart_values: |
      ---
    cluster_name: tao-api-demo

Could you share the full log? Thanks.

Here is the full log
tao_install.txt (96.2 KB)

Could you please search for which file contains "kubernetes.io/ingress.class"?
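
One way to run that search, as a sketch: unpack the chart archive into a temporary directory and grep it for the annotation. The helper name find_in_chart is made up for illustration, and it assumes the chart .tgz has already been downloaded locally.

```shell
# Hypothetical helper: list the files inside a chart archive that contain a
# given fixed string. $1 = path to the .tgz, $2 = string to search for.
find_in_chart() {
    workdir=$(mktemp -d) || return 1
    tar -xzf "$1" -C "$workdir" || { rm -rf "$workdir"; return 1; }
    # -r recurses, -l lists file names only, -F treats $2 as a fixed string.
    (cd "$workdir" && grep -rlF -- "$2" .)
    status=$?
    rm -rf "$workdir"
    return "$status"
}
```

For example: find_in_chart tao-toolkit-api-5.5.0.tgz 'kubernetes.io/ingress.class'. The paths printed are relative to the root of the unpacked archive.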

kubernetes.io/ingress.class is inside tao-toolkit-api-5.5.0.tgz. Here is ingress-login.yaml:

# Copyright (c) 2024, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: {{ .Release.Name }}-ingress-login
  namespace: {{ .Release.Namespace }}
  annotations:
    kubernetes.io/ingress.class: "nginx"
    nginx.ingress.kubernetes.io/proxy-buffer-size: 128k
    nginx.ingress.kubernetes.io/proxy-http-version: "1.1"
    nginx.ingress.kubernetes.io/connection-proxy-header: ""
{{- if .Values.tlsSecret }}
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
    nginx.ingress.kubernetes.io/server-snippet: |
      error_page 497 https://$server_name:$server_port$request_uri;
{{- end }}
    nginx.ingress.kubernetes.io/use-regex: "true"
    nginx.ingress.kubernetes.io/rewrite-target: /api/v1/login
{{- if .Values.corsOrigin }}
    nginx.ingress.kubernetes.io/enable-cors: "true"
    nginx.ingress.kubernetes.io/cors-allow-origin: {{ .Values.corsOrigin }}
{{- end }}
    nginx.ingress.kubernetes.io/configuration-snippet: |
      more_set_headers "X-Frame-Options: SAMEORIGIN";
      more_set_headers "X-XSS-Protection: 1; mode=block";
{{- if .Values.tlsSecret }}
      more_set_headers "Strict-Transport-Security: max-age=31536000; includeSubDomains; preload";
{{- end }}
spec:
{{- if .Values.tlsSecret }}
  tls:
  - secretName: {{ .Values.tlsSecret }}
{{- if .Values.host }}
    hosts:
    - {{ .Values.host }}
{{- end }}
{{- end }}
  rules:
  - http:
      paths:
      - path: /{{ .Release.Namespace }}/api/v1/login
        pathType: Prefix
        backend:
          service:
            name: {{ .Release.Name }}-service
            port:
              number: 8000
{{- if eq .Release.Namespace "default" }}
      - path: /api/v1/login
        pathType: Prefix
        backend:
          service:
            name: {{ .Release.Name }}-service
            port:
              number: 8000
{{- end }}
{{- if and .Values.host }}
    host: {{ .Values.host }}
{{- end }}