Installing NVIDIA TAO Toolkit API

Hi,

I am trying to install the NVIDIA TAO Toolkit API (our DGX was replaced, so I’m back doing this :) )

I noticed that the bare-metal install script checks the Ubuntu version and sets up the CNC version accordingly:

 if [[ ${os} == "Ubuntu" && ${os_version} == "20.04"  ]]; then
    cp cnc/cnc_values_6.1.yaml cnc/cnc_values.yaml

This means cnc_values_6.1.yaml gets copied to cnc/cnc_values.yaml, so cnc_values.yaml will contain

cnc_version: 6.1

Then, in cnc/prerequisites.yaml:

    - name: Install kubernetes components for Ubuntu on NVIDIA Cloud Native Core 6.1
      become: true
      when: "cnc_version == 6.1 and ansible_distribution == 'Ubuntu' and 'running' not in k8sup.stdout"
      apt:
        name: ['apt-transport-https', 'curl', 'ca-certificates', 'gnupg-agent' ,'software-properties-common', 'kubelet=1.23.5-00', 'kubeadm=1.23.5-00', 'kubectl=1.23.5-00']
        state: present
        update_cache: true

The problem is that 1.23 seems to be deprecated and has been removed from the package archives:

Hit:1 http://gb.archive.ubuntu.com/ubuntu focal InRelease
Get:2 http://gb.archive.ubuntu.com/ubuntu focal-updates InRelease [114 kB]
Hit:3 http://gb.archive.ubuntu.com/ubuntu focal-backports InRelease 
Hit:4 http://gb.archive.ubuntu.com/ubuntu focal-security InRelease
Ign:5 https://packages.cloud.google.com/apt kubernetes-xenial InRelease
Err:6 https://packages.cloud.google.com/apt kubernetes-xenial Release
  404  Not Found [IP: 142.250.180.14 443]
Reading package lists... Done
E: The repository 'https://apt.kubernetes.io kubernetes-xenial Release' does not have a Release file.
N: Updating from such a repository can't be done securely, and is therefore disabled by default.
N: See apt-secure(8) manpage for repository creation and user configuration details.

I have read the guidance on migrating to the new package repositories, but

curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.23/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg

gives

curl: (22) The requested URL returned error: 403 

error.
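To see which minor versions still publish a signing key, something like the sketch below could help. It only rebuilds the URL pattern from the curl command above and reports the HTTP status for each version. The function names (release_key_url, probe_minors) are made up for illustration, and nothing here is part of the TAO scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of the TAO or CNC scripts.

# Build the Release.key URL for a given Kubernetes minor version (e.g. v1.24),
# following the URL pattern used in the curl command above.
release_key_url() {
  local minor="$1"
  printf 'https://pkgs.k8s.io/core:/stable:/%s/deb/Release.key\n' "$minor"
}

# Print "<minor> <http status>" for every minor version given as an argument.
probe_minors() {
  local minor code
  for minor in "$@"; do
    # -o /dev/null discards the body; -w prints only the final status code.
    code=$(curl -sL -o /dev/null -w '%{http_code}' "$(release_key_url "$minor")") \
      || code="curl-failed"
    printf '%s %s\n' "$minor" "$code"
  done
}

# Example (needs network access):
#   probe_minors v1.23 v1.24 v1.28
```

A 200 would suggest the key is still published for that version; a 403 matches the failure shown above.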

Can you please help me?

Is there a way to get 1.23, or can I bump cnc_version from 6.1 to 7.0, which seems to be available from the repo?

Thanks a lot,

Cheers,
Ganindu.

Thanks for the info. I will try to reproduce.

Thanks Morgan!

It seems the installation gets stuck as shown below. You see a similar log, right?

local-morganh@ipp1-0631:~/forum_285780/getting_started_v5.2.0/setup/quickstart_api_bare_metal$ bash setup.sh install
...
...
TASK [Report Internet Connection status] ****************************************************************************************************************************************************
ok: [10.117.8.47] => {
    "msg": "Internet Connection status 200"
}

TASK [Install Internet Speed dependencies] **************************************************************************************************************************************************
fatal: [10.117.8.47]: FAILED! => {"changed": false, "msg": "Failed to update apt cache: W:Updating from such a repository can't be done securely, and is therefore disabled by default., W:See apt-secure(8) manpage for repository creation and user configuration details., E:The repository 'https://apt.kubernetes.io kubernetes-xenial Release' no longer has a Release file."}

PLAY RECAP **********************************************************************************************************************************************************************************
10.117.8.47                : ok=7    changed=1    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0

local-morganh@ipp1-0631:~/forum_285780/getting_started_v5.2.0/setup/quickstart_api_bare_metal$
local-morganh@ipp1-0631:~/forum_285780/getting_started_v5.2.0/setup/quickstart_api_bare_metal$
local-morganh@ipp1-0631:~/forum_285780/getting_started_v5.2.0/setup/quickstart_api_bare_metal$ sudo apt-get update
Get:1 http://bootstrap.scadvs.nvidia.com/ubuntu20 focal InRelease [265 kB]
Get:2 http://bootstrap.scadvs.nvidia.com/ubuntu20 focal-backports InRelease [108 kB]
Hit:3 http://bootstrap.scadvs.nvidia.com/ubuntu20 focal-updates InRelease
Hit:4 http://bootstrap.scadvs.nvidia.com/ubuntu20 focal-security InRelease
Get:5 https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/amd64  InRelease [1484 B]
Hit:6 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  InRelease
Hit:7 https://nvidia.github.io/nvidia-docker/ubuntu18.04/amd64  InRelease
Ign:8 https://packages.cloud.google.com/apt kubernetes-xenial InRelease
Hit:10 http://ipp-1-u1.clouds.archive.ubuntu.com/ubuntu focal InRelease
Hit:11 http://security.ubuntu.com/ubuntu focal-security InRelease
Err:9 https://packages.cloud.google.com/apt kubernetes-xenial Release
  404  Not Found [IP: 142.251.214.142 443]
Hit:12 http://ipp-1-u1.clouds.archive.ubuntu.com/ubuntu focal-updates InRelease
Hit:13 http://ipp-1-u1.clouds.archive.ubuntu.com/ubuntu focal-backports InRelease
Reading package lists... Done
E: The repository 'https://apt.kubernetes.io kubernetes-xenial Release' no longer has a Release file.
N: Updating from such a repository can't be done securely, and is therefore disabled by default.
N: See apt-secure(8) manpage for repository creation and user configuration details.
W: Target Packages (main/binary-amd64/Packages) is configured multiple times in /etc/apt/sources.list.d/kubernetes.list:1 and /etc/apt/sources.list.d/kubernetes.list:2
W: Target Packages (main/binary-i386/Packages) is configured multiple times in /etc/apt/sources.list.d/kubernetes.list:1 and /etc/apt/sources.list.d/kubernetes.list:2
W: Target Packages (main/binary-all/Packages) is configured multiple times in /etc/apt/sources.list.d/kubernetes.list:1 and /etc/apt/sources.list.d/kubernetes.list:2
W: Target Translations (main/i18n/Translation-en) is configured multiple times in /etc/apt/sources.list.d/kubernetes.list:1 and /etc/apt/sources.list.d/kubernetes.list:2
W: Target DEP-11 (main/dep11/Components-amd64.yml) is configured multiple times in /etc/apt/sources.list.d/kubernetes.list:1 and /etc/apt/sources.list.d/kubernetes.list:2
W: Target DEP-11 (main/dep11/Components-all.yml) is configured multiple times in /etc/apt/sources.list.d/kubernetes.list:1 and /etc/apt/sources.list.d/kubernetes.list:2
W: Target DEP-11-icons-small (main/dep11/icons-48x48.tar) is configured multiple times in /etc/apt/sources.list.d/kubernetes.list:1 and /etc/apt/sources.list.d/kubernetes.list:2
W: Target DEP-11-icons (main/dep11/icons-64x64.tar) is configured multiple times in /etc/apt/sources.list.d/kubernetes.list:1 and /etc/apt/sources.list.d/kubernetes.list:2
W: Target DEP-11-icons-hidpi (main/dep11/icons-64x64@2.tar) is configured multiple times in /etc/apt/sources.list.d/kubernetes.list:1 and /etc/apt/sources.list.d/kubernetes.list:2
W: Target CNF (main/cnf/Commands-amd64) is configured multiple times in /etc/apt/sources.list.d/kubernetes.list:1 and /etc/apt/sources.list.d/kubernetes.list:2
W: Target CNF (main/cnf/Commands-all) is configured multiple times in /etc/apt/sources.list.d/kubernetes.list:1 and /etc/apt/sources.list.d/kubernetes.list:2
local-morganh@ipp1-0631:~/forum_285780/getting_started_v5.2.0/setup/quickstart_api_bare_metal$

Well, that is the root cause, but I went beyond that by manually bypassing some checks.

Here is my slightly modified cnc-installation.yaml:

    - name: Report Internet Connection status
      failed_when: "connection.status == -1"
      debug:
        msg: "Internet Connection status {{ connection.status }}"

    - name: Install Internet Speed dependencies
      when: connection.status != '-1'
      become: true
      apt:
        name: ['speedtest-cli']
        state: present
        update_cache: yes

    - name: Check Internet Speed
      ignore_errors: true
      failed_when: false
      shell: speedtest-cli --simple
      register: speed

    - name: Report Valid Internet Speed
      shell: echo {{ speed.stdout_lines[1] }} | awk '{print $3}'
      register: speedtest
      ignore_errors: true
      failed_when: "'Kbit/s' in speedtest.stdout"

    - name: Check DNS Configuration
      shell: dig google.com +cmd +noall +answer
      register: dns
      failed_when: "dns.stdout | length < 0"

    - name: Check Google Repo access
      register: google_repo
      failed_when: "google_repo.status != 200"
      uri:
        url: https://cloud.google.com/artifact-registry/
        timeout: 5

And in cnc_values.yaml you might have:

## Kubernetes apt resources
k8s_apt_key: "https://packages.cloud.google.com/apt/doc/apt-key.gpg"
k8s_apt_repository: "deb https://apt.kubernetes.io/ kubernetes-xenial main"
k8s_registry: "k8s.gcr.io"

Even though the key may be there, the packages no longer seem to be.

That is when I tried to investigate (skip the playbook's step that adds the repos, add them myself, and let the playbook do the installation).

This topic suggests we add the keys manually.

In the lines below you can see that a 1.24 key exists in the community repos, and I can update the debs for that:

 curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.23/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
File '/etc/apt/keyrings/kubernetes-apt-keyring.gpg' exists. Overwrite? (y/N) curl: (22) The requested URL returned error: 403 
y
gpg: no valid OpenPGP data found.
g@gsrv:~$ curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.24/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
File '/etc/apt/keyrings/kubernetes-apt-keyring.gpg' exists. Overwrite? (y/N) y

This is the same guidance as in the official repo migration documentation.

In the sources list I have the problematic line:

deb [signed-by=/etc/apt/keyrings/kubernetes-archive-keyring.gpg] https://apt.kubernetes.io/ kubernetes-xenial main
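For anyone checking their own machine, a rough way to find leftover entries that still point at the legacy repo is sketched below. The function name is hypothetical, and it assumes GNU grep (for --include), which is what Ubuntu ships.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list apt source entries that still reference the legacy
# Kubernetes repositories (apt.kubernetes.io or packages.cloud.google.com/apt).
# Defaults to /etc/apt; a different directory can be passed for testing.
find_legacy_k8s_entries() {
  local root="${1:-/etc/apt}"
  # -r recurse, -n show line numbers, -E extended regex; only *.list files.
  grep -rnE --include='*.list' \
    'apt\.kubernetes\.io|packages\.cloud\.google\.com/apt' "$root"
}

# Example:
#   find_legacy_k8s_entries
```

Each hit is printed as file:line:content, so the offending line can be commented out or replaced by hand.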

The next logical solution seems to be adding the line below (ignoring for a moment that there was no key for 1.23):

echo "deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.23/deb/ /" | sudo tee /etc/apt/sources.list.d/kubernetes.list

However, if we do an update, we get:

sudo apt update 
Hit:1 http://gb.archive.ubuntu.com/ubuntu focal InRelease
Hit:2 http://gb.archive.ubuntu.com/ubuntu focal-updates InRelease
Hit:3 http://gb.archive.ubuntu.com/ubuntu focal-backports InRelease
Hit:4 http://gb.archive.ubuntu.com/ubuntu focal-security InRelease
Err:5 https://prod-cdn.packages.k8s.io/repositories/isv:/kubernetes:/core:/stable:/v1.23/deb  InRelease
  403  Forbidden [IP: 18.164.68.17 443]
Reading package lists... Done
E: Failed to fetch https://pkgs.k8s.io/core:/stable:/v1.23/deb/InRelease  403  Forbidden [IP: 18.164.68.17 443]
E: The repository 'https://pkgs.k8s.io/core:/stable:/v1.23/deb  InRelease' is not signed.
N: Updating from such a repository can't be done securely, and is therefore disabled by default.
N: See apt-secure(8) manpage for repository creation and user configuration details.

We can also see that 1.23 is being removed, from this thread in the Kubernetes GitHub repo.

So I thought, why not try 1.24?

So I changed my /etc/apt/sources.list.d/kubernetes.list

to have the line

deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.24/deb/ /

and now an apt update runs without the earlier errors:

sudo apt update 
Hit:1 http://gb.archive.ubuntu.com/ubuntu focal InRelease
Hit:2 http://gb.archive.ubuntu.com/ubuntu focal-updates InRelease
Hit:3 http://gb.archive.ubuntu.com/ubuntu focal-backports InRelease
Hit:4 http://gb.archive.ubuntu.com/ubuntu focal-security InRelease
Get:5 https://prod-cdn.packages.k8s.io/repositories/isv:/kubernetes:/core:/stable:/v1.24/deb  InRelease [1,192 B]
Get:6 https://prod-cdn.packages.k8s.io/repositories/isv:/kubernetes:/core:/stable:/v1.24/deb  Packages [26.5 kB]
Fetched 27.7 kB in 1s (28.4 kB/s)
Reading package lists... Done
Building dependency tree       
Reading state information... Done
31 packages can be upgraded. Run 'apt list --upgradable' to see them.

So what I am getting at here is:

  1. somehow the Kubernetes folks, in their infinite wisdom, have decided to get rid of 1.23
  2. however, we can still get 1.24

Then comes my question: I can bypass the checking steps as shown at the top, and make sure the conditions for v1.24 are set manually by doing the following:

  1. getting the keys
  2. updating /etc/apt/sources.list.d/kubernetes.list to point to v1.24
  3. running apt update, to make sure that when I run the commands to install k8s 1.24 (or Ansible does) it is there to be installed
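The three steps above could be scripted roughly as follows. This is only a sketch built from the commands already shown in this thread: the function names are made up, the key is downloaded to a temporary file first so that a 403 does not clobber an existing keyring, and gpg is given --yes to skip the overwrite prompt seen earlier.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; paths are the ones used earlier in this thread.
KEYRING=/etc/apt/keyrings/kubernetes-apt-keyring.gpg
LIST=/etc/apt/sources.list.d/kubernetes.list

# Print the apt source line for a given minor version (e.g. v1.24).
repo_line() {
  printf 'deb [signed-by=%s] https://pkgs.k8s.io/core:/stable:/%s/deb/ /\n' \
    "$KEYRING" "$1"
}

# Fetch the key, rewrite kubernetes.list, then refresh the apt cache.
switch_k8s_repo() {
  local minor="$1" tmp
  tmp=$(mktemp) || return 1
  # Step 1: get the key; stop here if the version is not published.
  if ! curl -fsSL "https://pkgs.k8s.io/core:/stable:/${minor}/deb/Release.key" -o "$tmp"; then
    rm -f "$tmp"
    echo "no Release.key available for ${minor}" >&2
    return 1
  fi
  sudo mkdir -p "$(dirname "$KEYRING")" &&
    sudo gpg --yes -o "$KEYRING" --dearmor "$tmp" &&
    # Step 2: point kubernetes.list at the chosen minor version.
    repo_line "$minor" | sudo tee "$LIST" >/dev/null &&
    # Step 3: refresh the package index.
    sudo apt-get update
  local rc=$?
  rm -f "$tmp"
  return "$rc"
}

# Example (modifies system files, needs sudo and network access):
#   switch_k8s_repo v1.24
```

Note that tee overwrites kubernetes.list, so any legacy line in that file is dropped at the same time.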

Then the actual question is this.

By changing cnc_version: 6.1 to 7.0, or by manually bodging the lines

    - name: Install kubernetes components for Ubuntu on NVIDIA Cloud Native Core 6.1
      become: true
      when: "cnc_version == 6.1 and ansible_distribution == 'Ubuntu' and 'running' not in k8sup.stdout"
      apt:
        name: ['apt-transport-https', 'curl', 'ca-certificates', 'gnupg-agent' ,'software-properties-common', 'kubelet=1.23.5-00', 'kubeadm=1.23.5-00', 'kubectl=1.23.5-00']
        state: present
        update_cache: true

we can install 1.24!
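For the "manual bodge" route, the version pins could be rewritten with something like the sketch below. The function name is hypothetical and the new version string is deliberately a parameter, since the exact package version has to be looked up first (for example with apt-cache madison kubelet once the 1.24 repo is configured). It prints the modified playbook to stdout rather than editing the file in place, and it does not touch the cnc_version condition in the when: line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a copy of a playbook with the kubelet/kubeadm/
# kubectl pins changed from one version string to another.
repin_k8s() {
  local file="$1" old="$2" new="$3"
  # Escape the dots so "1.23.5-00" is matched literally by the regex.
  local old_re=${old//./\\.}
  sed -E "s/(kubelet|kubeadm|kubectl)=${old_re}/\1=${new}/g" "$file"
}

# Example (NEW_VERSION is a placeholder, not a real package version):
#   repin_k8s cnc/prerequisites.yaml 1.23.5-00 NEW_VERSION > /tmp/prerequisites.yaml
#   diff cnc/prerequisites.yaml /tmp/prerequisites.yaml
```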

Can you please check whether this is advisable, or whether it will break TAO?

Cheers,
Ganindu.

OK, I will try 1.24 as well.
The current failure is due to the removal of the legacy package repositories (see "The legacy Kubernetes package repositories are removed as of 2024-03-04", Issue #3485, kubernetes/release on GitHub), and the migration guide is "pkgs.k8s.io: Introducing Kubernetes Community-Owned Package Repositories" on the Kubernetes site.

I ran the steps below, and now sudo apt-get update works.

$ echo "deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.28/deb/ /" | sudo tee /etc/apt/sources.list.d/kubernetes.list
$ sudo mkdir /etc/apt/keyrings/
$ curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.28/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
$ sudo apt-get update

But it still gets stuck at installation. I will check, sync internally, and update you when I have something. Thanks.

That is v1.28. I’m more than happy to use that, but from memory (and the past trauma of trying to do it myself) I remember having ingress issues when launching containers. The beauty (or danger) of 1.23 is that it just worked, because I think it was 1.24 when they started enforcing some of the rules :P

Got it. Thanks a lot for the info!

No worries, thanks a lot for helping me. This is a note I made for myself when doing that stuff; I think the validating webhooks were a royal pain. Once I figured out Ansible playbooks I never looked back, because with one command from my PC I could install and launch the fully customised cluster, spanning multiple machines with custom namespaces, storage, etc.

Maybe the Kubernetes folks have fixed those issues and v1.28 is now awesome. Please do let me know (as soon as you know) whether that is OK or whether 1.24 is all good. (I use this for AutoML jobs.)

Cheers,
Ganindu.

Sure, I will check further and update you. Thanks.

Hi @ganinduN
The latest version (5.3) is available.
Doc: Setup - NVIDIA Docs.
The helm 5.3 version is in NVIDIA NGC.


Thanks a lot Morgan!
