TAO5 - Deploying signed containers on Kubernetes

With the new updates come new bumps in the road.

With the new Helm chart I can deploy the TAO5 API. I can reach it, ping it, log in, and so on.

The surprise starts when some tasks need to pull other containers.
For example, the dataset convert task tries to pull 5.0.0-tf1.15.5:

detectnet_v2 dataset_convert --results_dir=/shared/users/xxx/datasets/5052fb99-fde5-4871-aabe-0f5f3b128503/28a8bb09-cdeb-4fa0-9f6e-9abbe1c29433/ --output_filename=/shared/users/xxx/datasets/5052fb99-fde5-4871-aabe-0f5f3b128503/tfrecords/tfrecords --verbose --dataset_export_spec=/shared/users/xxx/datasets/5052fb99-fde5-4871-aabe-0f5f3b128503/specs/28a8bb09-cdeb-4fa0-9f6e-9abbe1c29433.protobuf  > /shared/users/xxx/datasets/5052fb99-fde5-4871-aabe-0f5f3b128503/logs/28a8bb09-cdeb-4fa0-9f6e-9abbe1c29433.txt 2>&1 >> /shared/users/xxx/datasets/5052fb99-fde5-4871-aabe-0f5f3b128503/logs/28a8bb09-cdeb-4fa0-9f6e-9abbe1c29433.txt; find /shared/users/xxx/datasets/5052fb99-fde5-4871-aabe-0f5f3b128503/28a8bb09-cdeb-4fa0-9f6e-9abbe1c29433/ -type d | xargs chmod 777; find /shared/users/xxx/datasets/5052fb99-fde5-4871-aabe-0f5f3b128503/28a8bb09-cdeb-4fa0-9f6e-9abbe1c29433/ -type f | xargs chmod 666 /shared/users/xxx/datasets/5052fb99-fde5-4871-aabe-0f5f3b128503/28a8bb09-cdeb-4fa0-9f6e-9abbe1c29433/status.json
nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
Job created 28a8bb09-cdeb-4fa0-9f6e-9abbe1c29433
Toolkit status for 28a8bb09-cdeb-4fa0-9f6e-9abbe1c29433 is 
Job Done: 28a8bb09-cdeb-4fa0-9f6e-9abbe1c29433 Final status: Error

Trying to manually pull the image with containerd:

sudo crictl pull nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
FATA[0672] pulling image: rpc error: code = Unknown desc = failed to pull and unpack image "nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5": failed to copy: httpReadSeeker: failed open: server message: invalid_token: authorization failed
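For reference, containerd does not reuse Docker's stored login, so the usual Kubernetes route is an imagePullSecret referenced by the pod or service account. A sketch, where the secret name "ngc-registry" and the namespace "default" are placeholders:

kubectl create secret docker-registry ngc-registry \
  --namespace default \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password=<Your-API-Key>

Then the pod spec (or service account) references it so the kubelet can authenticate the pull:

spec:
  imagePullSecrets:
    - name: ngc-registry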

Trying to manually verify the image with cosign:
cosign verify --insecure-ignore-tlog --key https://api.ngc.nvidia.com/v2/catalog/containers/public-key nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5

The verification succeeds, but deploying gives the same result.

I tried Kubernetes Connaisseur, but it works worse than the other approaches. It does not recognize the public key when I use it, and always throws this message:
Error from server: admission webhook "connaisseur-svc.connaisseur.svc" denied the request: Failed to find signature in transparency log.

I’m not a Kubernetes expert, and I’m continuously fighting to test our TAO solution, but with the latest versions we are taking steps backwards…

Any help will be appreciated…

Could you try the command below?
$ docker pull nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5

Yes, with Docker the container pulls correctly.

If I use cosign individually on each image, can that verification be used to launch the Pods, or is the other application still necessary?

Maybe some configuration is missing in Connaisseur? Any extra steps about the configuration would be appreciated.

Could you share which .ipynb you are running? I need to check it.
If possible, please share the .ipynb file with us as well. Thanks.

Before starting I double-checked that the new version of the notebooks does not include changes in how the API is managed. I’m using the sample notebooks:
nvidia/tao/tao-getting-started:5.0.0
Sent it by PM.

Also attaching the TF1 run.

$ docker run nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5 

=======================
=== TAO Toolkit TF1 ===
=======================

NVIDIA Release 5.0.0-TF1 (build 52693369)
TAO Toolkit Version 5.0.0

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the TAO Toolkit End User License Agreement.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/tao-toolkit-software-license-agreement

NOTE: The SHMEM allocation limit is set to the default of 64MB.  This may be
   insufficient for TAO Toolkit.  NVIDIA recommends the use of the following flags:
   docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...

Also attaching the events from when the job’s pod is created:

Events:
  Type     Reason   Age                   From     Message
  ----     ------   ----                  ----     -------
  Normal   Pulling  26m (x90 over 17h)    kubelet  Pulling image "nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5"
  Warning  Failed   3m40s (x90 over 17h)  kubelet  Failed to pull image "nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5": rpc error: code = Unknown desc = failed to pull and unpack image "nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5": failed to copy: httpReadSeeker: failed open: server message: invalid_token: authorization failed
  Warning  Failed   3m40s (x91 over 17h)  kubelet  Error: ErrImagePull

From your .ipynb, the training is ongoing. Can you share the correct .ipynb?

Strange, I had manually cleared the old outputs.

Reinstalled the imagePullSecret, with the same result.

Sending you a new one again.

Can’t believe it…

Following these magical instructions:

https://docs.nvidia.com/doca/archive/doca-v1.2/container-deployment/index.html

Extracted from a random post on a random forum… it is working.

In particular, this point is essential…
https://docs.nvidia.com/doca/archive/doca-v1.2/container-deployment/index.html#configure-ngc-credentials-on-bluefield

EDIT: Found the post, without an answer!

Could you share more detailed steps to get it working?

Yes, of course.

Log into NGC via docker.

Start docker’s service
systemctl start docker

Login to NGC via docker (using the user’s API Key)

docker login nvcr.io
username: $oauthtoken # Yes, "$oauthtoken" is the username that should be used
password: <Your-API-Key>

The expected output should roughly include the following:

WARNING! Your password will be stored unencrypted in `/$USER/.docker/config.json`.
…
Login Succeeded

Extract the auth token from docker’s file. In this example it is located at /$USER/.docker/config.json as shown above.
The config file should look roughly like this:

{ 
    "auths": { 
        "nvcr.io": { 
            "auth": "<long hexa-decimal string. This is the token we need.>"
        } 
    } 
}
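For context, and this is standard Docker behavior rather than anything NGC-specific: the "auth" value is simply the base64 encoding of username:password, i.e. of $oauthtoken:<Your-API-Key>. A quick sketch with "abc" as a dummy API key:

```shell
# docker's config.json "auth" field is base64("<username>:<password>").
# "abc" below is a dummy API key, for illustration only.
token=$(printf '%s' '$oauthtoken:abc' | base64)
echo "$token"

# Decoding it gives back the username:password pair:
printf '%s' "$token" | base64 -d
```

Single quotes around '$oauthtoken:abc' matter, so the shell does not try to expand $oauthtoken as a variable.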

Update containerd’s configuration file at /etc/containerd/config.toml:

Remove the comments from the 3 configuration lines.
Insert your auth token as described in the line itself.
Add the nvcr.io auth token (taken from docker’s config.json file above)

...
      [plugins."io.containerd.grpc.v1.cri".registry.mirrors."nvcr.io"]
          endpoint = ["https://nvcr.io"]
      [plugins."io.containerd.grpc.v1.cri".registry.configs]
        [plugins."io.containerd.grpc.v1.cri".registry.configs."nvcr.io".auth]
          auth = "<auth token as copied from docker's config.json file>"
...

Note: The file is extremely sensitive to spaces and indentation. Please make sure to use only spaces (’ '), and to use two spaces per indentation level.
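The copy-paste of the token can also be scripted. A sketch, assuming python3 is available, and using a throwaway file under /tmp here instead of your real ~/.docker/config.json (the "dummy-auth-token" value is a placeholder):

```shell
# Write a sample docker config (stand-in for ~/.docker/config.json).
cat > /tmp/docker-config.json <<'EOF'
{
  "auths": {
    "nvcr.io": {
      "auth": "dummy-auth-token"
    }
  }
}
EOF

# Pull the nvcr.io auth token out with python3's json module.
auth_token=$(python3 -c 'import json; print(json.load(open("/tmp/docker-config.json"))["auths"]["nvcr.io"]["auth"])')
echo "$auth_token"
```

The printed value is what goes into the `auth = "..."` line of /etc/containerd/config.toml.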

Restart the containerd service to apply your changes:
systemctl restart containerd


Thanks a lot!
