TAO5 - Deploying signed containers on Kubernetes

With the new updates come new bumps in the road.

With the new Helm chart I can deploy the TAO5 API. I can reach it, ping it, log in, and so on.

The surprise starts when some tasks need to pull other containers.
For example, the dataset convert task tries to pull 5.0.0-tf1.15.5:

detectnet_v2 dataset_convert --results_dir=/shared/users/xxx/datasets/5052fb99-fde5-4871-aabe-0f5f3b128503/28a8bb09-cdeb-4fa0-9f6e-9abbe1c29433/ --output_filename=/shared/users/xxx/datasets/5052fb99-fde5-4871-aabe-0f5f3b128503/tfrecords/tfrecords --verbose --dataset_export_spec=/shared/users/xxx/datasets/5052fb99-fde5-4871-aabe-0f5f3b128503/specs/28a8bb09-cdeb-4fa0-9f6e-9abbe1c29433.protobuf  > /shared/users/xxx/datasets/5052fb99-fde5-4871-aabe-0f5f3b128503/logs/28a8bb09-cdeb-4fa0-9f6e-9abbe1c29433.txt 2>&1 >> /shared/users/xxx/datasets/5052fb99-fde5-4871-aabe-0f5f3b128503/logs/28a8bb09-cdeb-4fa0-9f6e-9abbe1c29433.txt; find /shared/users/xxx/datasets/5052fb99-fde5-4871-aabe-0f5f3b128503/28a8bb09-cdeb-4fa0-9f6e-9abbe1c29433/ -type d | xargs chmod 777; find /shared/users/xxx/datasets/5052fb99-fde5-4871-aabe-0f5f3b128503/28a8bb09-cdeb-4fa0-9f6e-9abbe1c29433/ -type f | xargs chmod 666 /shared/users/xxx/datasets/5052fb99-fde5-4871-aabe-0f5f3b128503/28a8bb09-cdeb-4fa0-9f6e-9abbe1c29433/status.json
nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
Job created 28a8bb09-cdeb-4fa0-9f6e-9abbe1c29433
Toolkit status for 28a8bb09-cdeb-4fa0-9f6e-9abbe1c29433 is 
Job Done: 28a8bb09-cdeb-4fa0-9f6e-9abbe1c29433 Final status: Error

Trying to manually pull the image with containerd:

sudo crictl pull nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
FATA[0672] pulling image: rpc error: code = Unknown desc = failed to pull and unpack image "nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5": failed to copy: httpReadSeeker: failed open: server message: invalid_token: authorization failed
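For reference, containerd does not reuse Docker's stored login, so the usual Kubernetes route is an imagePullSecret referenced by the pod or service account. A sketch, where the secret name "ngc-registry" and the namespace "default" are placeholders:

kubectl create secret docker-registry ngc-registry \
  --namespace default \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password=<Your-API-Key>

Then the pod spec (or service account) references it so the kubelet can authenticate the pull:

spec:
  imagePullSecrets:
    - name: ngc-registry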

Trying to manually verify the image with cosign:
cosign verify --insecure-ignore-tlog --key https://api.ngc.nvidia.com/v2/catalog/containers/public-key nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5

The verification succeeds, but deploying gives the same result.

I tried Kubernetes Connaisseur, but it works worse than the other approaches. It does not recognize the public key when I use it, and always throws this message:
Error from server: admission webhook "connaisseur-svc.connaisseur.svc" denied the request: Failed to find signature in transparency log.

I’m not a Kubernetes expert, and I’m continuously fighting to test our TAO solution, but with the latest versions we are taking steps backwards…

Any help will be appreciated…

Could you try the command below?
$ docker pull nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5

Yes, with Docker the container pulls correctly.

If I use cosign individually on each image, can that verification be used to launch the Pods, or is the other application still necessary?

Maybe some configuration is missing in Connaisseur? Any extra steps about the configuration would be appreciated.

Could you share which .ipynb you are running? I need to check it.
If possible, please share the .ipynb file with us as well. Thanks.

Before starting I double-checked that the new version of the notebooks does not include changes in how the API is managed. I’m using the sample notebooks:
nvidia/tao/tao-getting-started:5.0.0
Sent it by PM.

Also attaching the TF1 run.

$ docker run nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5 

=======================
=== TAO Toolkit TF1 ===
=======================

NVIDIA Release 5.0.0-TF1 (build 52693369)
TAO Toolkit Version 5.0.0

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the TAO Toolkit End User License Agreement.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/tao-toolkit-software-license-agreement

NOTE: The SHMEM allocation limit is set to the default of 64MB.  This may be
   insufficient for TAO Toolkit.  NVIDIA recommends the use of the following flags:
   docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...

Also attaching the events from when the job’s pod is created:

Events:
  Type     Reason   Age                   From     Message
  ----     ------   ----                  ----     -------
  Normal   Pulling  26m (x90 over 17h)    kubelet  Pulling image "nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5"
  Warning  Failed   3m40s (x90 over 17h)  kubelet  Failed to pull image "nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5": rpc error: code = Unknown desc = failed to pull and unpack image "nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5": failed to copy: httpReadSeeker: failed open: server message: invalid_token: authorization failed
  Warning  Failed   3m40s (x91 over 17h)  kubelet  Error: ErrImagePull

From your .ipynb, the training is ongoing. Can you share the correct .ipynb?

Strange, I had manually cleared the old outputs.

Reinstalled the imagePullSecret, with the same result.

Sending you a new one again.

Can’t believe it…

Following these magical instructions:

https://docs.nvidia.com/doca/archive/doca-v1.2/container-deployment/index.html

Extracted from a random post on a random forum… it is working.

In particular, this point is essential…
https://docs.nvidia.com/doca/archive/doca-v1.2/container-deployment/index.html#configure-ngc-credentials-on-bluefield

EDIT: Found the post, without an answer!

Could you share more detailed steps to get it working?

Yes, of course.

Log into NGC via docker.

Start docker’s service
systemctl start docker

Login to NGC via docker (using the user’s API Key)

docker login nvcr.io
username: $oauthtoken # Yes, "$oauthtoken" is the username that should be used
password: <Your-API-Key>

The expected output should roughly include the following:

WARNING! Your password will be stored unencrypted in `/$USER/.docker/config.json`.
…
Login Succeeded

Extract the auth token from docker’s file. In this example it is located at /$USER/.docker/config.json as shown above.
The config file should look roughly like this:

{ 
    "auths": { 
        "nvcr.io": { 
            "auth": "<long hexa-decimal string. This is the token we need.>"
        } 
    } 
}
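For context, and this is standard Docker behavior rather than anything NGC-specific: the "auth" value is simply the base64 encoding of username:password, i.e. of $oauthtoken:<Your-API-Key>. A quick sketch with "abc" as a dummy API key:

```shell
# docker's config.json "auth" field is base64("<username>:<password>").
# "abc" below is a dummy API key, for illustration only.
token=$(printf '%s' '$oauthtoken:abc' | base64)
echo "$token"

# Decoding it gives back the username:password pair:
printf '%s' "$token" | base64 -d
```

Single quotes around '$oauthtoken:abc' matter, so the shell does not try to expand $oauthtoken as a variable.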

Update containerd’s configuration file at /etc/containerd/config.toml:

Remove the comments from the 3 configuration lines.
Insert your auth token as described in the line itself.
Add the nvcr.io auth token (taken from docker’s config.json file above)

...
      [plugins."io.containerd.grpc.v1.cri".registry.mirrors."nvcr.io"]
          endpoint = ["https://nvcr.io"]
      [plugins."io.containerd.grpc.v1.cri".registry.configs]
        [plugins."io.containerd.grpc.v1.cri".registry.configs."nvcr.io".auth]
          auth = "<auth token as copied from docker's config.json file>"
...

Note: The file is extremely sensitive to spaces and indentation. Please make sure to use only spaces (’ '), and to use two spaces per indentation level.
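The copy-paste of the token can also be scripted. A sketch, assuming python3 is available, and using a throwaway file under /tmp here instead of your real ~/.docker/config.json (the "dummy-auth-token" value is a placeholder):

```shell
# Write a sample docker config (stand-in for ~/.docker/config.json).
cat > /tmp/docker-config.json <<'EOF'
{
  "auths": {
    "nvcr.io": {
      "auth": "dummy-auth-token"
    }
  }
}
EOF

# Pull the nvcr.io auth token out with python3's json module.
auth_token=$(python3 -c 'import json; print(json.load(open("/tmp/docker-config.json"))["auths"]["nvcr.io"]["auth"])')
echo "$auth_token"
```

The printed value is what goes into the `auth = "..."` line of /etc/containerd/config.toml.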

Restart the containerd service to apply your changes:
systemctl restart containerd


Thanks a lot!
