We have deployed NVIDIA TAO 5.3 API on an EKS cluster and successfully trained various models.
I noticed in the TAO source code (and kubectl logs) that base experiment files are pulled from NGC at the time of PTM attachment to the experiment.
We would like to use TAO in a setting where there is no network internet access.
- How can we fetch pretrained networks and store them locally in advance?
- After deploying TAO in the no-network environment, is there a way to tell TAO to look for PTMs (i.e. base_experiment files) locally, so it doesn't try to reach out to the NGC server?
Officially, there is no guide for training without network access. You can try some experiments to check/enable it. When you run a training today, I think the PTM is already available in a local path somewhere. Please check and find it. Then you can work out how to make it work in all cells.
Refer to TAO API 5.3 : How to create experiments that leverage pretrained base_experiments from NGC? - #13 by Morganh.
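For example, one quick way to look for the cached PTM is to search the shared volume from inside the API pod. This is only a sketch; the path and file extensions are the ones mentioned later in this thread, not an official layout:

```
# Illustrative only: search the shared volume of the API pod for already-downloaded
# PTM files. Pod name and extensions are examples, not an official layout.
POD=$(kubectl get pods -o name | grep tao-api-app-pod | head -n 1)
kubectl exec "$POD" -- \
  find /shared/users -type f \( -name "*.hdf5" -o -name "*.tlt" -o -name "*.nemo" \)
```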
I will check - thanks.
Would we necessarily need to modify the TAO frontend container’s source code prior to deployment to enable this? I’d like to avoid modifying the TAO source code if possible.
Not needed. I pointed to the link just in case you are going to debug further. The code is available inside the docker as well.
Thanks. I was concerned because in the source code there are many places where requests go out to the NGC server during a base_experiment pull.
I was worried about needing to untangle all of that logic to prevent those requests from being made, so hopefully that's not the case.
How about TAO API key authentication? How can that be done offline? Currently we issue new API keys on the NGC portal, and NGC auth would be inaccessible in an airgapped environment.
Ensure that all necessary models and datasets are downloaded before entering the airgapped environment. This way, you won’t need to access NGC during your experiments.
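For example, a PTM can be pre-fetched with the NGC CLI while still online; the model name below is just the ResNet-18 object-detection example discussed later in this thread:

```
# Run while network access is still available, then transfer the files into the
# airgapped environment (model name/version is just the ResNet-18 example here).
ngc registry model download-version "nvidia/tao/pretrained_object_detection:resnet18"
```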
Thanks @Morganh . For experiments, we’ll pre-fetch any models / datasets needed.
Any advice for the API key auth issue? Can we somehow configure TAO API to use a pre-fetched JWT token and user ID, and bypass the need for internet access for auth?
It is possible. Actually, two years ago we did the same thing, but for another network (multi-class classification). You can refer to it.
Attach the .ipynb.
Copy_of_multiclass_classification_without_network.ipynb (692.5 KB)
You could manually download the PTM file from outside the proxy using the NGC website, and copy it to the right path in the shared drive. This would avoid the internal use of the NGC CLI.
The ptm_id is retrieved in your notebook when setting model experiment metadata.
The shared drive path for the downloaded PTM would be, for example:
/shared/users/00000000-0000-0000-0000-000000000000/models/<ptm_id>/pretrained_object_detection_vresnet18/resnet_18.hdf5
The filename can be anything with a .tlt or .hdf5 or .nemo extension.
You can access the shared drive inside the API pod container. Or mount it as documented in the TAO remote client notebooks.
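As a rough sketch, assuming the example path above (the user UUID, ptm_id and directory/file names are placeholders to adjust for your deployment):

```
# Sketch: place a manually downloaded PTM where the API pod expects it.
# <user_id> and <ptm_id> are placeholders taken from the example path above.
DEST=/shared/users/<user_id>/models/<ptm_id>/pretrained_object_detection_vresnet18
mkdir -p "$DEST"
cp resnet_18.hdf5 "$DEST"/
```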
This is great info - thanks. I experimented with this earlier, noticing that at the time of resnet PTM assignment to a TAO experiment, the base_experiment model file is pulled from NGC registry and stored in that /shared/users/000…. path on the cluster.
My next question is - to do any work with TAO API, I’ve noticed that API auth is first required. Upon login with API key, we get back a user_id and JWT token, which is then used for subsequent API calls (experiments, datasets, jobs, etc).
Of course in an airgapped setting this initial auth process isn’t possible since there’s no internet access to NGC server to retrieve a token.
Separate from the PTM issue - once offline, is there a way to bypass this TAO auth issue?
My thinking was:
- intercept the auth requests to NGC and instead produce our own valid token for API calls ( not sure how we’d do this)
- modify the TAO frontend container auth source code ( though, there are many places where NGC requests are made, so doubt this is the best way)
- generate a token by calling auth while online, then manually pass these credentials to our airgapped system & store them in the offline deployment (but then, wouldn’t the token expire / become invalid?)
What is your recommendation?
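For illustration, the third option might look roughly like the sketch below. The login route and response fields are assumptions based on how the TAO remote client notebooks authenticate, and whether the cached token stays valid offline is exactly the open question here:

```
# Hypothetical sketch of option 3 (requires jq). Endpoints are assumptions, not a
# verified offline workflow.

# While online: log in once and save the response (user_id + token).
curl -s "https://<tao-api-host>/api/v1/login/<NGC_API_KEY>" -o tao_credentials.json

# After copying tao_credentials.json into the airgapped deployment:
USER_ID=$(jq -r '.user_id' tao_credentials.json)
TOKEN=$(jq -r '.token' tao_credentials.json)

# Reuse the cached token on subsequent API calls, e.g. listing experiments
# (this path is an assumption for illustration only).
curl -s -H "Authorization: Bearer ${TOKEN}" \
  "https://<tao-api-host>/api/v1/users/${USER_ID}/experiments"
```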
You can take a look at the notebook shared above. It is a multi-class classification notebook for running without network access.
Got it. Any idea where a sessions.yaml file (mentioned in the notebook to store the offline user_id + token) would exist in TAO 5.3 deployments, @Morganh?
I kubectl exec'd into tao-api-app-pod-5fc6bf9494-fz2vx and looked in the /shared/users/ volume. I see other things you referenced earlier, like the shared drive path for downloaded PTMs, but nothing for sessions.yaml anywhere in the API pod container.
1st screenshot is from the notebook you linked.
2nd is from accessing the shared drive inside the API pod container.
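For reference, the search was roughly the following (pod name from this deployment); nothing matched:

```
# Roughly what was run to look for sessions.yaml inside the API pod.
kubectl exec tao-api-app-pod-5fc6bf9494-fz2vx -- \
  sh -c 'find /shared -name sessions.yaml 2>/dev/null'
```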
Any status or logs for the pods?
Hi, currently the TAO API does not support running offline. I need to create a feature request for the internal team.
@Morganh thanks for the confirmation. Feature request to the internal team would be great since this is a feature we would really like to use (offline auth, offline training).
In the meantime, do you have any advice on how to proceed with the authentication problem?
@Morganh - quick follow up question - does TAO Toolkit v5.3.0 CLI version (not the API) support offline use?
I’m looking at : tao_tutorials/notebooks/tao_launcher_starter_kit/yolo_v4/yolo_v4.ipynb at main · NVIDIA/tao_tutorials · GitHub
Assuming we pull the pretrained model from NGC in advance and also pull the docker containers in advance – e.g. the containers invoked by tao model yolo_v4 train, tao model yolo_v4 evaluate, etc. – will we be able to use those containers in an offline setting, or is there still some dependency on NGC at runtime?
Yes, previously we had a guide for it: tao_toolkit_recipes/tao_training_without_network/Guide.md at main · NVIDIA-AI-IOT/tao_toolkit_recipes · GitHub. You can refer to it. BTW, it is a bit old; just take a look to understand its workflow.
Thanks! Can we just use the notebook code offline (assuming desired weights + all containers are already pulled) or would that fail for some reason?
In the above-mentioned guide, you will log in to one TAO docker. That means you are using docker run instead of the tao-launcher. This is different from the notebook.
After logging into the docker via docker run, you just need to start training via $yolo_v4 train xxx instead of tao model yolo_v4 train xxx.
Yes, it can run offline.
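A minimal sketch of that flow, with the image tag, mount points and spec paths as placeholders (check the recipe guide for the exact container to use):

```
# Sketch only: log in to the pre-pulled TAO container and run the network
# entrypoint directly. Image tag, mounts and spec paths are placeholders.
docker run --gpus all -it --rm \
  -v /local/workspace:/workspace \
  nvcr.io/nvidia/tao/tao-toolkit:<tf1-tag> /bin/bash

# Inside the container, skip the launcher and call the entrypoint directly:
yolo_v4 train -e /workspace/specs/yolo_v4_train.txt -r /workspace/results -k <encryption_key>
```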
Thank you @Morganh - this recipe is working quite well. However, I notice this solution requires a docker container within another docker container – an outer container to package up the TAO dependencies + TAO backend containers, and an inner container that runs when a tao model xx command is executed.
If we'd like to do offline training without nested dockers – meaning run tao model xx at the outermost level – is there a way to build the TAO launcher container from source, given that in an offline setting we can't rely on installing with pip3 install nvidia-tao?
Actually the tao-launcher is a wrapper for docker run. You can refer to the source code at GitHub - NVIDIA/tao_launcher: Lightweight Python based CLI application to run TAO Toolkit. So, we can use docker run to trigger any TAO training. The above-mentioned recipe does not involve the tao-launcher; it only uses the docker run method. You can ignore the tao-launcher.
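In other words, something like the following one-shot docker run replaces the launcher (image tag and paths are again placeholders, not a verified command for your deployment):

```
# Sketch: what the launcher effectively does, a single docker run that invokes
# the network entrypoint in one shot. Image tag and paths are placeholders.
docker run --gpus all --rm \
  -v /local/workspace:/workspace \
  nvcr.io/nvidia/tao/tao-toolkit:<tf1-tag> \
  yolo_v4 train -e /workspace/specs/yolo_v4_train.txt -r /workspace/results -k <encryption_key>
```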