TAO API 5.3 : How to create experiments that leverage pretrained base_experiments from NGC?

Hello,

  • I have TAO API 5.3 deployed & running on an AWS EKS instance with T4 GPUs.

  • I’m following the TAO API Object Detection workflow notebook to leverage the API for object detection tasks, with changes to train a custom yolo_v4 model using the pretrained_object_detection:resnet18 backbone.

  • I follow all the steps as in the notebook – create the experiment, upload & convert & assign the datasets, and assign the PTM to the experiment (a rough sketch of these calls is included right after this list).

  • I assign the PTM nvidia/tao/pretrained_object_detection:resnet18, which has experiment ID 45a2921e-064d-58d3-9579-17f5326ba5db, to my new experiment.

  • This all works fine via the API so far.
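
For reference, the sequence of calls looks roughly like this. This is a curl sketch rather than the exact notebook code – the /experiments paths, HTTP verbs, and payload fields are my approximation based on the API base URL shown later in this post, so treat them as illustrative only:

# Base URL and login token as obtained in the notebook (Bearer-token auth is an assumption here)
$ BASE=https://<TAO_IP>/tao-api/api/v1/users/<user_id>
$ # 1. Create the experiment for yolo_v4
$ curl -sk -X POST "$BASE/experiments" \
    -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
    -d '{"network_arch": "yolo_v4"}'
$ # 2. Assign datasets and the resnet18 PTM by updating the experiment metadata
$ curl -sk -X PATCH "$BASE/experiments/<experiment_id>" \
    -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
    -d '{"train_datasets": ["<train_dataset_id>"], "base_experiment": ["45a2921e-064d-58d3-9579-17f5326ba5db"]}'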

However, when I make an API call to start training and view the training job’s logs, I see in the logs:

Base Experiment file for ID 45a2921e-064d-58d3-9579-17f5326ba5db is not found
  • This ID is the PTM ID corresponding to NGC PTM nvidia/tao/pretrained_object_detection:resnet18. When I retrieve this base experiment’s JSON and examine its fields, I notice:
 base_experiment: []
 base_experiment_pull_complete: "not_present"

Before training, I’ve fetched my custom-created experiment’s spec and confirmed that my experiment has the resnet18 PTM assigned, i.e.:

 base_experiment: ['45a2921e-064d-58d3-9579-17f5326ba5db']
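
For reference, this is roughly how I retrieve and compare the two records – a curl/jq sketch assuming the same Bearer-token auth and GET-by-ID route as above; the exact path may differ from the notebook:

$ BASE=https://<TAO_IP>/tao-api/api/v1/users/<user_id>
$ # PTM (base experiment) record: shows base_experiment [] and base_experiment_pull_complete "not_present"
$ curl -sk -H "Authorization: Bearer $TOKEN" \
    "$BASE/experiments/45a2921e-064d-58d3-9579-17f5326ba5db" \
    | jq '{base_experiment, base_experiment_pull_complete, ngc_path}'
$ # My custom experiment: shows base_experiment with the PTM ID assigned
$ curl -sk -H "Authorization: Bearer $TOKEN" \
    "$BASE/experiments/<my_experiment_id>" \
    | jq '{base_experiment, base_experiment_pull_complete}'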

Question #1: Do I have to somehow pull the resnet18 backbone from the NGC registry to my AWS server (where TAO API is deployed) in order for training to work? I know that in the Docker container version of TAO you can download it via ngc registry model download-version nvidia/tao/pretrained_object_detection:resnet18 – is there an equivalent API call I must make to get the resnet18 backbone weights available to user-created experiments?

Question #2: My understanding is that I’d need to repeat this for every pretrained model we’d like to use from NGC – is this correct?

  • Also, I tried using the /experiments:base API endpoint to “List Experiments that can be used for transfer learning”, wanting to see whether the resnet18 backbone is present there. But this endpoint returns a 404 error.

  • Endpoint is https://<TAO_IP>/tao-api/api/v1/users/<user_id>/experiments:base

  • Error response is:

<Response [404]>

{'code': 404, 'name': 'Not Found', 'description': 'The requested URL was not found on the server. If you entered the URL manually please check your spelling and try again.'}

Question #3: Is this a known bug with the /experiments:base API endpoint, and/or is this endpoint still supported in TAO 5.3?

Many thanks!!

Also, here is the full JSON response from querying the training job. You can see the job status remains Pending:

<Response [200]>
{'action': 'train', 'created_on': '2024-10-16T23:43:13.219758', 'description': '', 'experiment_id': '892d15bf-7b4c-414e-b48e-9fe84cb37f0a', 'id': '87508c2c-9539-4d02-81c8-da122f548a5a', 'last_modified': '2024-10-16T23:43:25.645032', 'name': '', 'parent_id': '667550d3-00c1-42ed-9fcd-8b82ccaec5ab', 'result': {'detailed_status': {'message': 'Base Experiment file for ID 45a2921e-064d-58d3-9579-17f5326ba5db is not found'}}, 'specs': {'augmentation_config': {'exposure': 1.5, 'horizontal_flip': 0.5, 'hue': 0.1, 'jitter': 0.3, 'mosaic_min_ratio': 0.2, 'mosaic_prob': 0.5, 'output_channel': 3, 'output_height': 736, 'output_width': 1280, 'randomize_input_shape_period': 0, 'saturation': 1.5, 'vertical_flip': 0}, 'dataset_config': {'image_extension': 'png', 'include_difficult_in_training': False, 'is_monochrome': False, 'target_class_mapping': [{'key': 'plane', 'value': 'plane'}], 'type': 'kitti', 'validation_fold': 0}, 'eval_config': {'average_precision_mode': '__SAMPLE__', 'batch_size': 4, 'matching_iou_threshold': 0.5}, 'gpus': 1, 'nms_config': {'clustering_iou_threshold': 0.5, 'confidence_threshold': 0.001, 'force_on_cpu': True, 'top_k': 200}, 'random_seed': 42, 'training_config': {'batch_size_per_gpu': 8, 'checkpoint_interval': 10, 'enable_qat': False, 'learning_rate': {'soft_start_annealing_schedule': {'annealing': 0.5, 'max_learning_rate': 0.0001, 'min_learning_rate': 1e-06, 'soft_start': 0.1}}, 'max_queue_size': 3, 'model_ema': False, 'n_workers': 4, 'num_epochs': 10, 'optimizer': {'adam': {'amsgrad': False, 'beta1': 0.9, 'beta2': 0.999, 'epsilon': 1e-07}}, 'regularizer': {'type': '__L1__', 'weight': 3e-05}, 'use_multiprocessing': False}, 'use_amp': False, 'version': 1, 'yolov4_config': {'arch': 'resnet', 'big_anchor_shape': '[(114.94,60.67),(159.06,114.59),(297.59,176.38)]', 'big_grid_xy_extend': 0.05, 'box_matching_iou': 0.25, 'force_relu': False, 'freeze_bn': False, 'label_smoothing': 0, 'loss_class_weights': 1, 'loss_loc_weight': 1, 'loss_neg_obj_weights': 1, 'matching_neutral_box_iou': 0.5, 'mid_anchor_shape': '[(42.99,31.91),(79.57,31.75),(56.80,56.93)]', 'mid_grid_xy_extend': 0.1, 'nlayers': 18, 'small_anchor_shape': '[(15.60,13.88),(30.25,20.25),(20.67,49.63)]', 'small_grid_xy_extend': 0.2}}, 'status': 'Pending'}
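
(That response comes from polling the train job, roughly as below – a curl sketch where the jobs route is inferred from the experiment_id and id fields in the JSON above, so it may not match the notebook exactly:)

$ BASE=https://<TAO_IP>/tao-api/api/v1/users/<user_id>
$ curl -sk -H "Authorization: Bearer $TOKEN" \
    "$BASE/experiments/892d15bf-7b4c-414e-b48e-9fe84cb37f0a/jobs/87508c2c-9539-4d02-81c8-da122f548a5a" \
    | jq '{status, message: .result.detailed_status.message}'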

It is not needed.

If you are using a different pretrained model, please assign the corresponding pretrained model as you did in the notebook cell.

Could you please share your notebook?

Thanks –

It is not needed.

Any clues as to why the pretrained resnet18’s base experiment is empty / “not_present” by default, then? I assigned the pretrained resnet18 to my experiment as in the notebook, but I feel like I am missing some step – hence the error.

Attaching retrieved specs for both pretrained resnet18 and my custom experiment (which has resnet18 assigned), to show the difference.


I’ll share the notebook code later today when I have access. It makes the exact same REST API calls as the sample notebook; the only difference is that I use yolo_v4 instead of detectnet_v2 for the model name.

Do you mean that base_experiment is not empty before training, but becomes empty once you start training?

It’s empty both before and after I start training.

So, may I know how you got this screenshot? Do you mean it works with the default notebook, which sets the detectnet_v2 PTM?

  • Via the REST API, I created a new experiment, assigned datasets to it, and assigned the nvidia/tao/pretrained_object_detection:resnet18 PTM to this experiment. That is why base_experiment in the screenshot is 45a2921e-064d-58d3-9579-17f5326ba5db – that is the PTM ID for pretrained_object_detection:resnet18.

  • The screenshot is the retrieved experiment spec of my custom experiment, once PTM/datasets are assigned.

  • Training also doesn’t work when assigning the detectnet_v2 PTM to a user-created experiment – exact same error, status Pending in the logs, and the same behavior where detectnet_v2:resnet18’s base_experiment is empty and “not_present” (both before and after the “train” action is POSTed via the API). So the issue does not seem unique to yolo_v4.

Please check
$ kubectl get pods

Also, could you please share the logs for the “pending” pod?
Other logs are also appreciated, such as:
$ kubectl logs -f tao-toolkit-api-workflow-xxx
$ kubectl logs -f tao-toolkit-api-app-pod-xxx
$ kubectl describe pod xxx
$ kubectl describe pod tao-toolkit-api-workflow-xxx
$ kubectl describe pod tao-toolkit-api-app-pod-xxx
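
If the exact pod names are unclear, something like the following can help narrow them down (the namespace and names depend on your deployment):

$ # List all pods across namespaces and filter for the TAO API components
$ kubectl get pods -A | grep -i tao
$ # Recent events often surface image-pull or scheduling failures
$ kubectl get events -A --sort-by=.lastTimestamp | tail -n 30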

I’ll have my colleague run those kubectl commands in a few hours and share output with you, thank you!

FYI “pending” was referring to the status of the TAO training job – i.e. 'status': 'Pending' (I believe job statuses can be Pending, Running, or Done?) – not a Pending status on any of the pods. See my earlier comment with the full training job JSON.

My colleague actually fixed the issue. He re-deployed the TAO API after fixing a CronJob error we had been seeing, where pulling an image from nvcr.io was failing because of a missing secret. This in turn was causing TAO to fail to pull the resnet18 backbone via the assigned PTM’s ngc_path.
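
For anyone who hits the same thing, a few generic checks that surface this kind of failure (the namespace and resource names below are placeholders, not the exact ones from our cluster):

$ # Check whether the TAO CronJob(s) and the jobs they spawn are completing
$ kubectl get cronjobs -A
$ kubectl get jobs -A | grep -i tao
$ # Image-pull failures caused by a missing registry secret show up in events
$ kubectl get events -A | grep -iE "imagepull|secret"
$ # Confirm the NGC image-pull secret exists in the TAO namespace
$ kubectl get secrets -n <tao_namespace>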

Training is running now.

Thanks for your help!

Thanks for the info. Glad to know the training is running now.

@Morganh reopening this to ask a follow-up question: How would this work with offline training, meaning if we wanted to deploy TAO API on bare-metal hardware that isn’t internet-connected?

In that case we wouldn’t have network access to the NGC registry to pull the pretrained networks. Could we pull them all in advance, store them on the hardware where TAO is deployed, and then reference them locally (instead of using ngc_path) at PTM assignment and training time?

Officially, there is no guide for training without network access. You can try some experiments to check/enable it. When you run your current training, I think the PTM is available in a local path somewhere. Please try to check and find it. Then you can check how to make it work in all the cells.
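
One way to look for where the pulled PTM lands is to search the storage mounted by the TAO API pods – a rough sketch below; the namespace, pod name, and file pattern are guesses and depend on your deployment:

$ # Find the TAO API app pod, then search its filesystem/mounts for the downloaded backbone weights
$ kubectl get pods -A | grep -i tao
$ kubectl exec -n <tao_namespace> <tao-api-app-pod> -- \
    sh -c 'find / -iname "*resnet18*" 2>/dev/null'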

Got it – thank you. I was looking at this tutorial and trying to determine if it’s still up to date (though it uses Docker instead of the TAO API, so it’s perhaps not relevant for my use case).

Yes, it is not for TAO-API. It is for TAO docker or TAO launcher.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.