TAO API 5.3 : How to create experiments that leverage pretrained base_experiments from NGC?

Hello,

  • I have TAO API 5.3 deployed & running on an AWS EKS instance with T4 GPUs.

  • I’m following the TAO API Object Detection workflow notebook to leverage the API for object detection tasks, with changes to train a custom yolo_v4 model using the pretrained_object_detection:resnet18 backbone.

  • I follow all the steps as in the notebook – create the experiment, upload & convert & assign the datasets, and assign the PTM to the experiment (a rough sketch of these calls is included right after this list).

  • I assign the PTM nvidia/tao/pretrained_object_detection:resnet18, which has experiment ID 45a2921e-064d-58d3-9579-17f5326ba5db, to my new experiment.

  • This all works fine via the API so far.
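
For reference, the sequence of calls looks roughly like this. This is a curl sketch rather than the exact notebook code – the /experiments paths, HTTP verbs, and payload fields are my approximation based on the API base URL shown later in this post, so treat them as illustrative only:

# Base URL and login token as obtained in the notebook (Bearer-token auth is an assumption here)
$ BASE=https://<TAO_IP>/tao-api/api/v1/users/<user_id>
$ # 1. Create the experiment for yolo_v4
$ curl -sk -X POST "$BASE/experiments" \
    -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
    -d '{"network_arch": "yolo_v4"}'
$ # 2. Assign datasets and the resnet18 PTM by updating the experiment metadata
$ curl -sk -X PATCH "$BASE/experiments/<experiment_id>" \
    -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
    -d '{"train_datasets": ["<train_dataset_id>"], "base_experiment": ["45a2921e-064d-58d3-9579-17f5326ba5db"]}'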

However, when I make an API call to start training and view the training job’s logs, I see in the logs:

Base Experiment file for ID 45a2921e-064d-58d3-9579-17f5326ba5db is not found
  • This ID is the PTM ID corresponding to NGC PTM nvidia/tao/pretrained_object_detection:resnet18. When I retrieve this base experiment’s JSON and examine its fields, I notice:
 base_experiment: []
 base_experiment_pull_complete: "not_present"

Before training, I’ve fetched my custom-created experiment’s spec and confirmed that my experiment has the resnet18 PTM assigned, i.e.:

 base_experiment: ['45a2921e-064d-58d3-9579-17f5326ba5db']
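
For reference, this is roughly how I retrieve and compare the two records – a curl/jq sketch assuming the same Bearer-token auth and GET-by-ID route as above; the exact path may differ from the notebook:

$ BASE=https://<TAO_IP>/tao-api/api/v1/users/<user_id>
$ # PTM (base experiment) record: shows base_experiment [] and base_experiment_pull_complete "not_present"
$ curl -sk -H "Authorization: Bearer $TOKEN" \
    "$BASE/experiments/45a2921e-064d-58d3-9579-17f5326ba5db" \
    | jq '{base_experiment, base_experiment_pull_complete, ngc_path}'
$ # My custom experiment: shows base_experiment with the PTM ID assigned
$ curl -sk -H "Authorization: Bearer $TOKEN" \
    "$BASE/experiments/<my_experiment_id>" \
    | jq '{base_experiment, base_experiment_pull_complete}'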

Question #1: Do I have to somehow pull the resnet18 backbone from the NGC registry to my AWS server (where TAO API is deployed) in order for training to work? I know that in the Docker container version of TAO you can download it via ngc registry model download-version nvidia/tao/pretrained_object_detection:resnet18 – is there an equivalent API call I must make to get the resnet18 backbone weights available to user-created experiments?

Question #2: My understanding is that I’d need to repeat this for every pretrained model we’d like to use from NGC – is this correct?

  • Also, I tried using the /experiments:base API endpoint to “List Experiments that can be used for transfer learning”, wanting to see whether the resnet18 backbone is present there. But this endpoint returns a 404 error.

  • Endpoint is https://<TAO_IP>/tao-api/api/v1/users/<user_id>/experiments:base

  • Error response is:

<Response [404]>

{'code': 404, 'name': 'Not Found', 'description': 'The requested URL was not found on the server. If you entered the URL manually please check your spelling and try again.'}

Question #3: Is this a known bug with the /experiments:base API endpoint, and/or is this endpoint still supported in TAO 5.3?

Many thanks!!

Also, here is the full JSON response from querying the training job. You can see the job status remains Pending:

<Response [200]>
{'action': 'train', 'created_on': '2024-10-16T23:43:13.219758', 'description': '', 'experiment_id': '892d15bf-7b4c-414e-b48e-9fe84cb37f0a', 'id': '87508c2c-9539-4d02-81c8-da122f548a5a', 'last_modified': '2024-10-16T23:43:25.645032', 'name': '', 'parent_id': '667550d3-00c1-42ed-9fcd-8b82ccaec5ab', 'result': {'detailed_status': {'message': 'Base Experiment file for ID 45a2921e-064d-58d3-9579-17f5326ba5db is not found'}}, 'specs': {'augmentation_config': {'exposure': 1.5, 'horizontal_flip': 0.5, 'hue': 0.1, 'jitter': 0.3, 'mosaic_min_ratio': 0.2, 'mosaic_prob': 0.5, 'output_channel': 3, 'output_height': 736, 'output_width': 1280, 'randomize_input_shape_period': 0, 'saturation': 1.5, 'vertical_flip': 0}, 'dataset_config': {'image_extension': 'png', 'include_difficult_in_training': False, 'is_monochrome': False, 'target_class_mapping': [{'key': 'plane', 'value': 'plane'}], 'type': 'kitti', 'validation_fold': 0}, 'eval_config': {'average_precision_mode': '__SAMPLE__', 'batch_size': 4, 'matching_iou_threshold': 0.5}, 'gpus': 1, 'nms_config': {'clustering_iou_threshold': 0.5, 'confidence_threshold': 0.001, 'force_on_cpu': True, 'top_k': 200}, 'random_seed': 42, 'training_config': {'batch_size_per_gpu': 8, 'checkpoint_interval': 10, 'enable_qat': False, 'learning_rate': {'soft_start_annealing_schedule': {'annealing': 0.5, 'max_learning_rate': 0.0001, 'min_learning_rate': 1e-06, 'soft_start': 0.1}}, 'max_queue_size': 3, 'model_ema': False, 'n_workers': 4, 'num_epochs': 10, 'optimizer': {'adam': {'amsgrad': False, 'beta1': 0.9, 'beta2': 0.999, 'epsilon': 1e-07}}, 'regularizer': {'type': '__L1__', 'weight': 3e-05}, 'use_multiprocessing': False}, 'use_amp': False, 'version': 1, 'yolov4_config': {'arch': 'resnet', 'big_anchor_shape': '[(114.94,60.67),(159.06,114.59),(297.59,176.38)]', 'big_grid_xy_extend': 0.05, 'box_matching_iou': 0.25, 'force_relu': False, 'freeze_bn': False, 'label_smoothing': 0, 'loss_class_weights': 1, 'loss_loc_weight': 1, 'loss_neg_obj_weights': 1, 'matching_neutral_box_iou': 0.5, 'mid_anchor_shape': '[(42.99,31.91),(79.57,31.75),(56.80,56.93)]', 'mid_grid_xy_extend': 0.1, 'nlayers': 18, 'small_anchor_shape': '[(15.60,13.88),(30.25,20.25),(20.67,49.63)]', 'small_grid_xy_extend': 0.2}}, 'status': 'Pending'}
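
(That response comes from polling the train job, roughly as below – a curl sketch where the jobs route is inferred from the experiment_id and id fields in the JSON above, so it may not match the notebook exactly:)

$ BASE=https://<TAO_IP>/tao-api/api/v1/users/<user_id>
$ curl -sk -H "Authorization: Bearer $TOKEN" \
    "$BASE/experiments/892d15bf-7b4c-414e-b48e-9fe84cb37f0a/jobs/87508c2c-9539-4d02-81c8-da122f548a5a" \
    | jq '{status, message: .result.detailed_status.message}'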

It is not needed.

If you are using a different pretrained model, please assign the corresponding pretrained model as you did in the notebook cell.

Could you please share your notebook?

Thanks –

It is not needed.

Any clues as to why the pretrained resnet18’s base experiment is empty / “not_present” by default, then? I assigned the pretrained resnet18 to my experiment as in the notebook, but I feel like I am missing some step – hence the error.

Attaching retrieved specs for both pretrained resnet18 and my custom experiment (which has resnet18 assigned), to show the difference.


I’ll share the notebook code later today when I have access. It makes the exact same REST API calls as the sample notebook; the only difference is that I use yolo_v4 instead of detectnet_v2 for the model name.

Do you mean that base_experiment is not empty before training, but becomes empty once you start training?

It’s empty both before and after I start training.

So, may I know how you got this screenshot? Do you mean it works with the default notebook, which sets the detectnet_v2 PTM?

  • Via the REST API, I created a new experiment, assigned datasets to it, and assigned the nvidia/tao/pretrained_object_detection:resnet18 PTM to this experiment. That is why base_experiment in the screenshot is 45a2921e-064d-58d3-9579-17f5326ba5db – that is the PTM ID for pretrained_object_detection:resnet18.

  • The screenshot is the retrieved experiment spec of my custom experiment, once PTM/datasets are assigned.

  • Training also doesn’t work when assigning the detectnet_v2 PTM to a user-created experiment – exact same error, status Pending in the logs, and the same behavior where detectnet_v2:resnet18’s base_experiment is empty and “not_present” (both before and after the “train” action is POSTed via the API). So the issue does not seem unique to yolo_v4.

Please check
$ kubectl get pods

Also, could you please share the logs for the “pending” pod?
Other logs are also appreciated, such as:
$ kubectl logs -f tao-toolkit-api-workflow-xxx
$ kubectl logs -f tao-toolkit-api-app-pod-xxx
$ kubectl describe pod xxx
$ kubectl describe pod tao-toolkit-api-workflow-xxx
$ kubectl describe pod tao-toolkit-api-app-pod-xxx
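
If the exact pod names are unclear, something like the following can help narrow them down (the namespace and names depend on your deployment):

$ # List all pods across namespaces and filter for the TAO API components
$ kubectl get pods -A | grep -i tao
$ # Recent events often surface image-pull or scheduling failures
$ kubectl get events -A --sort-by=.lastTimestamp | tail -n 30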

I’ll have my colleague run those kubectl commands in a few hours and share output with you, thank you!

FYI “pending” was referring to the status of the TAO training job – i.e. 'status': 'Pending' (I believe job statuses can be Pending, Running, or Done?) – not a Pending status on any of the pods. See my earlier comment with the full training job JSON.

My colleague actually fixed the issue. He re-deployed the TAO API after fixing a CronJob error we had been seeing, where pulling an image from nvcr.io was failing because of a missing secret. This in turn was causing TAO to fail to pull the resnet18 backbone via the assigned PTM’s ngc_path.
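
For anyone who hits the same thing, a few generic checks that surface this kind of failure (the namespace and resource names below are placeholders, not the exact ones from our cluster):

$ # Check whether the TAO CronJob(s) and the jobs they spawn are completing
$ kubectl get cronjobs -A
$ kubectl get jobs -A | grep -i tao
$ # Image-pull failures caused by a missing registry secret show up in events
$ kubectl get events -A | grep -iE "imagepull|secret"
$ # Confirm the NGC image-pull secret exists in the TAO namespace
$ kubectl get secrets -n <tao_namespace>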

Training is running now.

Thanks for your help!

Thanks for the info. Glad to know the training is running now.

@Morganh reopening this to ask a follow-up question: How would this work with offline training, meaning if we wanted to deploy TAO API on bare-metal hardware that isn’t internet-connected?

In that case we wouldn’t have network access to the NGC registry to pull the pretrained networks. Could we pull them all in advance, store them on the hardware where TAO is deployed, and then reference them locally (instead of using ngc_path) at PTM assignment and training time?

Officially, there is no guide for training without network access. You can try some experiments to check/enable it. When you run your current training, I think the PTM is available in a local path somewhere. Please try to check and find it. Then you can check how to make it work in all the cells.
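
One way to look for where the pulled PTM lands is to search the storage mounted by the TAO API pods – a rough sketch below; the namespace, pod name, and file pattern are guesses and depend on your deployment:

$ # Find the TAO API app pod, then search its filesystem/mounts for the downloaded backbone weights
$ kubectl get pods -A | grep -i tao
$ kubectl exec -n <tao_namespace> <tao-api-app-pod> -- \
    sh -c 'find / -iname "*resnet18*" 2>/dev/null'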

Got it – thank you. I was looking at this tutorial and trying to determine if it’s still up to date (though it uses Docker instead of the TAO API, so it’s perhaps not relevant for my use case).

Yes, it is not for TAO-API. It is for TAO docker or TAO launcher.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.