Hello,
I have TAO API 5.3 deployed & running on an AWS EKS instance with T4 GPUs.
I’m following the TAO API Object Detection workflow notebook to use the API for object detection tasks, with changes to train a custom yolo_v4 model using the backbone pretrained_object_detection:resnet18.
I follow all the steps in the notebook: create the experiment; upload, convert and assign the datasets; and assign the PTM to the experiment.
I assign the PTM nvidia/tao/pretrained_object_detection:resnet18, which has experiment ID 45a2921e-064d-58d3-9579-17f5326ba5db, to my new experiment.
This all works fine via the API so far.
However, when I make an API call to start training and view the training job’s logs, I see:
Base Experiment file for ID 45a2921e-064d-58d3-9579-17f5326ba5db is not found
This ID is the PTM ID corresponding to the NGC PTM nvidia/tao/pretrained_object_detection:resnet18. When I retrieve this base experiment’s JSON and examine its fields, I notice:
base_experiment: []
base_experiment_pull_complete: "not_present"
Before training, I fetched my custom-created experiment’s spec and confirmed that the resnet18 PTM is assigned to it, i.e.:
base_experiment: ['45a2921e-064d-58d3-9579-17f5326ba5db']
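To make the check concrete, here is a minimal sketch (not the notebook's actual code) of how the two JSON payloads above can be cross-checked. The field names `base_experiment` and `base_experiment_pull_complete` are the ones shown above; the helper name `base_experiment_status`, the dictionary layout, and the reading of `"not_present"` as "weights not yet pulled" are my assumptions.

```python
# Hypothetical helper: cross-check an experiment's assigned PTMs against
# the metadata fetched for each base experiment.
def base_experiment_status(experiment: dict, base_experiments: dict) -> list:
    """Return (ptm_id, pull_status) for each PTM assigned to the experiment."""
    report = []
    for ptm_id in experiment.get("base_experiment", []):
        meta = base_experiments.get(ptm_id, {})
        # "unknown" is a placeholder used here when the field is missing.
        status = meta.get("base_experiment_pull_complete", "unknown")
        report.append((ptm_id, status))
    return report


PTM_ID = "45a2921e-064d-58d3-9579-17f5326ba5db"

# Values copied from the JSON I retrieved (all other fields omitted).
my_experiment = {"base_experiment": [PTM_ID]}
ptm_metadata = {
    PTM_ID: {"base_experiment": [], "base_experiment_pull_complete": "not_present"}
}

report = base_experiment_status(my_experiment, ptm_metadata)
print(report)
```

With the values above, the PTM is assigned to my experiment but its own metadata reports `not_present`, which seems consistent with the "is not found" message in the training log.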
Question #1: Do I have to pull the resnet18 backbone from the NGC registry to my AWS server (where TAO API is deployed) for training to work? I know that in the Docker container version of TAO you can download it via ngc registry model download-version nvidia/tao/pretrained_object_detection:resnet18. Is there an equivalent API call I must make so that the resnet18 backbone weights are available to user-created experiments?
Question #2: My understanding is that I’d need to repeat this for every pretrained model we’d like to use from NGC. Is this correct?
Also, I tried using the /experiments:base API endpoint (“List Experiments that can be used for transfer learning”) to see whether the resnet18 backbone is present there, but this endpoint returns a 404 error.
Endpoint is
https://<TAO_IP>/tao-api/api/v1/users/<user_id>/experiments:base
Error response is:
<Response [404]>
{'code': 404, 'name': 'Not Found', 'description': 'The requested URL was not found on the server. If you entered the URL manually please check your spelling and try again.'}
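In case it helps reproduce the 404, here is a minimal sketch of such a request using only the Python standard library. The URL path is exactly the one quoted above; the host `tao.example.com`, the user ID `user-1234`, the function names, and the use of a bearer token in the `Authorization` header are placeholders and assumptions, not confirmed details of the TAO API.

```python
import json
import urllib.error
import urllib.request


def base_experiments_url(host: str, user_id: str) -> str:
    """Build the endpoint URL using the path quoted in this post."""
    return f"https://{host}/tao-api/api/v1/users/{user_id}/experiments:base"


def list_base_experiments(host: str, user_id: str, token: str):
    """GET the endpoint and return (status_code, body).

    Assumption: the API accepts a bearer token in the Authorization header.
    """
    request = urllib.request.Request(
        base_experiments_url(host, user_id),
        headers={"Authorization": f"Bearer {token}"},
    )
    try:
        with urllib.request.urlopen(request) as response:
            return response.status, json.loads(response.read())
    except urllib.error.HTTPError as err:
        # A 404 like the one above ends up here; return the raw body text.
        return err.code, err.read().decode("utf-8", errors="replace")


# Placeholder values, for illustration only; no request is sent here.
url = base_experiments_url("tao.example.com", "user-1234")
print(url)
```

Calling `list_base_experiments(...)` with real values would return the status code together with either the parsed JSON or the error body, which makes it easy to compare against the response shown above.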
Question #3: Is this a known bug with the /experiments:base API endpoint, and is this endpoint still supported in TAO 5.3?
Many thanks!!