TAO 5.0 Training Spec discrepancy

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc)
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) Classification_tf2 and AutoML
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here) v5.0.0
• Training spec file(If have, please share here) tao-getting-started_v5.0.0/notebooks/tao_launcher_starter_kit/classification_tf2/tao_voc/specs/spec.yaml AND https://github.com/NVIDIA/tao_front_end_services/blob/main/api/specs_utils/specs/classification_tf2/classification_tf2%20-%20train.csv
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

For the same classification_tf2 EfficientNet_B0 model, there appear to be two different ways to specify the training spec.

  1. Mentioned in tao-getting-started_v5.0.0/notebooks/tao_launcher_starter_kit/classification_tf2/tao_voc/specs/spec.yaml

There are choices for the LR schedule (cosine, step, etc.), and they are listed under train.lr_config.
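For reference, the relevant section of that spec.yaml looks roughly like this (field names and values recalled from the 5.0 notebook spec; treat them as illustrative, not authoritative):

```yaml
train:
  optim_config:
    optimizer: 'sgd'
  lr_config:
    scheduler: 'cosine'    # other schedulers such as 'step' are selectable here
    learning_rate: 0.05
    soft_start: 0.05
```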


  2. Mentioned in the latest GitHub repo: https://github.com/NVIDIA/tao_front_end_services/blob/main/api/specs_utils/specs/classification_tf2/classification_tf2%20-%20train.csv


Here I don’t see any choices for the LR schedule, and train.lr_config doesn’t seem to exist anymore. I also see new options under train.optim_config, but no way to tell which belong to which type of optimizer (e.g. SGD, Adam, Adadelta).

Here are my questions.

  1. I was able to train a model with both specs, but which spec file should be followed for consistency? I am guessing the newer one, given its recency.

  2. For the new one, there is no documentation on which hyperparameter choices are available. Is there anywhere I can look this up?

For TAO 5.0, the spec files without the TAO API can be found at https://github.com/NVIDIA/tao_tensorflow2_backend/tree/main/nvidia_tao_tf2/cv/classification/experiment_specs .
The latest 5.0 spec file should be available soon.
In the meantime, the source code is available: https://github.com/NVIDIA/tao_tensorflow2_backend/blob/main/nvidia_tao_tf2/cv/classification/config/default_config.py#L60

Thanks for this.
So the LR config cannot be set via AutoML at all? I don’t see it in here.

Also, there is both a train.optim_config.lr and a train.lr_config.learning_rate. I am guessing the latter overrides the former. Yet the former is included in the AutoML hyperparameters by default, and there is no mention of whether the latter can be used at all.
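A minimal sketch of the two overlapping fields in question (parameter names taken from the discussion above; which one takes effect is exactly what is undocumented):

```yaml
train:
  optim_config:
    lr: 0.001            # present in the new CSV; enabled for AutoML by default
  lr_config:
    learning_rate: 0.05  # presumably overrides optim_config.lr, but this is unconfirmed
```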

By default, optimizer_config.beta_1 and optimizer_config.nesterov are both enabled for AutoML, but no optimizer actually uses both of these parameters simultaneously. Assuming the default optimizer is SGD, what is the point of having beta_1 enabled? Similarly, if the optimizer is Adam, what is the point of having nesterov enabled? Or am I misunderstanding something?
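To make the mismatch concrete, here is a rough mapping of which optim_config parameters each optimizer would actually consume. This is assumed from the usual Keras optimizer signatures, not from TAO documentation, so field names may differ:

```yaml
train:
  optim_config:
    optimizer: 'sgd'   # assumed default
    # consumed by SGD only:
    momentum: 0.9
    nesterov: True
    # consumed by Adam only:
    beta_1: 0.9
    beta_2: 0.999
    # consumed by Adadelta only:
    rho: 0.95
```

Under this reading, enabling beta_1 and nesterov together for AutoML means at least one of them is always a no-op for the chosen optimizer.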

And why doesn’t AutoML search through optimizer_config.momentum, optimizer_config.decay, and reg_config.weight_decay in v5.0? It did until v4.0.2. Is it no longer considered helpful?

The CSV file is missing the LR config parameters. As for the optimizer parameters currently in the CSV file, users can enable/disable them according to the optimizer in use.

By the way, as a workaround: since the repo is open source, users can decide what they want to enable. Modify the CSV files and then build a new container.
https://github.com/NVIDIA/tao_front_end_services. See the docker_build target in the Makefile for the docker build step.

Do you know when the spec file for 5.0 will be updated?

For the 5.0 spec file, you can download the latest notebook.
Refer to the TAO Toolkit Quick Start Guide - NVIDIA Docs:

  wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/tao/tao-getting-started/versions/5.0.0/zip -O getting_started_v5.0.0.zip
  unzip -u getting_started_v5.0.0.zip  -d ./getting_started_v5.0.0 && rm -rf getting_started_v5.0.0.zip && cd ./getting_started_v5.0.0

or TAO Toolkit Getting Started | NVIDIA NGC

The 5.0 user guide is also already available: Object Detection - NVIDIA Docs

It says v4.0.1, has no information about Image Classification using PyTorch, and no explanation of the new hyperparameter choices for AutoML.

The AutoML parameters aren’t dependent on the notebook, right? Unless changes are made to the GitHub repo here, users can’t use the LR config parameters for AutoML, correct? That was what my question was about. When do you reckon the documentation and the AutoML fix will come in?

Please try again. We fixed the issue and it is 5.0.0 now.

Yes, in the current version the LR config is missing. Users can use the workaround mentioned above to build a new container. We will fix it in the next release.
