I am getting the following error when trying to train a model in TAO 5.5.
The training code looks up the configuration cudnn.benchmark = cfg["train"]["cudnn"]["benchmark"],
but I can't find any such option in the TAO DINO documentation.
tao model dino train \
-e /workspace/tao-experiments/specs/train.yml
2024-11-22 03:25:19,278 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2024-11-22 03:25:19,368 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt
2024-11-22 03:25:19,382 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
[2024-11-22 03:25:27,199 - TAO Toolkit - matplotlib.font_manager - INFO] generated new fontManager
/usr/local/lib/python3.10/dist-packages/hydra/plugins/config_source.py:124: UserWarning: Support for .yml files is deprecated. Use .yaml extension for Hydra config files
deprecation_warning(
/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
ret = run_job(
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/loggers/api_logging.py:236: UserWarning: Log file already exists at /workspace/tao-experiments/results/trainings/training1/status.json
rank_zero_warn(
Seed set to 1234
Train results will be saved at: /workspace/tao-experiments/results/trainings/training1
Error executing job with overrides: []
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 69, in _func
raise e
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 48, in _func
runner(cfg, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/scripts/train.py", line 146, in main
run_experiment(experiment_config=cfg,
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/scripts/train.py", line 36, in run_experiment
results_dir, resume_ckpt, gpus, ptl_loggers = initialize_train_experiment(experiment_config, key)
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/initialize_experiments.py", line 56, in initialize_train_experiment
cudnn.benchmark = cfg["train"]["cudnn"]["benchmark"]
omegaconf.errors.ConfigKeyError: Key 'cudnn' is not in struct
full_key: train.cudnn
object_type=dict
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[2024-11-22 03:25:35,916 - TAO Toolkit - root - INFO] Sending telemetry data.
[2024-11-22 03:25:35,916 - TAO Toolkit - root - INFO] ================> Start Reporting Telemetry <================
[2024-11-22 03:25:35,916 - TAO Toolkit - root - INFO] Sending {'version': '5.5.0', 'action': 'train', 'network': 'dino', 'gpu': ['Tesla-V100-SXM2-16GB'], 'success': False, 'time_lapsed': 8} to https://api.tao.ngc.nvidia.com.
[2024-11-22 03:25:37,147 - TAO Toolkit - root - INFO] Telemetry sent successfully.
[2024-11-22 03:25:37,148 - TAO Toolkit - root - INFO] ================> End Reporting Telemetry <================
[2024-11-22 03:25:37,148 - TAO Toolkit - root - WARNING] Execution status: FAIL
2024-11-22 03:25:38,297 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.
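If I am reading the error right, it comes from OmegaConf's struct mode, which raises an error instead of returning None when a key is missing from the schema. Here is a minimal standalone reproduction of the same ConfigKeyError with plain OmegaConf (no TAO involved) — my assumption about what is happening inside the container:

from omegaconf import OmegaConf

# A config that, like my spec, has a train section but no train.cudnn key
cfg = OmegaConf.create({"train": {"seed": 1234}})
OmegaConf.set_struct(cfg, True)  # struct mode: missing keys raise instead of returning None

# Raises omegaconf.errors.ConfigKeyError: Key 'cudnn' is not in struct
benchmark = cfg["train"]["cudnn"]["benchmark"]

So it looks like the schema in the container does not merge a default train.cudnn block into my spec before that line runs.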
And the following is my configuration file:
train:
  freeze: ['backbone', 'transformer.encoder']
  pretrained_model_path: /workspace/tao-experiments/models/retail_object_detection_vtrainable_retail_object_detection_binary_v2.2.2.3/dino_model_epoch011.pth
  num_gpus: 1
  num_nodes: 1
  validation_interval: 1
  checkpoint_interval: 1
  seed: 1234
  results_dir: /workspace/tao-experiments/results/trainings/training1
  optim:
    lr_backbone: 1e-6
    lr: 1e-5
    lr_steps: [11]
    momentum: 0.9
  num_epochs: 12
dataset:
  train_data_sources:
    - image_dir: /workspace/tao-experiments/data/dataset_2024-22-11T0942_1732228936/train
      json_file: /workspace/tao-experiments/data/dataset_2024-22-11T0942_1732228936/annotations/instances_train.json
  val_data_sources:
    - image_dir: /workspace/tao-experiments/data/dataset_2024-22-11T0942_1732228936/test
      json_file: /workspace/tao-experiments/data/dataset_2024-22-11T0942_1732228936/annotations/instances_test.json
  num_classes: 2
  batch_size: 4
  workers: 8
  augmentation:
    fixed_padding: False
model:
  backbone: fan_base
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 900
  num_select: 100
  dropout_ratio: 0.0
  dim_feedforward: 2048
results_dir: /workspace/tao-experiments/results/trainings/training1
encryption_key: nvidia_tao
Based on the PyTorch repo, it seems the code also looks for other configurations such as cfg["train"]["cudnn"]["deterministic"] and cfg["train"]["cudnn"]["benchmark"],
which are not defined in the documentation.
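From the code path in the traceback I assume the spec wants something like the block below under train. The key names come straight from the failing code; the values are only my guess, not from the documentation:

train:
  cudnn:
    benchmark: False      # guessed value, not from the docs
    deterministic: True   # guessed value, not from the docs

But I would rather confirm the intended values than guess, hence the questions below.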
- Can you please explain why I am getting this error? (Don't these options have default values?)
- And if I am supposed to specify values, can you let me know the correct values for the above two configurations? Thanks.