The TAO 5.5.0 launcher DINO notebook is not working with default settings

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc)
RTX 4090
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc)
Dino as configured in the tao launcher notebook
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
5.5.0
• Training spec file(If have, please share here)
Defaults from the notebook/github.
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

I have exactly the same issue as this user: Very low evaluation results for dino model by dino.ipynb in tao-getting-started_v5.3

I just went through the notebook and could reproduce the issue on two different systems, one running Ubuntu 20.04 and one running Ubuntu 22.04.

After 12 epochs, the AP is around zero, while it should be around 50. That thread suggested two things:

  • to check the category id numbering, but since this is COCO 2017, the ids are already correct (see the quick check after this list).
  • to increase num_queries from 300 back to 900
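
For the first point, this is the quick check I mean, a minimal sketch run against the annotation file the notebook mounts into the container (adjust the path if you run it outside the container):

import json

# Annotation path as mounted inside the TAO container (see the spec);
# adjust to the host path if running outside the container.
with open("/data/raw-data/annotations/instances_val2017.json") as f:
    coco = json.load(f)

ids = sorted(c["id"] for c in coco["categories"])
# COCO 2017 has 80 categories with ids spread over 1..90,
# which is consistent with num_classes: 91 in the spec.
print(len(ids), min(ids), max(ids))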

How did the author of the notebook validate that it actually works if the claimed AP cannot be reproduced? These are other warnings and messages reported while running the notebook:

/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:91: UserWarning: None of the inputs have requires_grad=True. Gradients will be None

No pretrained configuration specified for convnext_base_in22k model. Using a default. Please add a config to the model pretrained_cfg registry or pass explicitly.

Loaded pretrained weights from /workspace/tao-experiments/dino/pretrained_dino_nvimagenet_vfan_small_hybrid_nvimagenet/fan_small_hybrid_nvimagenet.pth
_IncompatibleKeys(missing_keys=['out_norm1.weight', 'out_norm1.bias', 'out_norm2.weight', 'out_norm2.bias', 'out_norm3.weight', 'out_norm3.bias', 'learnable_downsample.weight', 'learnable_downsample.bias'], unexpected_keys=['norm.weight', 'norm.bias', 'head.fc.weight', 'head.fc.bias'])

To save memory, I set the precision to fp16, so I was not running entirely 'default' settings, but this is allowed per the NVIDIA DINO documentation (DINO - NVIDIA Docs). My full edits to the train.yaml file are:

diff --git a/notebooks/tao_launcher_starter_kit/dino/specs/train.yaml b/notebooks/tao_launcher_starter_kit/dino/specs/train.yaml
index e63f8cd..e9b6d22 100644
--- a/notebooks/tao_launcher_starter_kit/dino/specs/train.yaml
+++ b/notebooks/tao_launcher_starter_kit/dino/specs/train.yaml
@@ -7,7 +7,9 @@ train:
     lr: 2e-4
     lr_steps: [11]
     momentum: 0.9
     num_epochs: 12
+  precision: fp16
+  activation_checkpoint: True
 dataset:
   train_data_sources:
     - image_dir: /data/raw-data/train2017/
@@ -17,9 +19,9 @@ dataset:
       json_file: /data/raw-data/annotations/instances_val2017.json
   num_classes: 91
   batch_size: 4
-  workers: 8
+  workers: 16
   augmentation:
-    fixed_padding: False
+    fixed_padding: True
 model:
   backbone: fan_small
   train_backbone: True

I have now restarted training this setup with model.num_queries: 900, but if this fixes it, the notebook should be fixed upstream.
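
For reference, the model section of the notebook's train.yaml after this one change looks roughly like this (only the fields already shown in the diff above, everything else left at the notebook defaults):

model:
  backbone: fan_small
  train_backbone: True
  num_queries: 900   # the notebook ships with 300 here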

Please refer to the spec file below as well.

train:
  num_gpus: 8
  num_nodes: 1
  validation_interval: 1
  optim:
    lr_backbone: 1e-05
    lr: 0.0001
    lr_steps: [30]
    momentum: 0.9
    layer_decay_rate: 0.65
  num_epochs: 36
dataset:
  train_data_sources:
    - image_dir: "???"
      json_file: "???"
  val_data_sources:
    - image_dir: "???"
      json_file: "???"
  num_classes: 91
  batch_size: 2
  workers: 8
  augmentation:
    fixed_random_crop: 1536
    test_random_resize: 1536
    random_resize_max_size: 1536
    fixed_padding: True
model:
  pretrained_backbone_path: "???"
  backbone: vit_large_nvdinov2
  train_backbone: False
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 900
  dropout_ratio: 0.0
  dim_feedforward: 2048
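
Training with this spec is launched the same way as in the notebook, for example (a sketch using the notebook's environment variables; adjust the paths and variables to your own setup):

tao model dino train -e $SPECS_DIR/train.yaml results_dir=$RESULTS_DIR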

Also, more pretrained models can be found at TAO Pretrained DINO with Foundational Model Backbone | NVIDIA NGC and Pre-trained DINO ImageNet weights | NVIDIA NGC.
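
For example, the FAN backbone weights used earlier in this thread can be pulled with the ngc CLI (the model name and version tag below are inferred from the download directory shown in your log; please verify them against the NGC model card):

ngc registry model download-version nvidia/tao/pretrained_dino_nvimagenet:fan_small_hybrid_nvimagenet --dest $LOCAL_PROJECT_DIR/dino/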

I can now confirm that num_queries: 300 is the culprit. Raising it to 900 is sufficient to make the notebook perform as advertised. I didn't see a noticeable training slowdown from raising this parameter.

Thanks for the updated train.yaml file; I will take a look at it as well.

I have tested the train.yaml file you just provided, and it does not work; it breaks in the ViT pretrained backbone interpolation step. I could download the .pth file using the ngc command from TAO Pretrained DINO with Foundational Model Backbone | NVIDIA NGC:
ngc registry model download-version nvidia/tao/dino_with_fm_backbone:trainable_v1.0 --dest $LOCAL_PROJECT_DIR/dino/

and fill in the correct paths where you marked ??? in the yaml file.

The online model card says to use 'vit_large_dinov2' as the backbone, while you wrote 'vit_large_nvdinov2'; either way it does not work and produces the same error:

Train results will be saved at: /results/train
Do ViT pretrained backbone interpolation
Error executing job with overrides: ['results_dir=/results/']
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 69, in _func
    raise e
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 48, in _func
    runner(cfg, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/scripts/train.py", line 146, in main
    run_experiment(experiment_config=cfg,
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/scripts/train.py", line 76, in run_experiment
    pt_model = lightning_module(experiment_config)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/model/pl_dino_model.py", line 54, in __init__
    self._build_model(export)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/model/pl_dino_model.py", line 61, in _build_model
    self.model = build_model(experiment_config=self.experiment_spec, export=export)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/model/build_nn_model.py", line 306, in build_model
    model = DINOModel(num_classes=num_classes,
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/model/build_nn_model.py", line 159, in __init__
    backbone_only = Backbone(backbone,
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/model/backbone.py", line 224, in __init__
    pretrained_backbone_ckp = interpolate_vit_checkpoint(checkpoint=pretrained_backbone_ckp,
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/model/backbone.py", line 301, in interpolate_vit_checkpoint
    checkpoint = interpolate_patch_embed(checkpoint=checkpoint, new_patch_size=target_patch_size)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/utils/pos_embed_interpolation_converter.py", line 87, in interpolate_patch_embed
    patch_embed = checkpoint['patch_embed.proj.weight']
KeyError: 'patch_embed.proj.weight'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
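
The KeyError means the downloaded checkpoint has no top-level 'patch_embed.proj.weight' entry; it may be nested under a prefix such as 'state_dict' or 'model'. A quick way to inspect what the file actually contains (a sketch; the path is a placeholder for whatever you set as pretrained_backbone_path):

import torch

# Placeholder path: point this at the .pth used as pretrained_backbone_path.
ckpt = torch.load("path/to/backbone.pth", map_location="cpu")

# Some checkpoints wrap the weights under 'state_dict' or 'model'.
state_dict = ckpt
if isinstance(ckpt, dict):
    state_dict = ckpt.get("state_dict", ckpt.get("model", ckpt))

print("patch_embed.proj.weight" in state_dict)
for key in list(state_dict)[:20]:
    print(key)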

Do you have any idea what is going wrong?
