The TAO 5.5.0 launcher DINO notebook is not working with default settings

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc)
RTX 4090
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc)
Dino as configured in the tao launcher notebook
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
5.5.0
• Training spec file(If have, please share here)
Defaults from the notebook/github.
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

I have exactly the same issue as this user: Very low evaluation results for dino model by dino.ipynb in tao-getting-started_v5.3

I just went through the notebook and could reproduce the issue on two different systems, one running Ubuntu 20.04 and one running Ubuntu 22.04.

After 12 epochs, the AP is around zero, while it should be around 50. That thread suggested two things:

  • to check the category id numbering, but since this is COCO 2017, the ids are already correct (see the quick check after this list).
  • to increase num_queries from 300 back to 900
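
For the first point, this is the quick check I mean, a minimal sketch run against the annotation file the notebook mounts into the container (adjust the path if you run it outside the container):

import json

# Annotation path as mounted inside the TAO container (see the spec);
# adjust to the host path if running outside the container.
with open("/data/raw-data/annotations/instances_val2017.json") as f:
    coco = json.load(f)

ids = sorted(c["id"] for c in coco["categories"])
# COCO 2017 has 80 categories with ids spread over 1..90,
# which is consistent with num_classes: 91 in the spec.
print(len(ids), min(ids), max(ids))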

How did the author of the notebook validate that it actually works if the claimed AP cannot be reproduced? These are other warnings and messages reported while running the notebook:

/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:91: UserWarning: None of the inputs have requires_grad=True. Gradients will be None

No pretrained configuration specified for convnext_base_in22k model. Using a default. Please add a config to the model pretrained_cfg registry or pass explicitly.

Loaded pretrained weights from /workspace/tao-experiments/dino/pretrained_dino_nvimagenet_vfan_small_hybrid_nvimagenet/fan_small_hybrid_nvimagenet.pth
_IncompatibleKeys(missing_keys=['out_norm1.weight', 'out_norm1.bias', 'out_norm2.weight', 'out_norm2.bias', 'out_norm3.weight', 'out_norm3.bias', 'learnable_downsample.weight', 'learnable_downsample.bias'], unexpected_keys=['norm.weight', 'norm.bias', 'head.fc.weight', 'head.fc.bias'])

To save memory, I set the precision to fp16, so I was not running entirely 'default' settings, but this is allowed per the NVIDIA DINO documentation (DINO - NVIDIA Docs). My full edits to the train.yaml file are:

diff --git a/notebooks/tao_launcher_starter_kit/dino/specs/train.yaml b/notebooks/tao_launcher_starter_kit/dino/specs/train.yaml
index e63f8cd..e9b6d22 100644
--- a/notebooks/tao_launcher_starter_kit/dino/specs/train.yaml
+++ b/notebooks/tao_launcher_starter_kit/dino/specs/train.yaml
@@ -7,7 +7,9 @@ train:
     lr: 2e-4
     lr_steps: [11]
     momentum: 0.9
     num_epochs: 12
+  precision: fp16
+  activation_checkpoint: True
 dataset:
   train_data_sources:
     - image_dir: /data/raw-data/train2017/
@@ -17,9 +19,9 @@ dataset:
       json_file: /data/raw-data/annotations/instances_val2017.json
   num_classes: 91
   batch_size: 4
-  workers: 8
+  workers: 16
   augmentation:
-    fixed_padding: False
+    fixed_padding: True
 model:
   backbone: fan_small
   train_backbone: True

I have now restarted training this setup with model.num_queries: 900, but if this fixes it, the notebook should be fixed upstream.
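
For reference, the model section of the notebook's train.yaml after this one change looks roughly like this (only the fields already shown in the diff above, everything else left at the notebook defaults):

model:
  backbone: fan_small
  train_backbone: True
  num_queries: 900   # the notebook ships with 300 here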

Please refer to the spec file below as well.

train:
  num_gpus: 8
  num_nodes: 1
  validation_interval: 1
  optim:
    lr_backbone: 1e-05
    lr: 0.0001
    lr_steps: [30]
    momentum: 0.9
    layer_decay_rate: 0.65
  num_epochs: 36
dataset:
  train_data_sources:
    - image_dir: "???"
      json_file: "???"
  val_data_sources:
    - image_dir: "???"
      json_file: "???"
  num_classes: 91
  batch_size: 2
  workers: 8
  augmentation:
    fixed_random_crop: 1536
    test_random_resize: 1536
    random_resize_max_size: 1536
    fixed_padding: True
model:
  pretrained_backbone_path: "???"
  backbone: vit_large_nvdinov2
  train_backbone: False
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 900
  dropout_ratio: 0.0
  dim_feedforward: 2048
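
Training with this spec is launched the same way as in the notebook, for example (a sketch using the notebook's environment variables; adjust the paths and variables to your own setup):

tao model dino train -e $SPECS_DIR/train.yaml results_dir=$RESULTS_DIR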

Also, more pretrained models can be found at TAO Pretrained DINO with Foundational Model Backbone | NVIDIA NGC and Pre-trained DINO ImageNet weights | NVIDIA NGC.
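
For example, the FAN backbone weights used earlier in this thread can be pulled with the ngc CLI (the model name and version tag below are inferred from the download directory shown in your log; please verify them against the NGC model card):

ngc registry model download-version nvidia/tao/pretrained_dino_nvimagenet:fan_small_hybrid_nvimagenet --dest $LOCAL_PROJECT_DIR/dino/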

I can now confirm that num_queries: 300 is the culprit. Raising it to 900 is sufficient to make the notebook perform as advertised. I didn't see a noticeable training slowdown from raising this parameter.

Thanks for the updated train.yaml file; I will take a look at it as well.

I have tested the train.yaml file you just provided, and it does not work; it breaks in the ViT pretrained backbone interpolation step. I could download the .pth file using the ngc command from TAO Pretrained DINO with Foundational Model Backbone | NVIDIA NGC:
ngc registry model download-version nvidia/tao/dino_with_fm_backbone:trainable_v1.0 --dest $LOCAL_PROJECT_DIR/dino/

and fill in the correct paths where you marked ??? in the yaml file.

The online model card says to use 'vit_large_dinov2' as the backbone, while you wrote 'vit_large_nvdinov2'; either way it does not work and produces the same error:

Train results will be saved at: /results/train
Do ViT pretrained backbone interpolation
Error executing job with overrides: ['results_dir=/results/']
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 69, in _func
    raise e
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 48, in _func
    runner(cfg, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/scripts/train.py", line 146, in main
    run_experiment(experiment_config=cfg,
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/scripts/train.py", line 76, in run_experiment
    pt_model = lightning_module(experiment_config)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/model/pl_dino_model.py", line 54, in __init__
    self._build_model(export)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/model/pl_dino_model.py", line 61, in _build_model
    self.model = build_model(experiment_config=self.experiment_spec, export=export)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/model/build_nn_model.py", line 306, in build_model
    model = DINOModel(num_classes=num_classes,
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/model/build_nn_model.py", line 159, in __init__
    backbone_only = Backbone(backbone,
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/model/backbone.py", line 224, in __init__
    pretrained_backbone_ckp = interpolate_vit_checkpoint(checkpoint=pretrained_backbone_ckp,
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/model/backbone.py", line 301, in interpolate_vit_checkpoint
    checkpoint = interpolate_patch_embed(checkpoint=checkpoint, new_patch_size=target_patch_size)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/utils/pos_embed_interpolation_converter.py", line 87, in interpolate_patch_embed
    patch_embed = checkpoint['patch_embed.proj.weight']
KeyError: 'patch_embed.proj.weight'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
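
The KeyError means the downloaded checkpoint has no top-level 'patch_embed.proj.weight' entry; it may be nested under a prefix such as 'state_dict' or 'model'. A quick way to inspect what the file actually contains (a sketch; the path is a placeholder for whatever you set as pretrained_backbone_path):

import torch

# Placeholder path: point this at the .pth used as pretrained_backbone_path.
ckpt = torch.load("path/to/backbone.pth", map_location="cpu")

# Some checkpoints wrap the weights under 'state_dict' or 'model'.
state_dict = ckpt
if isinstance(ckpt, dict):
    state_dict = ckpt.get("state_dict", ckpt.get("model", ckpt))

print("patch_embed.proj.weight" in state_dict)
for key in list(state_dict)[:20]:
    print(key)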

Do you have any idea what is going wrong?
