Training accuracy issue with GroundingDINO

Please provide the following information when requesting support.

• Hardware: L40S
• Network Type: Grounding DINO
• TLT Version: TAO 5.5
• Training spec file:

train:
  num_gpus: 8
  num_nodes: 1
  validation_interval: 1
  optim:
    lr_backbone: 2e-05
    lr: 0.0002
    lr_steps: [30,80]
    momentum: 0.9
    lr_linear_proj_mult: 0.1
  num_epochs: 100
  freeze: ["backbone.0", "bert"]  # if only finetuning
  precision: bf16
  pretrained_model_path: /code/TAO/grounding_dino_vgrounding_dino_swin_tiny_commercial_trainable_v1.0/grounding_dino_swin_tiny_commercial_trainable.pth
dataset:
  train_data_sources:
    - image_dir: //grounding_object_dataset/data/pinyin/train
      json_file: //grounding_object_dataset/tao_format/pinyin/train_odvg.jsonl
      label_map: //grounding_object_dataset/tao_format/pinyin/train_odvg_labelmap.json
  val_data_sources:
    image_dir: //grounding_object_dataset/data/pinyin/val
    json_file: //grounding_object_dataset/tao_format/pinyin/val_remapped.json
  max_labels: 120
  batch_size: 8
  workers: 8
  dataset_type: serialized  # To reduce the system memory usage
  augmentation:
    scales:
    - 480
    - 512
    - 544
    - 576
    - 608
    - 640
    - 672
    - 704
    - 736
    - 768
    - 800
    input_mean:
    - 0.485
    - 0.456
    - 0.406
    input_std:
    - 0.229
    - 0.224
    - 0.225
    train_random_resize:
    - 480
    - 512
    - 544
    - 576
    - 608
    - 640
    - 672
    - 704
    - 736
    - 768
    - 800
    - 1024
    horizontal_flip_prob: 0.0
    train_random_crop_min: 384
    train_random_crop_max: 600
    random_resize_max_size: 1024
    test_random_resize: 1024
    fixed_padding: true
    fixed_random_crop: null
model:
  backbone: swin_tiny_224_1k
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 1500
  dropout_ratio: 0.0
  dim_feedforward: 2048
  log_scale: auto
  class_embed_bias: True
  num_select: 1500
  dn_number: 0


• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

While training on my own dataset, I found that the accuracy cannot reach the expected level (I can achieve an mAP50 of 96 when using other GroundingDINO training frameworks).

However, with TAO, after training for 70 epochs the mAP50 only reached 87. My dataset has three categories and about 20,000 images.

Is this performance normal? Could it be that some parameters are not set correctly?
Thanks

May I know the resolution of the images in your dataset?

Thank you for your reply. The size of my images ranges from 800 to 1400 in both width and height, for example, 1024x1279.

Please try increasing these values and run some experiments.
For example:
train_random_crop_min: 768
train_random_crop_max: 1280
random_resize_max_size: 1333
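In the spec above these fields live under dataset.augmentation, so the change would look roughly like the sketch below (only the affected fields are shown; the values are a starting point for experiments, not verified defaults):

dataset:
  augmentation:
    train_random_crop_min: 768
    train_random_crop_max: 1280
    random_resize_max_size: 1333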

As one more experiment, comment out the line below.
freeze: ["backbone.0", "bert"]  # if only finetuning
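In the train section of the spec that means the whole line is commented out, presumably so that the backbone and the BERT text encoder are fine-tuned as well (a sketch of the affected part only):

train:
  ...
  # freeze: ["backbone.0", "bert"]  # commented out: backbone and BERT are then trained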

Thanks, I’ll have a try~

@Morganh

Hi, Morganh.

During training of the grounding model, the metrics all look normal, but at actual inference time different prompt combinations have a significant impact on the results.

For example, when the only prompt is "text", the bounding box for "text" appears, but once a "dotted text" prompt is added, the bounding box for "text" no longer shows up (["text"] vs ["text", "dotted text"]). What might be causing this? During evaluation the metrics still look normal.

Could you please create a new forum topic? Thanks a lot.
