Couldn't utilize full machine resources while training

Please provide the following information when requesting support.

• Hardware (A100)
• Network Type (Dino)
• Training spec file (if you have one, please share it here)

train:
  num_gpus: 1
  num_nodes: 1
  validation_interval: 1
  optim:
    lr_backbone: 2e-05
    lr: 2e-4
    lr_steps: [11]
    momentum: 0.9
  num_epochs: 12
  precision: fp16
  checkpoint_interval: 1
  activation_checkpoint: True
  pretrained_model_path: /workspace/tao-experiments/dino/dino_fan_large_imagenet22k_36ep.pth
dataset:
  train_data_sources:
    - image_dir: /data/images/train/
      json_file: /data/train/annotations.json
  val_data_sources:
    - image_dir: /data/images/valid/
      json_file: /data/valid/annotations.json
  test_data_sources:
    image_dir: /data/images/test/
    json_file: /data/test/annotations.json
  num_classes: 6
  batch_size: 16
  workers: 15
  augmentation:
    fixed_padding: True
model:
  backbone: fan_large
  train_backbone: False
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 900
  num_select: 100
  dropout_ratio: 0.0
  dim_feedforward: 2048

I am training DINO with the TAO launcher toolkit.

Below is a screenshot of my VM's resources; the VM contains an A100 GPU.

[screenshot: VM resource specs and usage]

I want to utilize the VM's resources fully, but I couldn't raise batch_size above 32, and with batch_size = 32 I couldn't increase the workers beyond 10.

I can only raise the number of workers to 20 by reducing the batch size to 16.

Every other attempt to increase the batch size or the number of workers throws a memory-related error and the training container stops.
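A likely reason for this trade-off: DINO under TAO uses a PyTorch-style DataLoader, where each worker is a separate process that prefetches whole batches into host RAM and the container's shared memory (/dev/shm), so loader memory grows with workers × batch size even when the GPU has headroom. Below is a minimal back-of-envelope sketch, assuming a standard PyTorch DataLoader with its default prefetch_factor and a padded sample of roughly 3 × 1333 × 1333 float32 (both assumptions; substitute the real shapes your augmentation produces):

# Rough host-RAM estimate for DataLoader prefetching.
bytes_per_sample = 3 * 1333 * 1333 * 4   # ~21 MB per padded float32 image (assumed shape)

batch_size = 16        # dataset.batch_size from the spec above
num_workers = 15       # dataset.workers from the spec above
prefetch_factor = 2    # PyTorch DataLoader default: batches buffered per worker

in_flight = num_workers * prefetch_factor * batch_size
print(f"~{in_flight * bytes_per_sample / 2**30:.1f} GiB held by prefetched batches")
# 15 workers x 2 batches x 16 samples x ~21 MB -> roughly 9.5 GiB of host RAM
# (much of it in /dev/shm) before the model computes anything. Doubling either
# batch_size or workers doubles this figure, which is consistent with the
# container dying on memory while the GPU itself still has headroom.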

But when I check resource usage, the CPUs are not even at 50%, and the GPU is also below 50%.
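Note that the GPU utilization reported by nvidia-smi measures how often kernels are executing, not memory pressure, so a GPU can sit below 50% utilization and still hit out-of-memory errors. Logging utilization and memory side by side over time helps separate a memory ceiling from an input-pipeline bottleneck. Here is a small polling sketch, assuming pynvml (nvidia-ml-py) and psutil are installed in the container (both package choices are assumptions, not part of TAO):

# Poll GPU and CPU usage once per second while training runs in another shell.
import time

import psutil
import pynvml

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)  # first (only) A100
try:
    for _ in range(60):
        util = pynvml.nvmlDeviceGetUtilizationRates(gpu)
        mem = pynvml.nvmlDeviceGetMemoryInfo(gpu)
        print(
            f"gpu util {util.gpu:3d}% | "
            f"gpu mem {mem.used / 2**30:5.1f}/{mem.total / 2**30:.1f} GiB | "
            f"cpu {psutil.cpu_percent():5.1f}% | "
            f"ram {psutil.virtual_memory().percent:5.1f}%"
        )
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()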

Can you advise how I can utilize my VM's resources efficiently here?

Please refer to DINO - NVIDIA Docs for optimization guidance.
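If the failures point at shared memory, one launcher-side tweak worth trying is giving the TAO container a larger /dev/shm via the DockerOptions block of ~/.tao_mounts.json (check that your TAO launcher version supports this option). A sketch; the "16G" value is illustrative and should be sized against the VM's free RAM:

# Sketch: enlarge the TAO container's shared memory via ~/.tao_mounts.json.
import json
import pathlib

mounts = pathlib.Path.home() / ".tao_mounts.json"
cfg = json.loads(mounts.read_text())          # assumes the mounts file already exists
cfg.setdefault("DockerOptions", {})["shm_size"] = "16G"   # illustrative value
mounts.write_text(json.dumps(cfg, indent=4))
print(f"updated {mounts}")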

There has been no update from you for a while, so we are assuming this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.