Please provide the following information when requesting support.
• Hardware (A100)
• Network Type (DINO)
• Training spec file (if you have one, please share it here)
train:
  num_gpus: 1
  num_nodes: 1
  validation_interval: 1
  optim:
    lr_backbone: 2e-05
    lr: 2e-4
    lr_steps: [11]
    momentum: 0.9
  num_epochs: 12
  precision: fp16
  checkpoint_interval: 1
  activation_checkpoint: True
  pretrained_model_path: /workspace/tao-experiments/dino/dino_fan_large_imagenet22k_36ep.pth
dataset:
  train_data_sources:
    - image_dir: /data/images/train/
      json_file: /data/train/annotations.json
  val_data_sources:
    - image_dir: /data/images/valid/
      json_file: /data/valid/annotations.json
  test_data_sources:
    image_dir: /data/images/test/
    json_file: /data/test/annotations.json
  num_classes: 6
  batch_size: 16
  workers: 15
  augmentation:
    fixed_padding: True
model:
  backbone: fan_large
  train_backbone: False
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 900
  num_select: 100
  dropout_ratio: 0.0
  dim_feedforward: 2048
I am training DINO with the TAO Toolkit launcher.
Below are my VM's resources; the VM contains an A100 GPU.
I want to utilize the VM's resources fully, but I cannot push batch_size above 32, and with batch_size set to 32 I cannot raise workers above 10.
I can only raise workers to 20 by reducing batch_size to 16; the snippet below summarizes the two combinations that run.
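For reference, these are the only two combinations of the dataset settings that run for me (the numbers are exactly the ones described above, shown here in spec form):

# combination 1: larger batch, fewer workers
dataset:
  batch_size: 32
  workers: 10

# combination 2: smaller batch, more workers
dataset:
  batch_size: 16
  workers: 20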
Every other attempt to increase the batch size or the number of workers fails with a memory-related error, and the training container stops.
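Note that the train section of the spec above already enables the usual memory savers, copied here for reference:

train:
  precision: fp16              # mixed-precision training already enabled
  activation_checkpoint: True  # activation checkpointing already enabled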
Yet when I check resource usage, the CPUs are at less than 50% utilization and the GPU is also below 50%.
Can you advise what I can do to utilize my VM's resources efficiently here?