Please provide the following information when requesting support.
• Hardware (A100)
• Network Type (DINO)
• Training spec file (if you have one, please share it here)
train:
  num_gpus: 1
  num_nodes: 1
  validation_interval: 1
  optim:
    lr_backbone: 2e-05
    lr: 2e-4
    lr_steps: [11]
    momentum: 0.9
  num_epochs: 12
  precision: fp16
  checkpoint_interval: 1
  activation_checkpoint: True
  pretrained_model_path: /workspace/tao-experiments/dino/dino_fan_large_imagenet22k_36ep.pth
dataset:
  train_data_sources:
    - image_dir: /data/images/train/
      json_file: /data/train/annotations.json
  val_data_sources:
    - image_dir: /data/images/valid/
      json_file: /data/valid/annotations.json
  test_data_sources:
    image_dir: /data/images/test/
    json_file: /data/test/annotations.json
  num_classes: 6
  batch_size: 16
  workers: 15
  augmentation:
    fixed_padding: True
model:
  backbone: fan_large
  train_backbone: False
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 900
  num_select: 100
  dropout_ratio: 0.0
  dim_feedforward: 2048
I am training DINO with the TAO Toolkit launcher.
Below are my VM's resources; the VM contains an A100 GPU.
I want to utilize the VM's resources fully, but I cannot push batch_size above 32, and with batch_size set to 32 I cannot raise workers above 10.
I can only raise workers to 20 by reducing batch_size to 16; the snippet below summarizes the two combinations that run.
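For reference, these are the only two combinations of the dataset settings that run for me (the numbers are exactly the ones described above, shown here in spec form):

# combination 1: larger batch, fewer workers
dataset:
  batch_size: 32
  workers: 10

# combination 2: smaller batch, more workers
dataset:
  batch_size: 16
  workers: 20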
Every other attempt to increase the batch size or the number of workers fails with a memory-related error, and the training container stops.
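Note that the train section of the spec above already enables the usual memory savers, copied here for reference:

train:
  precision: fp16              # mixed-precision training already enabled
  activation_checkpoint: True  # activation checkpointing already enabled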
Yet when I check resource usage, the CPUs are at less than 50% utilization and the GPU is also below 50%.
Can you advise what I can do to utilize my VM's resources efficiently here?