I am training an EfficientDet model on 56k images. During training each epoch takes about 30 minutes at 70-100% GPU utilization, but each evaluation cycle takes 4-5 hours at almost 0% GPU utilization; only every 30-50 seconds does utilization briefly rise to 40-45% for 2-5 seconds.
Also, after every checkpoint save it logs a list of warnings:
How many images are in your evaluation dataset? Could you also share the training spec? Thanks.
training_config {
  train_batch_size: 20
  iterations_per_loop: 10
  checkpoint_period: 2
  num_examples_per_epoch: 55743
  num_epochs: 20
  #model_name: 'efficientdet-d1'
  profile_skip_steps: 100
  tf_random_seed: 42
  lr_warmup_epoch: 1
  lr_warmup_init: 1e-05
  learning_rate: 0.0005
  amp: True
  moving_average_decay: 0.9999
  l2_weight_decay: 0.0001
  l1_weight_decay: 0.0
  checkpoint: "/workspace/TAO/pretrained_models/efficientdet_pretrained_models/efficientnet_b2.hdf5"
  skip_checkpoint_variables: "-predict*"
}
dataset_config {
  num_classes: 3
  image_size: "544,960"
  training_file_pattern: "/workspace/TAO/T_4/tfrecords/train-*"
  validation_file_pattern: "/workspace/TAO/T_4/tfrecords/val-*"
  validation_json_file: "/workspace/DAB/D_2/val/val_COCO.json"
  max_instances_per_image: 100
  skip_crowd_during_training: True
}
eval_config {
  eval_batch_size: 24
  eval_epoch_cycle: 2
  eval_after_training: True
  eval_samples: 19666
  min_score_thresh: 0.4
  max_detections_per_image: 100
}
model_config {
  model_name: "efficientdet-d2"
  min_level: 3
  max_level: 7
  num_scales: 3
  anchor_scale: 1
}
augmentation_config {
Could you try running "tao efficientdet evaluate xxx" against one of the .tlt models to double check?
Also, you can change the settings below to narrow this down, for example as in the snippet that follows:
lower eval_batch_size
or lower eval_samples
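A reduced eval_config for the evaluation cycles that run during training might look like the snippet below; the values are illustrative assumptions only, not a recommendation:

eval_config {
  eval_batch_size: 8
  eval_epoch_cycle: 2
  eval_after_training: True
  eval_samples: 2000    # evaluate on a subset instead of all 19666 validation images
  min_score_thresh: 0.4
  max_detections_per_image: 100
}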
It took just as long as usual. I also tried lowering the batch size down to 8, but the issue was the same.
Could you run the command below? I just want to check the size of the train/val tfrecord files.
$ ll -sh /workspace/TAO/T_4/tfrecords/*
The total size of the tfrecords is about 35 GB.
Can you share the output of $ ll -sh /workspace/TAO/T_4/tfrecords/* ?
2.7G -rw-r--r-- 1 root root 2.7G Mar 21 11:14 /workspace/TAO/T_4/tfrecords/train-00000-of-00010.tfrecord
2.7G -rw-r--r-- 1 root root 2.7G Mar 21 11:14 /workspace/TAO/T_4/tfrecords/train-00001-of-00010.tfrecord
2.7G -rw-r--r-- 1 root root 2.7G Mar 21 11:14 /workspace/TAO/T_4/tfrecords/train-00002-of-00010.tfrecord
2.7G -rw-r--r-- 1 root root 2.7G Mar 21 11:14 /workspace/TAO/T_4/tfrecords/train-00003-of-00010.tfrecord
2.7G -rw-r--r-- 1 root root 2.7G Mar 21 11:14 /workspace/TAO/T_4/tfrecords/train-00004-of-00010.tfrecord
2.7G -rw-r--r-- 1 root root 2.7G Mar 21 11:14 /workspace/TAO/T_4/tfrecords/train-00005-of-00010.tfrecord
2.7G -rw-r--r-- 1 root root 2.7G Mar 21 11:14 /workspace/TAO/T_4/tfrecords/train-00006-of-00010.tfrecord
2.7G -rw-r--r-- 1 root root 2.7G Mar 21 11:14 /workspace/TAO/T_4/tfrecords/train-00007-of-00010.tfrecord
2.7G -rw-r--r-- 1 root root 2.7G Mar 21 11:14 /workspace/TAO/T_4/tfrecords/train-00008-of-00010.tfrecord
2.7G -rw-r--r-- 1 root root 2.7G Mar 21 11:14 /workspace/TAO/T_4/tfrecords/train-00009-of-00010.tfrecord
829M -rw-r--r-- 1 root root 829M Mar 21 11:15 /workspace/TAO/T_4/tfrecords/val-00000-of-00010.tfrecord
820M -rw-r--r-- 1 root root 820M Mar 21 11:15 /workspace/TAO/T_4/tfrecords/val-00001-of-00010.tfrecord
826M -rw-r--r-- 1 root root 826M Mar 21 11:15 /workspace/TAO/T_4/tfrecords/val-00002-of-00010.tfrecord
823M -rw-r--r-- 1 root root 823M Mar 21 11:15 /workspace/TAO/T_4/tfrecords/val-00003-of-00010.tfrecord
822M -rw-r--r-- 1 root root 822M Mar 21 11:15 /workspace/TAO/T_4/tfrecords/val-00004-of-00010.tfrecord
827M -rw-r--r-- 1 root root 827M Mar 21 11:15 /workspace/TAO/T_4/tfrecords/val-00005-of-00010.tfrecord
824M -rw-r--r-- 1 root root 824M Mar 21 11:15 /workspace/TAO/T_4/tfrecords/val-00006-of-00010.tfrecord
820M -rw-r--r-- 1 root root 820M Mar 21 11:15 /workspace/TAO/T_4/tfrecords/val-00007-of-00010.tfrecord
819M -rw-r--r-- 1 root root 819M Mar 21 11:15 /workspace/TAO/T_4/tfrecords/val-00008-of-00010.tfrecord
829M -rw-r--r-- 1 root root 829M Mar 21 11:15 /workspace/TAO/T_4/tfrecords/val-00009-of-00010.tfrecord
After checking: during evaluation, NMS uses a NumPy implementation, which runs on the CPU, so this result is expected.
You can reduce eval_samples to 500 or so during training, which will shorten the evaluation time, and then run evaluate with the full validation set after training is done.
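To illustrate why this step is CPU-bound, here is a minimal sketch of a greedy NMS loop written with plain NumPy. It is not TAO's actual implementation; the function name, thresholds, and random inputs are made up for the example. The point is that this kind of single-threaded, per-image loop over thousands of candidate boxes never touches the GPU, which matches the near-0% GPU utilization you see.

import numpy as np

def nms_numpy(boxes, scores, iou_thresh=0.5):
    # Greedy NMS over [x1, y1, x2, y2] boxes; pure NumPy, single-threaded on the CPU.
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]                # highest-scoring box first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the current best box against all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]      # drop boxes that overlap too much, repeat
    return keep

# Hypothetical workload: a few thousand candidate boxes, repeated for every validation image
boxes = np.random.rand(5000, 4) * 500.0
boxes[:, 2:] += boxes[:, :2]                      # ensure x2 > x1 and y2 > y1
scores = np.random.rand(5000)
kept = nms_numpy(boxes, scores)

Running something of this shape once per validation image adds up quickly at eval_samples: 19666, which is why reducing eval_samples for the in-training evaluation cycles helps.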
By the way, instead of efficientdet-d2 you can also consider YOLOv4. Refer to https://www.researchgate.net/figure/Comparison-of-the-proposed-YOLOv4-and-other-state-of-the-art-object-detectors-YOLOv4_fig1_340883401
This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.