TLT Training duration

Hi,

We are trying to run a training job with TLT on an AWS EC2 instance, but it is taking a very long time to complete.

Using 1 GPU → 2 h 30 min per epoch.
Using 4 GPUs → less than 1 hour per epoch.
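
For what it's worth, those two measurements already pin down the multi-GPU scaling we are getting; a quick sketch of the arithmetic (the 4-GPU time is "less than an hour", so 1.0 h is used as an upper bound):

# Multi-GPU scaling from the measured per-epoch times above.
single_gpu_hours = 2.5   # 1 GPU: 2 h 30 min per epoch
multi_gpu_hours = 1.0    # 4 GPUs: "less than 1 hour" -> upper bound
num_gpus = 4

speedup = single_gpu_hours / multi_gpu_hours   # >= 2.5x
efficiency = speedup / num_gpus                # >= ~62% per GPU

print(f"speedup: >= {speedup:.1f}x, scaling efficiency: >= {efficiency:.0%}")

So the multi-GPU scaling itself looks reasonable; it is the absolute per-epoch time that seems high.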

Before trying TLT, we ran some trainings using Darknet from the AlexeyAB repository: https://github.com/AlexeyAB/darknet

Even with more images and epochs, Darknet seems to train faster than TLT.

Why does TLT take so long to train?

Thanks in advance.

Similar topic: Object detection training duration

Could you please share your training spec?

This is the spec file we are using:

random_seed: 42
yolov4_config {
  big_anchor_shape: "[(208.65, 43.20), (171.60, 78.93), (275.93, 290.14)]"
  mid_anchor_shape: "[(110.17, 28.80), (195.00, 24.53), (121.88, 53.87)]"
  small_anchor_shape: "[(55.57, 22.40), (44.85, 34.13), (73.12, 42.67)]"
  box_matching_iou: 0.25
  arch: "cspdarknet"
  nlayers: 19
  arch_conv_blocks: 2
  loss_loc_weight: 0.8
  loss_neg_obj_weights: 100.0
  loss_class_weights: 0.5
  label_smoothing: 0.0
  big_grid_xy_extend: 0.05
  mid_grid_xy_extend: 0.1
  small_grid_xy_extend: 0.2
  freeze_bn: false
  freeze_blocks: 0
  force_relu: false
}
training_config {
  batch_size_per_gpu: 8
  num_epochs: 10
  enable_qat: false
  checkpoint_interval: 10
  learning_rate {
    soft_start_cosine_annealing_schedule {
      min_learning_rate: 1e-7
      max_learning_rate: 1e-4
      soft_start: 0.3
    }
  }
  regularizer {
    type: L1
    weight: 3e-5
  }
  optimizer {
    adam {
      epsilon: 1e-7
      beta1: 0.9
      beta2: 0.999
      amsgrad: false
    }
  }
  pretrain_model_path: "/home/ubuntu/Documents/prueba_tlt3.0/data/tlt_pretrained_object_detection_vcspdarknet19/cspdarknet_19.hdf5"
}
eval_config {
  average_precision_mode: SAMPLE
  batch_size: 8
  matching_iou_threshold: 0.5
}
nms_config {
  confidence_threshold: 0.001
  clustering_iou_threshold: 0.5
  top_k: 200
}
augmentation_config {
  hue: 0.1
  saturation: 1.5
  exposure: 1.5
  vertical_flip: 0
  horizontal_flip: 0.5
  jitter: 0.3
  output_width: 1248
  output_height: 384
  randomize_input_shape_period: 0
  mosaic_prob: 0.5
  mosaic_min_ratio: 0.2
}
dataset_config {
  data_sources: {
    label_directory_path: "/home/ubuntu/Documents/prueba_tlt3.0/data/train/labels/"
    image_directory_path: "/home/ubuntu/Documents/prueba_tlt3.0/data/train/images/"
  }
  include_difficult_in_training: true
  target_class_mapping {
    key: "Casco"
    value: "Casco"
  }
  target_class_mapping {
    key: "Auriculares"
    value: "Auriculares"
  }
  target_class_mapping {
    key: "Guantes_latex"
    value: "Guantes_latex"
  }
  target_class_mapping {
    key: "Guantes_rojos"
    value: "Guantes_rojos"
  }
  target_class_mapping {
    key: "Guantes_azules"
    value: "Guantes_azules"
  }
  target_class_mapping {
    key: "Zapato"
    value: "Zapato"
  }
  target_class_mapping {
    key: "Bota"
    value: "Bota"
  }
  target_class_mapping {
    key: "Caja_auriculares"
    value: "Caja_auriculares"
  }
  target_class_mapping {
    key: "Reflector"
    value: "Reflector"
  }
  target_class_mapping {
    key: "Persona"
    value: "Persona"
  }
  target_class_mapping {
    key: "Auriculares2"
    value: "Auriculares2"
  }
  target_class_mapping {
    key: "Guantes_amarillos"
    value: "Guantes_amarillos"
  }
  validation_data_sources: {
    label_directory_path: "/home/ubuntu/Documents/prueba_tlt3.0/data/valid/labels/"
    image_directory_path: "/home/ubuntu/Documents/prueba_tlt3.0/data/valid/images/"
  }
}
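
One factor we can quantify directly from the spec is the input resolution: per-image compute grows roughly with pixel count, and we train at 1248×384. A rough back-of-the-envelope comparison against Darknet (assuming Darknet ran at its common 416×416 default; the actual network size we used there is not listed in this thread):

# Rough per-image workload comparison by input pixel count.
tlt_pixels = 1248 * 384     # output_width x output_height from the spec above
darknet_pixels = 416 * 416  # ASSUMPTION: Darknet's common default network size

ratio = tlt_pixels / darknet_pixels
print(f"TLT processes ~{ratio:.1f}x more pixels per image")  # ~2.8x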

Thanks for the info. We will dig into it more. BTW, what batch size did you use when running trainings with Darknet from the AlexeyAB repository?

We used:
batch=64
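
For comparison, our TLT spec uses batch_size_per_gpu: 8, i.e. an effective global batch of 8 × 4 = 32 on 4 GPUs, so TLT performs twice as many optimizer steps per epoch as Darknet at batch 64. A minimal sketch of that arithmetic (the image count below is a placeholder, not our real dataset size):

# Steps per epoch at each effective batch size.
num_images = 10000       # PLACEHOLDER: substitute the real training-set size
tlt_batch = 8 * 4        # batch_size_per_gpu=8 on 4 GPUs
darknet_batch = 64       # batch=64 from the Darknet cfg

print(f"TLT: {num_images // tlt_batch} steps/epoch, "
      f"Darknet: {num_images // darknet_batch} steps/epoch")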