TLT yolo_v4 slow training

Hello,

We are trying to train a TLT yolo_v4 model. We have a custom dataset of 25,000 images and are training on 2 GPUs (GeForce RTX 2080 Ti), driver version 455.32.00, CUDA version 11.1, TLT version 3.0.

Despite the small dataset, each epoch takes one hour. Would you say this is expected, or is something wrong?

The command we used is: tlt yolo_v4 train --gpus 2 -e /path/to/spec.txt -r /path/to/result -k $KEY

Here is an extract from the config:

random_seed: 42
yolov4_config {
big_anchor_shape: "[(87.07, 119.20), (119.47, 87.33), (124.67, 123.07)]"
mid_anchor_shape: "[(78.13, 78.13), (59.73, 105.20), (106.93, 60.80)]"
small_anchor_shape: "[(36.67, 35.87), (48.00, 66.27), (68.13, 48.53)]"
box_matching_iou: 0.25
arch: "cspdarknet"
nlayers: 19
arch_conv_blocks: 2
loss_loc_weight: 0.8
loss_neg_obj_weights: 100.0
loss_class_weights: 0.5
label_smoothing: 0.0
big_grid_xy_extend: 0.05
mid_grid_xy_extend: 0.1
small_grid_xy_extend: 0.2
freeze_bn: false
#freeze_blocks: 0
force_relu: false
}
training_config {
batch_size_per_gpu: 8
num_epochs: 200
enable_qat: true
checkpoint_interval: 10
learning_rate {
soft_start_cosine_annealing_schedule {
min_learning_rate: 1e-7
max_learning_rate: 1e-4
soft_start: 0.3
}
}
regularizer {
type: L1
weight: 3e-5
}
optimizer {
adam {
epsilon: 1e-7
beta1: 0.9
beta2: 0.999
amsgrad: false
}
}
}
eval_config {
average_precision_mode: SAMPLE
batch_size: 16
matching_iou_threshold: 0.5
}
nms_config {
confidence_threshold: 0.001
clustering_iou_threshold: 0.5
top_k: 200
}
augmentation_config {
hue: 0.1
saturation: 1.5
exposure: 1.5
vertical_flip: 0.5
horizontal_flip: 0.5
jitter: 0.3
output_width: 512
output_height: 288
randomize_input_shape_period: 0
mosaic_prob: 0.5
mosaic_min_ratio: 0.2
}
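
For reference, here is a rough back-of-envelope throughput estimate from the numbers above (a sketch in Python with illustrative variable names, assuming the effective batch size is batch_size_per_gpu × number of GPUs; the exact step count per epoch may differ slightly):

num_images = 25000
batch_size_per_gpu = 8
num_gpus = 2
steps_per_epoch = num_images / (batch_size_per_gpu * num_gpus)  # ~1563 steps
seconds_per_step = 3600 / steps_per_epoch                       # ~2.3 s/step at 1 h/epoch
print(f"{steps_per_epoch:.0f} steps/epoch, {seconds_per_step:.2f} s/step")

That works out to roughly 2.3 seconds per training step at 512x288 input.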

Thanks for the help

Currently, this is a known limitation of the TLT 3.0_dp version. We are working on it internally.

Thanks for your response. Is there anything we can do in the meantime to improve training time?

Also, what is the ETA for fixing this issue?

Currently there is no workaround for it. The internal team is working on an improvement for the next release.

Is there an earlier version of the TLT container that doesn't have this problem?

TLT 2.0_py3 should be faster, but it does not have yolo_v4, only yolo_v3.
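
For reference, the rough equivalent inside the TLT 2.0_py3 container would be the yolo_v3 training entry point, something like: tlt-train yolo --gpus 2 -e /path/to/spec.txt -r /path/to/result -k $KEY (this is from memory, so please check the TLT 2.0 documentation for the exact syntax; note also that the yolo_v3 spec file format differs from yolo_v4).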