TLT yolo_v4 slow training


We are trying to train a TLT yolo_v4 model. We have a custom dataset of 25,000 images and are training on 2 GPUs (GeForce RTX 2080 Ti), driver version: 455.32.00, CUDA version: 11.1, TLT version: 3.0.

Despite the small dataset, each epoch takes one hour. Would you say this is expected? Or is something wrong?

The command we used is: tlt yolo_v4 train --gpus 2 -e /path/to/spec.txt -r /path/to/result -k $KEY

Here is an extract from the config:

random_seed: 42
yolov4_config {
  big_anchor_shape: "[(87.07, 119.20), (119.47, 87.33), (124.67, 123.07)]"
  mid_anchor_shape: "[(78.13, 78.13), (59.73, 105.20), (106.93, 60.80)]"
  small_anchor_shape: "[(36.67, 35.87), (48.00, 66.27), (68.13, 48.53)]"
  box_matching_iou: 0.25
  arch: "cspdarknet"
  nlayers: 19
  arch_conv_blocks: 2
  loss_loc_weight: 0.8
  loss_neg_obj_weights: 100.0
  loss_class_weights: 0.5
  label_smoothing: 0.0
  big_grid_xy_extend: 0.05
  mid_grid_xy_extend: 0.1
  small_grid_xy_extend: 0.2
  freeze_bn: false
  #freeze_blocks: 0
  force_relu: false
}
training_config {
  batch_size_per_gpu: 8
  num_epochs: 200
  enable_qat: true
  checkpoint_interval: 10
  learning_rate {
    soft_start_cosine_annealing_schedule {
      min_learning_rate: 1e-7
      max_learning_rate: 1e-4
      soft_start: 0.3
    }
  }
  regularizer {
    type: L1
    weight: 3e-5
  }
  optimizer {
    adam {
      epsilon: 1e-7
      beta1: 0.9
      beta2: 0.999
      amsgrad: false
    }
  }
}
eval_config {
  average_precision_mode: SAMPLE
  batch_size: 16
  matching_iou_threshold: 0.5
}
nms_config {
  confidence_threshold: 0.001
  clustering_iou_threshold: 0.5
  top_k: 200
}
augmentation_config {
  hue: 0.1
  saturation: 1.5
  horizontal_flip: 0.5
  jitter: 0.3
  output_width: 512
  output_height: 288
  randomize_input_shape_period: 0
  mosaic_prob: 0.5
}

Thanks for the help

Currently, this is a known limitation of the TLT 3.0_dp version. We're working on it internally.

Thanks for your response. Is there anything we can do in the meantime to improve training time?

Also, what is the ETA for fixing this issue?

Currently there is no workaround for it. The internal team is working on an improvement for the next release.

Is there not a previous version of the TLT container that doesn’t have this problem?

TLT 2.0_py3 should be faster, but it does not include yolo_v4; it only has yolo_v3.

Please note that there are new features to tune the number of data-loading workers and use_multiprocessing in the TLT 3.0 version docker. They will help improve training speed.
For example, in yolo_v4, see YOLOv4 — Transfer Learning Toolkit 3.0 documentation

n_workers: The number of workers for data loading per GPU
use_multiprocessing: Whether to use the multiprocessing mode of the Keras sequence data loader

Any update on this problem? I just ran a yolo_v4 training on an EC2 p3.8xlarge. It took 20h of training with TLT vs 9h with Darknet for the same number of epochs…

I didn’t use the parameters you mentioned because I used the default config, but I don’t think that’s the main issue here.

Hi, for us the only thing that helped was resizing the images manually first and then feeding them to the network for training (there is likely a problem with the default data loader?). Hope it helps.
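For reference, a minimal sketch of that offline-resize workaround. Everything here is an assumption about one possible workflow, not TLT API: 512x288 comes from output_width/output_height in the spec above, and `scale_box` shows the matching adjustment you would need for KITTI-style (xmin, ymin, xmax, ymax) labels.

```python
# Hypothetical offline pre-resize: shrink all images to the network input
# resolution once, so the TLT dataloader no longer resizes on the fly.
import os

TARGET_W, TARGET_H = 512, 288  # output_width / output_height from the spec


def scale_box(box, src_w, src_h, dst_w=TARGET_W, dst_h=TARGET_H):
    """Scale a KITTI-style (xmin, ymin, xmax, ymax) box to the resized image."""
    sx, sy = dst_w / src_w, dst_h / src_h
    xmin, ymin, xmax, ymax = box
    return (xmin * sx, ymin * sy, xmax * sx, ymax * sy)


def resize_images(src_dir, dst_dir):
    """Resize every image in src_dir to TARGET_W x TARGET_H (requires Pillow)."""
    from PIL import Image  # imported here so the box math above works without it
    os.makedirs(dst_dir, exist_ok=True)
    for name in os.listdir(src_dir):
        with Image.open(os.path.join(src_dir, name)) as im:
            resized = im.resize((TARGET_W, TARGET_H), Image.BILINEAR)
            resized.save(os.path.join(dst_dir, name))
```

Remember to rescale the label files with the same factors, otherwise the boxes no longer match the images.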

In the TLT 3.0-py3 docker, please set the parameters below to speed up training:

  • max_queue_size
  • n_workers
  • use_multiprocessing

See YOLOv4 — Transfer Learning Toolkit 3.0 documentation
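Putting those three together, the training_config block of the spec might look like this (the worker and queue values are illustrative assumptions, not recommendations; tune them for your machine):

```
training_config {
  batch_size_per_gpu: 8
  num_epochs: 200
  # illustrative values -- tune n_workers to your CPU core count
  n_workers: 4
  max_queue_size: 16
  use_multiprocessing: true
}
```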