Hi there,
We have been using Darknet for a while now and have trained YOLOv4 on our person dataset (one class only, 28,000 images).
For comparison, we have also trained YOLOv4 with TAO using different backbones (CSPDarknet-53, ResNet-34) and have tweaked some parameters in the config.
But when computing mAP@50 on our test set (4,000 images), accuracy is always lower with TAO than with Darknet.
In both cases, YOLOv4 was trained with 416x416 inputs. For TAO only, we computed anchor boxes on our dataset.
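For context on how we got the TAO anchors: we clustered the (width, height) of our ground-truth boxes with k-means using 1 − IoU as the distance (TAO provides a kmeans task for this; the sketch below is a standalone re-implementation of the same idea, run here on synthetic box sizes rather than our real labels):

```python
import numpy as np

def kmeans_anchors(wh, k=9, iters=100, seed=42):
    """Cluster (w, h) box sizes with k-means, using IoU as the similarity."""
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        # IoU between every box and every center (boxes aligned at origin)
        inter = np.minimum(wh[:, None, 0], centers[None, :, 0]) * \
                np.minimum(wh[:, None, 1], centers[None, :, 1])
        union = wh[:, None].prod(axis=2) + centers.prod(axis=1) - inter
        assign = np.argmax(inter / union, axis=1)  # highest IoU = closest
        for j in range(k):
            if np.any(assign == j):
                centers[j] = wh[assign == j].mean(axis=0)
    # sort by area: small -> mid -> big anchor groups
    return centers[np.argsort(centers.prod(axis=1))]

# Toy example: tall, person-like rectangles (NOT our real dataset)
wh = np.abs(np.random.default_rng(0).normal([30, 50], [10, 15], size=(500, 2)))
anchors = kmeans_anchors(wh, k=9)
print(anchors)
```

The 9 sorted centers are then split 3/3/3 into small/mid/big anchor shapes as in the spec file below.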
Here are the best results we got so far:
- YOLOv4 trained with Darknet for 105 epochs: mAP@0.5 = 89.0%
- YOLOv4 trained with TAO for 120 epochs: mAP@0.5 = 84.0%
That is a 5-point mAP gap between the frameworks, which is quite a lot. We have tried tweaking some config parameters but couldn't get higher than 84%. Training is also quite long and expensive on EC2 p3 instances, which makes it harder to search for the optimal config.
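One thing we are double-checking on our side before comparing raw numbers: TAO's eval_config uses average_precision_mode: INTEGRATE (area under the precision-recall curve), while Darknet's map command depends (if we read its README correctly) on the -points flag (11-point VOC-style sampling vs. area under the curve). The two AP definitions alone can shift the score by a point or so on identical detections. A small illustration on a toy PR curve (not our data):

```python
import numpy as np

def ap_integrate(recall, precision):
    """Area under the PR curve (INTEGRATE-style AP)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # make precision monotonically non-increasing (precision envelope)
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def ap_11point(recall, precision):
    """Classic VOC2007 11-point sampled AP."""
    ap = 0.0
    for t in np.linspace(0.0, 1.0, 11):
        mask = recall >= t
        ap += (precision[mask].max() if mask.any() else 0.0) / 11.0
    return ap

# Toy PR curve: precision falls linearly as recall rises
recall = np.linspace(0.05, 0.95, 19)
precision = 1.0 - 0.5 * recall
print(ap_integrate(recall, precision), ap_11point(recall, precision))
```

So an apples-to-apples comparison needs both frameworks evaluated with the same AP definition (and the same IoU threshold and confidence cutoff).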
Questions:
- We saw the blog "Preparing State-of-the-Art Models for Classification and Object Detection with NVIDIA TAO Toolkit", which shows a summary of YOLOv3 SOTA vs. TAO Toolkit accuracy. Have you done the same comparison with YOLOv4?
- In your tests, are you able to achieve accuracy similar to Darknet's, or is TAO always lower at the moment? If it is the latter, we at least won't spend more time trying to optimise.
- Looking at our spec file, can you see any parameters set up incorrectly that could affect our accuracy?
- Why does the TAO documentation recommend ResNet as the default backbone rather than CSPDarknet-53?
Thank you for your help!
Config:
random_seed: 42
yolov4_config {
  big_anchor_shape: "[(49.09, 46.18), (37.86, 61.15), (51.58, 69.47)]"
  mid_anchor_shape: "[(41.60, 31.62), (32.86, 41.60), (27.46, 53.66)]"
  small_anchor_shape: "[(17.89, 23.30), (28.29, 25.38), (22.05, 35.78)]"
  box_matching_iou: 0.25
  matching_neutral_box_iou: 0.50
  arch: "cspdarknet"
  nlayers: 53
  arch_conv_blocks: 2
  loss_loc_weight: 0.8
  loss_neg_obj_weights: 100.0
  loss_class_weights: 1
  label_smoothing: 0.1
  big_grid_xy_extend: 0.05
  mid_grid_xy_extend: 0.1
  small_grid_xy_extend: 0.2
  freeze_bn: false
  freeze_blocks: 0
  force_relu: false
}
training_config {
  batch_size_per_gpu: 4
  num_epochs: 120
  enable_qat: false
  checkpoint_interval: 10
  learning_rate {
    soft_start_cosine_annealing_schedule {
      min_learning_rate: 1e-7
      max_learning_rate: 1e-4
      soft_start: 0.3
    }
  }
  regularizer {
    type: L1
    weight: 3e-5
  }
  optimizer {
    adam {
      epsilon: 1e-7
      beta1: 0.9
      beta2: 0.999
      amsgrad: false
    }
  }
  pretrain_model_path: "/workspace/tlt-experiments/pretrained/cspdarknet_53.hdf5"
}
eval_config {
  average_precision_mode: INTEGRATE
  batch_size: 8
  matching_iou_threshold: 0.50
}
nms_config {
  confidence_threshold: 0.005
  clustering_iou_threshold: 0.5
  top_k: 200
}
augmentation_config {
  hue: 0.1
  saturation: 1.5
  exposure: 1.5
  vertical_flip: 0
  horizontal_flip: 0.5
  jitter: 0.3
  output_width: 416
  output_height: 416
  output_channel: 3
  randomize_input_shape_period: 10
  mosaic_prob: 0.5
  mosaic_min_ratio: 0.2
  image_mean {
    key: 'b'
    value: 103.9
  }
  image_mean {
    key: 'g'
    value: 116.8
  }
  image_mean {
    key: 'r'
    value: 123.7
  }
}
dataset_config {
  data_sources: {
    label_directory_path: "/workspace/tlt-experiments/data/train/labels"
    image_directory_path: "/workspace/tlt-experiments/data/train/images"
  }
  include_difficult_in_training: true
  target_class_mapping {
    key: "person"
    value: "person"
  }
  validation_data_sources: {
    label_directory_path: "/workspace/tlt-experiments/data/test/labels"
    image_directory_path: "/workspace/tlt-experiments/data/test/images"
  }
}
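For reference on the learning-rate settings in the spec above: assuming soft_start_cosine_annealing_schedule means a linear warm-up over the first soft_start fraction of training followed by cosine decay (our reading of the docs; the exact warm-up shape may differ), the effective schedule looks roughly like this:

```python
import math

def lr_at(progress, min_lr=1e-7, max_lr=1e-4, soft_start=0.3):
    """Approximate soft-start cosine annealing.

    progress is the fraction of training completed, in [0, 1]:
    linear warm-up to max_lr until soft_start, then cosine decay to min_lr.
    """
    if progress < soft_start:
        return min_lr + (max_lr - min_lr) * progress / soft_start
    t = (progress - soft_start) / (1.0 - soft_start)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * t))

for p in (0.0, 0.15, 0.3, 0.65, 1.0):
    print(f"progress {p:.2f} -> lr {lr_at(p):.2e}")
```

With soft_start: 0.3 and 120 epochs, the LR only peaks around epoch 36, so the model spends a third of the run warming up; that might be one knob worth revisiting.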