Hi,
I tried to train YOLOv3 with a Darknet-53 pre-trained backbone on the VOC07+12 dataset, but the mAP (INTEGRATE) score is only about 57% after 120 epochs, which is a significant discrepancy compared with the benchmark (about 82% mAP). The loss plateaued at about 3 after 120 epochs. Does anyone have a similar issue or some clues about my problem? I’ve attached my config file below. Thanks!
To improve mAP, replacing the pretrained model can be an option. For example, use the TLT classification network to train a pretrained TLT model on the ImageNet dataset.
Thanks a lot for sharing! By the way, I would like to ask a few more details about this experiment and its training configuration in order to improve mine:
For the ‘regularizer’, what is the difference between L1 and L2 (I followed the official TLT guide, which recommends L1), and does that choice have an impact on accuracy?
For the ‘annealing’ point, my choice was a little later, like 0.7 (actually I’m not familiar with this training schedule). Are there any common rules for choosing this value?
How did you get those anchor sizes for YOLO? I tried the kmeans.py provided in the YOLO example of TLT, but I got different anchors than yours (PS: my dataset is VOC07+12 resized to 416x416).
Thank you!
Actually I did not run more experiments to fine-tune all the parameters. I just copied one of my old training specs for yolo_v3 and started training.
For 1), L1 training makes the model easier to prune. I think you can keep your L2 setting; it should not have a significant impact on accuracy. If you have time, you can run experiments with both L1 and L2.
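For context, the regularizer is just a small block inside the training_config section of the spec file, so switching between the two is a one-line change (the field names below follow my old yolo_v3 spec; please double-check them against your TLT version):

```
training_config {
  regularizer {
    type: L1      # switch to L2 to compare
    weight: 3e-5  # regularization weight, worth sweeping per experiment
  }
}
```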
For 2), for annealing, you can check the TLT user guide: https://pgambrill.gitlab-master-pages.nvidia.com/tlt-docs/text/creating_experiment_spec.html#specification-file-for-detectnet-v2 . There are no common rules. If it is set to 0.7, the annealing phase is shorter and the time spent at max_lr is longer; if set to 0.5, the annealing phase is longer and the time at max_lr is shorter. You can run experiments to check which is better for your case.
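To make that tradeoff concrete, here is a small sketch of how a soft-start annealing schedule behaves. This is my own illustration, not TLT's actual implementation; the parameter names (min_lr, max_lr, soft_start, annealing) just mirror the spec fields:

```python
def soft_start_annealing_lr(progress, min_lr=1e-6, max_lr=1e-4,
                            soft_start=0.1, annealing=0.7):
    """Sketch of a soft-start annealing learning-rate schedule.

    progress: fraction of training completed, in [0, 1].
    Ramps exponentially from min_lr to max_lr during the soft-start
    phase, holds max_lr, then decays back toward min_lr once
    `progress` passes the `annealing` point.
    """
    if progress < soft_start:
        # warm-up: interpolate from min_lr toward max_lr
        t = progress / soft_start
    elif progress < annealing:
        # plateau at max_lr
        t = 1.0
    else:
        # decay back toward min_lr over the remaining progress
        t = 1.0 - (progress - annealing) / (1.0 - annealing)
    # exponential interpolation between min_lr and max_lr
    return min_lr * (max_lr / min_lr) ** t
```

With annealing=0.7 the plateau lasts until 70% of training and the decay is squeezed into the last 30%; with annealing=0.5 the decay starts at the halfway point and is more gradual.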
For 3), I think this has more impact on accuracy. Yes, I used kmeans.py, but I ran it against the VOC07 training dataset along with the VOC2012 training dataset: I copied the two datasets into one folder, with all images resized to 416x416.
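As an illustration of what kmeans.py is doing, here is a toy k-means over box (width, height) pairs. It is a simplified stand-in, not the TLT script: anchor clustering often uses 1 - IoU as the distance metric, while this sketch uses plain squared Euclidean distance for brevity:

```python
def kmeans_anchors(boxes, k=9, iters=100):
    """Toy k-means over (width, height) pairs to pick YOLO anchor sizes.

    boxes: list of (w, h) tuples, already scaled to the training
    resolution (e.g. 416x416). Returns k anchor sizes sorted by width.
    """
    centers = list(boxes[:k])  # deterministic init: first k boxes
    for _ in range(iters):
        # assign each box to its nearest center
        clusters = [[] for _ in range(k)]
        for w, h in boxes:
            i = min(range(k),
                    key=lambda j: (w - centers[j][0]) ** 2
                                + (h - centers[j][1]) ** 2)
            clusters[i].append((w, h))
        # move each center to the mean of its cluster
        centers = [
            (sum(w for w, _ in c) / len(c), sum(h for _, h in c) / len(c))
            if c else centers[j]
            for j, c in enumerate(clusters)
        ]
    return sorted(centers)
```

This also shows why we got different anchors: the clusters depend entirely on the box statistics fed in, so running it on VOC07 alone versus VOC07+12, or at a different input resolution, will shift the results.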
One more difference I want to highlight: I set a different validation dataset. I use the VOC07 test dataset.
Thank you for your detailed answer! I will try more experiments with different configs to see if I can get a better result. Besides, is there any chance you could share the training log of your experiment, please? I would like to compare how the loss varies.