YOLOv4 accuracy difference between TAO and Darknet

Hi there,

We have been using Darknet for a while now and trained YOLOv4 on our person dataset (one class only) with 28000 images.
For comparison, we have also trained YOLOv4 with TAO using different backbones (CSPDarkNet-53, ResNet-34) and have tweaked some parameters in the config.
But when computing mAP@50 on our test set (4000 images), accuracy is always lower with TAO than with Darknet.

In both cases, YOLOv4 has been trained with 416x416 inputs. For TAO only we have computed anchor boxes on our dataset.
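For context, the anchors in the spec below came from the usual IoU-based k-means clustering of ground-truth box sizes. A minimal sketch of that general technique (our own illustration only, not TAO's internal code; `kmeans_anchors` and its arguments are made-up names):

```python
import numpy as np

def iou_wh(boxes, anchors):
    # IoU between (w, h) pairs assuming a shared top-left corner,
    # as in the standard YOLO anchor-clustering formulation.
    inter = np.minimum(boxes[:, None, 0], anchors[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], anchors[None, :, 1])
    union = boxes[:, 0:1] * boxes[:, 1:2] + anchors[:, 0] * anchors[:, 1] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, iters=100, seed=42):
    # boxes: (N, 2) array of ground-truth (width, height) pairs,
    # already scaled to the network input size (e.g. 416x416).
    boxes = np.asarray(boxes, dtype=float)
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # assign each box to the anchor it overlaps most
        assign = iou_wh(boxes, anchors).argmax(axis=1)
        for j in range(k):
            if np.any(assign == j):
                anchors[j] = boxes[assign == j].mean(axis=0)
    # sort by area so the result splits cleanly into small/mid/big groups
    return anchors[np.argsort(anchors[:, 0] * anchors[:, 1])]
```

The nine sorted anchors can then be split three ways into `small_anchor_shape`, `mid_anchor_shape`, and `big_anchor_shape`.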

Here are the best results we got so far:

  • YOLOv4 trained on Darknet for 105 epochs
    • mAP@0.5 = 89.0%
  • YOLOv4 trained on TAO for 120 epochs
    • mAP@0.5 = 84.0%

There is a 5-point mAP gap between the frameworks, which is quite a lot. We have tried tweaking some parameters in the config but couldn’t get above 84%. Training is also long and expensive on EC2 p3 instances, which makes it harder to search for the optimal config.

Questions:

  • We saw the blog post “Preparing State-of-the-Art Models for Classification and Object Detection with NVIDIA TAO Toolkit”, which shows a summary of YOLOv3 SOTA vs. TAO Toolkit accuracy. Have you done the same comparison for YOLOv4?
  • During your tests, were you able to achieve accuracy similar to Darknet, or is TAO always lower at the moment? If the latter, at least we won’t spend more time trying to optimise.
  • Looking at our spec file, do you see any parameters set up incorrectly that could affect our accuracy?
  • Why, in the TAO documentation, is ResNet recommended as the default backbone rather than CSPDarkNet-53?

Thank you for your help!

Config:

random_seed: 42
yolov4_config {
  big_anchor_shape: "[(49.09, 46.18), (37.86, 61.15), (51.58, 69.47)]"
  mid_anchor_shape: "[(41.60, 31.62), (32.86, 41.60), (27.46, 53.66)]"
  small_anchor_shape: "[(17.89, 23.30), (28.29, 25.38), (22.05, 35.78)]"
  box_matching_iou: 0.25
  matching_neutral_box_iou: 0.50
  arch: "cspdarknet"
  nlayers: 53
  arch_conv_blocks: 2
  loss_loc_weight: 0.8
  loss_neg_obj_weights: 100.0
  loss_class_weights: 1
  label_smoothing: 0.1
  big_grid_xy_extend: 0.05
  mid_grid_xy_extend: 0.1
  small_grid_xy_extend: 0.2
  freeze_bn: false
  freeze_blocks: 0
  force_relu: false
}
training_config {
  batch_size_per_gpu: 4
  num_epochs: 120
  enable_qat: false
  checkpoint_interval: 10
  learning_rate {
    soft_start_cosine_annealing_schedule {
      min_learning_rate: 1e-7 
      max_learning_rate: 1e-4
      soft_start: 0.3
    }
  }
  regularizer {
    type: L1
    weight: 3e-5
  }
  optimizer {
    adam {
      epsilon: 1e-7
      beta1: 0.9
      beta2: 0.999
      amsgrad: false
    }
  }
  pretrain_model_path: "/workspace/tlt-experiments/pretrained/cspdarknet_53.hdf5"
}
eval_config {
  average_precision_mode: INTEGRATE 
  batch_size: 8
  matching_iou_threshold: 0.50
}
nms_config {
  confidence_threshold: 0.005
  clustering_iou_threshold: 0.5
  top_k: 200
}
augmentation_config {
  hue: 0.1
  saturation: 1.5
  exposure: 1.5
  vertical_flip: 0
  horizontal_flip: 0.5
  jitter: 0.3
  output_width: 416
  output_height: 416
  output_channel: 3
  randomize_input_shape_period: 10
  mosaic_prob: 0.5
  mosaic_min_ratio: 0.2
  image_mean {
    key: 'b'
    value: 103.9
  }
  image_mean {
    key: 'g'
    value: 116.8
  }
  image_mean {
    key: 'r'
    value: 123.7
  }
}
dataset_config {
  data_sources: {
      label_directory_path: "/workspace/tlt-experiments/data/train/labels"
      image_directory_path: "/workspace/tlt-experiments/data/train/images"
  }
  include_difficult_in_training: true
  target_class_mapping {
      key: "person"
      value: "person"
  }
  validation_data_sources: {
      label_directory_path: "/workspace/tlt-experiments/data/test/labels"
      image_directory_path: "/workspace/tlt-experiments/data/test/images"
  }
}

Several comments here.

  1. Please use ImageNet-pretrained weights. As mentioned in the blog, you need to train a classification model on the ImageNet 2012 classification dataset; these ImageNet-pretrained weights can then be the starting point for training your YOLOv4 model. Weights pretrained on the ImageNet dataset tend to give good accuracy for object detection.
  2. Please fine-tune the parameters below, for example:
    comment out freeze_blocks so no backbone blocks are frozen: #freeze_blocks: 0
    lower the regularizer weight: weight: 3e-5 → weight: 3e-6
  3. We will add focal loss for YOLOv4, which also improves mAP in our tests.
  4. In the TAO documentation there is no recommended backbone; the docs just show a typical one, and end users can switch to any supported backbone.
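Applied to the spec above, the two suggested edits would look like this (fragment only, all other fields unchanged; the inline comments are ours):

```
yolov4_config {
  # ...
  # freeze_blocks: 0   # commented out so no backbone blocks stay frozen
}
training_config {
  # ...
  regularizer {
    type: L1
    weight: 3e-6   # lowered from 3e-5
  }
}
```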

Ok, thanks. We will try training from ImageNet-pretrained weights and get back to you. We also have to check the licensing, as we are a company rather than researchers, and it seems ImageNet is only usable for research purposes.

BTW, for Darknet, which config did you use in darknet/cfg at master · AlexeyAB/darknet · GitHub ?

We used yolov4-custom.cfg with 416x416 inputs, max_batches=45000 (around 105 epochs), and adapted it for 1 class. We haven’t modified the other parameters or the anchor boxes.
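As a sanity check on that epoch count: Darknet counts iterations (batches) rather than epochs, so with the batch=64 default of yolov4-custom.cfg (assumed here, adjust if changed), max_batches works out to roughly the figure above:

```python
# Convert Darknet max_batches (iterations) to approximate epochs.
# batch=64 is the yolov4-custom.cfg default; change it if your cfg differs.
def batches_to_epochs(max_batches: int, batch: int, num_train_images: int) -> float:
    images_seen = max_batches * batch
    return images_seen / num_train_images

print(batches_to_epochs(45000, 64, 28000))  # ≈ 102.9, i.e. "around 105 epochs"
```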