TLT AP value 0.0 for all classes

We have a custom shoe dataset with 3 classes: air_max, air_force_1, and huaraches. It is a small dataset: 152 images with labels in the training folder and 41 images (20%) in the testing folder. When we run TLT training with the sample YOLO notebook, all of our AP values come out as 0.0.

This is our config for generating the TFRecords:

kitti_config {
  root_directory_path: "/workspace/shoe_experiment/data/training"
  image_dir_name: "image_2"
  label_dir_name: "label_2"
  image_extension: ".png"
  partition_mode: "random"
  num_partitions: 2
  val_split: 20
  num_shards: 10
}
image_directory_path: "/workspace/shoe_experiment/data/training"
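
For reference, a quick way to sanity-check that the KITTI-style layout the spec points at is consistent (every image in image_2 has a matching label file in label_2) would be something like this minimal sketch:

import os

root = "/workspace/shoe_experiment/data/training"
images = {os.path.splitext(f)[0] for f in os.listdir(os.path.join(root, "image_2"))
          if f.lower().endswith(".png")}
labels = {os.path.splitext(f)[0] for f in os.listdir(os.path.join(root, "label_2"))
          if f.lower().endswith(".txt")}

# Any mismatch means some samples cannot be paired up during conversion.
print("images without labels:", sorted(images - labels))
print("labels without images:", sorted(labels - images))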

And these are our training specs:

    random_seed: 42
    yolo_config {
      big_anchor_shape: "[(114.94, 60.67), (159.06, 114.59), (297.59, 176.38)]"
      mid_anchor_shape: "[(42.99, 31.91), (79.57, 31.75), (56.80, 56.93)]"
      small_anchor_shape: "[(15.60, 13.88), (30.25, 20.25), (20.67, 49.63)]"
      matching_neutral_box_iou: 0.5

      arch: "darknet"
      nlayers: 19
      arch_conv_blocks: 2

      loss_loc_weight: 0.75
      loss_neg_obj_weights: 200.0
      loss_class_weights: 1.0

      freeze_blocks: 0
      freeze_bn: false
    }
    training_config {
      batch_size_per_gpu: 16
      num_epochs: 80
      enable_qat: false
      learning_rate {
      soft_start_annealing_schedule {
        min_learning_rate: 1e-8
        max_learning_rate: 1e-2
        soft_start: 0.1
        annealing: 0.8
        }
      }
      regularizer {
        type: L1
        weight: 5e-5
      }
    }
    eval_config {
      validation_period_during_training: 10
      average_precision_mode: SAMPLE
      batch_size: 16
      matching_iou_threshold: 0.5
    }
    nms_config {
      confidence_threshold: 0.01
      clustering_iou_threshold: 0.6
      top_k: 200
    }
    augmentation_config {
      preprocessing {
        output_image_width: 1248
        output_image_height: 384
        output_image_channel: 3
        crop_right: 1248
        crop_bottom: 384
        min_bbox_width: 1.0
        min_bbox_height: 1.0
      }
      spatial_augmentation {
        hflip_probability: 0.5
        vflip_probability: 0.0
        zoom_min: 0.7
        zoom_max: 1.8
        translate_max_x: 8.0
        translate_max_y: 8.0
      }
      color_augmentation {
        hue_rotation_max: 25.0
        saturation_shift_max: 0.20000000298
        contrast_scale_max: 0.10000000149
        contrast_center: 0.5
      }
    }
    dataset_config {
      data_sources: {
        tfrecords_path: "/workspace/shoe_experiment/data/tfrecords/kitti_trainval/kitti_trainval*"
        image_directory_path: "/workspace/shoe_experiment/data/training"
      }
      image_extension: "png"
      target_class_mapping {
          key: "air_force_1"
          value: "air_force_1"
      }
      target_class_mapping {
          key: "air_max"
          value: "air_max"
      }
      target_class_mapping {
          key: "huaraches"
          value: "huaraches"
      }
    validation_fold: 0
    }

All of the class names in the labels are lowercase, and we have tried several min/max learning rate pairs: (1e-6, 1e-4), (1e-14, 1e-11), and (1e-8, 1e-2), all to no avail.
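
One quick way to confirm the class names is to count the first KITTI column across all label files (a minimal sketch):

import os
from collections import Counter

label_dir = "/workspace/shoe_experiment/data/training/label_2"
counts = Counter()
for name in os.listdir(label_dir):
    if not name.endswith(".txt"):
        continue
    with open(os.path.join(label_dir, name)) as f:
        for line in f:
            fields = line.split()
            if fields:
                counts[fields[0]] += 1  # first KITTI column is the class name

print(counts)  # should show only air_force_1, air_max, huaraches (all lowercase)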

Would you have any suggestions for how to fix the AP 0.0 issue? Do we need to make our dataset larger, and if so, how many images would we need per class?

What is the resolution of your images? Are they all 1248x384?

Thanks for your feedback, and good catch: our images are 800x600.

I changed the following:

output_image_width: 800
output_image_height: 600

and this to match:

crop_right: 800
crop_bottom: 600

However, I get the following error:

File "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py", line 2038, in resize_images
    x.set_shape(transpose_shape(output_shape, data_format, spatial_axes=(1, 2)))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 645, in set_shape
    raise ValueError(str(e))
ValueError: Dimension 2 in both shapes must be equal, but are 36 and 38. Shapes are [16,256,36,50] and [?,?,38,50].

I saw on Stack Overflow that the width and height might need to be divisible by 32, perhaps because of the int32 datatype (docs). Is that the case? Would we have to modify all of our image sizes?

Please resize your images so that the width and height are multiples of 16, and also modify the labels accordingly.

See
https://docs.nvidia.com/metropolis/TLT/tlt-getting-started-guide/text/supported_model_architectures.html#detectnet-v2

DetectNet_v2

  • Input size : C * W * H (where C = 1 or 3, W >= 480, H >= 272, and W, H are multiples of 16)
  • Image format : JPG, JPEG, PNG
  • Label format : KITTI detection
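
For example, 600 is not a multiple of 16, which is presumably where the shape mismatch above comes from; the nearest valid heights are 592 and 608 (576 also works). A quick arithmetic check:

# 600 is not divisible by 16; the nearest heights that are: 592 and 608 (576 works too).
for h in (600, 592, 608, 576):
    print(h, "OK" if h % 16 == 0 else "not a multiple of 16")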

Thank you for the link! I modified the image heights (and labels) to be 576 instead of 600 so they were divisible by 16.
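
Concretely, the resize plus label scaling can be done with something like the following (a minimal sketch; the output folders are placeholders and the bbox columns follow the standard KITTI layout):

import os
from PIL import Image

src_img = "/workspace/shoe_experiment/data/training/image_2"
src_lbl = "/workspace/shoe_experiment/data/training/label_2"
dst_img = src_img + "_resized"   # placeholder output folders
dst_lbl = src_lbl + "_resized"
os.makedirs(dst_img, exist_ok=True)
os.makedirs(dst_lbl, exist_ok=True)
new_w, new_h = 800, 576

for name in os.listdir(src_img):
    if not name.endswith(".png"):
        continue
    img = Image.open(os.path.join(src_img, name))
    sx, sy = new_w / img.width, new_h / img.height
    img.resize((new_w, new_h), Image.BILINEAR).save(os.path.join(dst_img, name))

    stem = os.path.splitext(name)[0]
    out = []
    with open(os.path.join(src_lbl, stem + ".txt")) as f:
        for line in f:
            v = line.split()
            # KITTI bbox is columns 4-7: xmin, ymin, xmax, ymax
            v[4], v[6] = str(float(v[4]) * sx), str(float(v[6]) * sx)
            v[5], v[7] = str(float(v[5]) * sy), str(float(v[7]) * sy)
            out.append(" ".join(v))
    with open(os.path.join(dst_lbl, stem + ".txt"), "w") as f:
        f.write("\n".join(out) + "\n")
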

The good news is that not all the AP values are 0.0 anymore! However, our model seems heavily overfitted. These were the AP values before and after pruning:
Before:

  • air_force_1 AP 0.0
  • air_max AP 0.002
  • huaraches AP 0.0
  • mAP 0.001

After:

  • air_force_1 AP 0.005
  • air_max AP 0.022
  • huaraches AP 0.0
  • mAP 0.009

The images look like this as a result:

My training spec now has the min and max learning rates at 1e-6 and 1e-4:

random_seed: 42
yolo_config {
  big_anchor_shape: "[(114.94, 60.67), (159.06, 114.59), (297.59, 176.38)]"
  mid_anchor_shape: "[(42.99, 31.91), (79.57, 31.75), (56.80, 56.93)]"
  small_anchor_shape: "[(15.60, 13.88), (30.25, 20.25), (20.67, 49.63)]"
  matching_neutral_box_iou: 0.5

  arch: "darknet"
  nlayers: 19
  arch_conv_blocks: 2

  loss_loc_weight: 0.75
  loss_neg_obj_weights: 200.0
  loss_class_weights: 1.0

  freeze_blocks: 0
  freeze_bn: false
}
training_config {
  batch_size_per_gpu: 16
  num_epochs: 80
  enable_qat: false
  learning_rate {
  soft_start_annealing_schedule {
    min_learning_rate: 1e-6
    max_learning_rate: 1e-4
    soft_start: 0.1
    annealing: 0.8
    }
  }
  regularizer {
    type: L1
    weight: 5e-5
  }
}
eval_config {
  validation_period_during_training: 10
  average_precision_mode: SAMPLE
  batch_size: 16
  matching_iou_threshold: 0.5
}
nms_config {
  confidence_threshold: 0.01
  clustering_iou_threshold: 0.6
  top_k: 200
}
augmentation_config {
  preprocessing {
    output_image_width: 800
    output_image_height: 576
    output_image_channel: 3
    crop_right: 800
    crop_bottom: 576
    min_bbox_width: 1.0
    min_bbox_height: 1.0
  }
  spatial_augmentation {
    hflip_probability: 0.5
    vflip_probability: 0.0
    zoom_min: 0.7
    zoom_max: 1.8
    translate_max_x: 8.0
    translate_max_y: 8.0
  }
  color_augmentation {
    hue_rotation_max: 25.0
    saturation_shift_max: 0.20000000298
    contrast_scale_max: 0.10000000149
    contrast_center: 0.5
  }
}
dataset_config {
  data_sources: {
    tfrecords_path: "/workspace/shoe_experiment/data/tfrecords/kitti_trainval/kitti_trainval*"
    image_directory_path: "/workspace/shoe_experiment/data/training"
  }
  image_extension: "png"
  target_class_mapping {
      key: "air_force_1"
      value: "air_force_1"
  }
  target_class_mapping {
      key: "air_max"
      value: "air_max"
  }
  target_class_mapping {
      key: "huaraches"
      value: "huaraches"
  }
validation_fold: 0
}

Do you have suggestions for what we could try to improve it?

I see that you are training yolo_v3 rather than detectnet_v2, so please refer to the YOLOv3 requirements below instead.
Also make sure the labels are resized accordingly, and generate the correct big/mid/small anchor shapes and fill them into the spec.

YOLOv3

  • Input size : C * W * H (where C = 1 or 3, W >= 128, H >= 128, and W, H are multiples of 32)
  • Image format : JPG, JPEG, PNG
  • Label format : KITTI detection

Would you happen to have any resources we can look at for generating a new anchor shape?

We ran python kmeans.py -l /workspace/shoe_experiment/data/training/label_2/ -n 9 and used the resulting anchor shapes in our config file:

big_anchor_shape: "[(689.00, 274.00), (682.50, 350.00), (742.00, 487.50)]"
mid_anchor_shape: "[(521.50, 255.50), (696.50, 206.50), (664.00, 234.50)]"
small_anchor_shape: "[(218.00, 198.00), (433.50, 167.50), (348.50, 235.00)]"

Results after pruning:

  • air_force_1 AP 0.053
  • air_max AP 0.237
  • huaraches AP 0.0
  • mAP 0.097

The AP values have improved; however, the results are still extremely overfitted.

Are there any other flags and/or values we should be using in our kmeans command?

usage: kmeans [-h] -l LABEL_FOLDERS [LABEL_FOLDERS ...] [-n NUM_CLUSTERS]
              [--ratio_x RATIO_X] [--ratio_y RATIO_Y] [--max_steps MAX_STEPS]
              [--min_x MIN_X] [--min_y MIN_Y]
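
For context, our understanding is that the script clusters the label box widths and heights; conceptually it is doing something like this (a scikit-learn sketch, not the actual kmeans.py implementation):

import os
import numpy as np
from sklearn.cluster import KMeans

label_dir = "/workspace/shoe_experiment/data/training/label_2"
sizes = []
for name in os.listdir(label_dir):
    if not name.endswith(".txt"):
        continue
    with open(os.path.join(label_dir, name)) as f:
        for line in f:
            v = line.split()
            if len(v) >= 8:
                # box width/height from the KITTI bbox columns 4-7
                sizes.append((float(v[6]) - float(v[4]), float(v[7]) - float(v[5])))

centers = KMeans(n_clusters=9, random_state=0).fit(np.array(sizes)).cluster_centers_
# Sorted by area: the three smallest go into small_anchor_shape, then mid, then big.
for w, h in sorted(centers.tolist(), key=lambda c: c[0] * c[1]):
    print(f"({w:.2f}, {h:.2f})")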

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one.
Thanks

Can you share your training log and latest training spec?
I am afraid the loss is not decreasing.

Also, can you try a lower batch size?

And please double-check that your labels are still correct after you resized the images.
If they are, just keep the anchor shapes you have generated.
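
For example (a rough sketch using Pillow; the output folder is just a placeholder), you could draw the label boxes on a few of the resized images and inspect them:

import os
from PIL import Image, ImageDraw

img_dir = "/workspace/shoe_experiment/data/training/image_2"
lbl_dir = "/workspace/shoe_experiment/data/training/label_2"
out_dir = "/workspace/shoe_experiment/data/label_check"   # placeholder output folder
os.makedirs(out_dir, exist_ok=True)

for name in sorted(os.listdir(img_dir))[:10]:   # spot-check the first 10 images
    img = Image.open(os.path.join(img_dir, name)).convert("RGB")
    draw = ImageDraw.Draw(img)
    with open(os.path.join(lbl_dir, os.path.splitext(name)[0] + ".txt")) as f:
        for line in f:
            v = line.split()
            draw.rectangle([float(v[4]), float(v[5]), float(v[6]), float(v[7])], outline="red", width=2)
            draw.text((float(v[4]), float(v[5])), v[0], fill="red")
    img.save(os.path.join(out_dir, name))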