Mean average precision too low at 640×480 with DetectNet_v2 + ResNet18?

I trained models with DetectNet_v2 using two backbones (ResNet10 and ResNet18). The mean average precision (mAP) with ResNet10 is 37, while with ResNet18 it is 44.05.

I have a single training class (person); the dataset details are below:
Training images: 103,851
Testing images: 24,414

Following the official DetectNet_v2 documentation, I resized all dataset images to the same size (640, 480) and resized the bounding boxes accordingly, but the results are not what I expected.
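For reference, here is a minimal sketch of the offline resize step I mean (assuming the standard KITTI label layout and OpenCV; directory names are placeholders):

import os
import cv2

# Minimal sketch: resize every image to 640x480 and scale the KITTI
# bounding boxes by the same factors. Directory names are placeholders.
SRC_IMG, SRC_LBL = "images_raw", "labels_raw"
DST_IMG, DST_LBL = "image_2", "label_2"
TARGET_W, TARGET_H = 640, 480

for fname in os.listdir(SRC_IMG):
    img = cv2.imread(os.path.join(SRC_IMG, fname))
    h, w = img.shape[:2]
    sx, sy = TARGET_W / w, TARGET_H / h
    cv2.imwrite(os.path.join(DST_IMG, fname),
                cv2.resize(img, (TARGET_W, TARGET_H)))
    # KITTI labels: the bbox occupies columns 4..7 (xmin, ymin, xmax, ymax)
    label_name = os.path.splitext(fname)[0] + ".txt"
    scaled = []
    with open(os.path.join(SRC_LBL, label_name)) as f:
        for line in f:
            parts = line.split()
            for i, s in zip((4, 5, 6, 7), (sx, sy, sx, sy)):
                parts[i] = "%.2f" % (float(parts[i]) * s)
            scaled.append(" ".join(parts))
    with open(os.path.join(DST_LBL, label_name), "w") as f:
        f.write("\n".join(scaled) + "\n")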

Why is this happening? Deep learning models usually need more data for the best performance, but here the situation seems to be the opposite: I have a large dataset, yet the mAP is still low.

Here is my KITTI train/val conversion spec file:

kitti_config {
  root_directory_path: "/workspace/tlt-experiments/data/training"
  image_dir_name: "image_2"
  label_dir_name: "label_2"
  image_extension: ".jpg"
  partition_mode: "random"
  num_partitions: 2
  val_split: 20
  num_shards: 20
}
image_directory_path: "/workspace/tlt-experiments/data/training"
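For context, val_split: 20 sends roughly 20% of the frames to partition 0, which the training specs below then select with validation_fold: 0. A back-of-the-envelope check of the expected partition sizes (plain arithmetic, not the converter's exact logic):

# Rough expected partition sizes for val_split: 20
total_images = 103_851
num_val = int(total_images * 20 / 100)   # ~20,770 images in fold 0
num_train = total_images - num_val       # ~83,081 images for training
print(num_train, num_val)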

Below is my training spec file for ResNet10:

random_seed: 42
dataset_config {
  data_sources {
    tfrecords_path: "/workspace/tlt-experiments/data/tfrecords/kitti_trainval/*"
    image_directory_path: "/workspace/tlt-experiments/data/training"
  }
  image_extension: "jpg"
  target_class_mapping {
    key: "person"
    value: "person"
  }
  validation_fold: 0
}
augmentation_config {
  preprocessing {
    output_image_width: 640
    output_image_height: 480
    min_bbox_width: 1.0
    min_bbox_height: 1.0
    output_image_channel: 3
  }
  spatial_augmentation {
    hflip_probability: 0.5
    vflip_probability: 0.0
    zoom_min: 1.0
    zoom_max: 1.0
    translate_max_x: 8.0
    translate_max_y: 8.0
  }
  color_augmentation {
    color_shift_stddev: 0.0
    hue_rotation_max: 25.0
    saturation_shift_max: 0.2
    contrast_scale_max: 0.1
    contrast_center: 0.5
  }
}
postprocessing_config {
  target_class_config {
    key: "person"
    value {
      clustering_config {
        clustering_algorithm: DBSCAN
        dbscan_confidence_threshold: 0.9
        coverage_threshold: 0.005
        dbscan_eps: 0.15
        dbscan_min_samples: 0.05
        minimum_bounding_box_height: 20
      }
    }
  }
}
model_config {
  pretrained_model_file: "/workspace/tlt-experiments/detectnet_v2/pretrained_resnet10/tlt_pretrained_detectnet_v2_vresnet10/resnet10.hdf5"
  num_layers: 10
  freeze_blocks: 0
  freeze_blocks: 1
  all_projections: True
  use_pooling: False
  use_batch_norm: true
  dropout_rate: 0.0
  objective_set {
    bbox {
      scale: 35.0
      offset: 0.5
    }
    cov {
    }
  }
  training_precision {
    backend_floatx: FLOAT32
  }
  arch: "resnet"
}
evaluation_config {
  validation_period_during_training: 5
  first_validation_epoch: 1
  minimum_detection_ground_truth_overlap {
    key: "person"
    value: 0.5
  }
  evaluation_box_config {
    key: "person"
    value {
      minimum_height: 4
      maximum_height: 9999
      minimum_width: 4
      maximum_width: 9999
    }
  }
  average_precision_mode: INTEGRATE
}
cost_function_config {
  target_classes {
    name: "person"
    class_weight: 1.0
    coverage_foreground_weight: 0.05
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 10.0
    }
  }
  enable_autoweighting: true
  max_objective_weight: 0.999899983406
  min_objective_weight: 9.99999974738e-05
}
training_config {
  batch_size_per_gpu: 16
  num_epochs: 120
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 5e-06
      max_learning_rate: 5e-04
      soft_start: 0.1
      annealing: 0.7
    }
  }
  regularizer {
    type: L1
    weight: 3e-9
  }
  optimizer {
    adam {
      epsilon: 9.99999993923e-09
      beta1: 0.9
      beta2: 0.999
    }
  }
  cost_scaling {
    initial_exponent: 20.0
    increment: 0.005
    decrement: 1.0
  }
  checkpoint_interval: 5
}
bbox_rasterizer_config {
  target_class_config {
    key: "person"
    value {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 0.4
      cov_radius_y: 0.4
      bbox_min_radius: 1.0
    }
  }
  deadzone_radius: 0.67
}
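As an aside, my reading of the soft_start_annealing_schedule above (an approximation, not TLT's exact implementation) is: the learning rate ramps up exponentially over the first 10% of training, holds at the maximum, then anneals back down from the 70% mark:

import math

MIN_LR, MAX_LR = 5e-06, 5e-04
SOFT_START, ANNEALING = 0.1, 0.7

def learning_rate(progress):
    """Approximate LR at a given training progress in [0, 1]."""
    if progress < SOFT_START:
        t = progress / SOFT_START               # exponential warm-up
    elif progress < ANNEALING:
        return MAX_LR                           # hold at the peak
    else:
        t = 1.0 - (progress - ANNEALING) / (1.0 - ANNEALING)  # decay
    return MIN_LR * math.exp(t * math.log(MAX_LR / MIN_LR))

for epoch in (0, 12, 60, 84, 120):
    print(epoch, learning_rate(epoch / 120))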

Below is my training spec file for ResNet18:

random_seed: 42
dataset_config {
  data_sources {
    tfrecords_path: "/workspace/tlt-experiments/data/tfrecords/kitti_trainval/*"
    image_directory_path: "/workspace/tlt-experiments/data/training"
  }
  image_extension: "jpg"
  target_class_mapping {
    key: "person"
    value: "person"
  }
  validation_fold: 0
}
augmentation_config {
  preprocessing {
    output_image_width: 640
    output_image_height: 480
    min_bbox_width: 1.0
    min_bbox_height: 1.0
    output_image_channel: 3
  }
  spatial_augmentation {
    hflip_probability: 0.5
    vflip_probability: 0.0
    zoom_min: 1.0
    zoom_max: 1.0
    translate_max_x: 8.0
    translate_max_y: 8.0
  }
  color_augmentation {
    color_shift_stddev: 0.0
    hue_rotation_max: 25.0
    saturation_shift_max: 0.2
    contrast_scale_max: 0.1
    contrast_center: 0.5
  }
}
postprocessing_config {
  target_class_config {
    key: "person"
    value {
      clustering_config {
        clustering_algorithm: DBSCAN
        dbscan_confidence_threshold: 0.9
        coverage_threshold: 0.005
        dbscan_eps: 0.15
        dbscan_min_samples: 0.05
        minimum_bounding_box_height: 20
      }
    }
  }
}
model_config {
  pretrained_model_file: "/workspace/tlt-experiments/detectnet_v2/pretrained_resnet18/tlt_pretrained_detectnet_v2_vresnet18/resnet18.hdf5"
  num_layers: 18
  freeze_blocks: 0
  freeze_blocks: 1
  all_projections: True
  use_pooling: False
  use_batch_norm: true
  dropout_rate: 0.0
  objective_set {
    bbox {
      scale: 35.0
      offset: 0.5
    }
    cov {
    }
  }
  training_precision {
    backend_floatx: FLOAT32
  }
  arch: "resnet"
}
evaluation_config {
  validation_period_during_training: 5
  first_validation_epoch: 1
  minimum_detection_ground_truth_overlap {
    key: "person"
    value: 0.5
  }
  evaluation_box_config {
    key: "person"
    value {
      minimum_height: 4
      maximum_height: 9999
      minimum_width: 4
      maximum_width: 9999
    }
  }
  average_precision_mode: INTEGRATE
}
cost_function_config {
  target_classes {
    name: "person"
    class_weight: 1.0
    coverage_foreground_weight: 0.05
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 10.0
    }
  }
  enable_autoweighting: true
  max_objective_weight: 0.999899983406
  min_objective_weight: 9.99999974738e-05
}
training_config {
  batch_size_per_gpu: 16
  num_epochs: 120
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 5e-06
      max_learning_rate: 5e-04
      soft_start: 0.1
      annealing: 0.7
    }
  }
  regularizer {
    type: L1
    weight: 3e-9
  }
  optimizer {
    adam {
      epsilon: 9.99999993923e-09
      beta1: 0.9
      beta2: 0.999
    }
  }
  cost_scaling {
    initial_exponent: 20.0
    increment: 0.005
    decrement: 1.0
  }
  checkpoint_interval: 5
}
bbox_rasterizer_config {
  target_class_config {
    key: "person"
    value {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 0.4
      cov_radius_y: 0.4
      bbox_min_radius: 1.0
    }
  }
  deadzone_radius: 0.67
}
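Both specs use average_precision_mode: INTEGRATE, i.e. the AP is the area under the interpolated precision-recall curve rather than an 11-point sample. A generic VOC-style sketch of the two modes (illustrative only, not TLT's exact code):

import numpy as np

def average_precision(recall, precision, mode="INTEGRATE"):
    # Interpolate: make precision non-increasing along recall.
    prec = np.maximum.accumulate(precision[::-1])[::-1]
    if mode == "SAMPLE":
        # 11-point sampling (PASCAL VOC 2007 style)
        points = [prec[recall >= r].max() if (recall >= r).any() else 0.0
                  for r in np.linspace(0, 1, 11)]
        return float(np.mean(points))
    # INTEGRATE: area under the interpolated curve
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([prec[0]], prec, [0.0]))
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))

recall = np.array([0.1, 0.4, 0.7, 0.9])
precision = np.array([0.95, 0.85, 0.6, 0.3])
print(average_precision(recall, precision))             # ~0.59
print(average_precision(recall, precision, "SAMPLE"))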

Below are the results for ResNet18.

Thanks.

What is the average resolution of your dataset? You can set the input size to the average resolution of the dataset.
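Something like this quick sketch can estimate it (using Pillow; the directory path is taken from your spec, adjust as needed):

import os
from PIL import Image

img_dir = "/workspace/tlt-experiments/data/training/image_2"
widths, heights = [], []
for fname in os.listdir(img_dir):
    with Image.open(os.path.join(img_dir, fname)) as im:
        w, h = im.size
        widths.append(w)
        heights.append(h)
print(sum(widths) / len(widths), sum(heights) / len(heights))

Then round the averages to the nearest valid size (multiples of 16, see the constraint below).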

Indeed, in the current TLT 3.0_dp version, the detectnet_v2 network requires resizing images/labels offline. But it is not a must to set them to (640, 480). See https://docs.nvidia.com/metropolis/TLT/tlt-user-guide/text/open_model_architectures.html#detectnet-v2

  • Input size: C * W * H (where C = 1 or 3, W >= 480, H >= 272, and W, H are multiples of 16)
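A tiny helper (my own, not part of TLT) to check a candidate size against those constraints:

def valid_input_size(width, height):
    """W >= 480, H >= 272, both multiples of 16."""
    return (width >= 480 and height >= 272
            and width % 16 == 0 and height % 16 == 0)

print(valid_input_size(640, 480))   # True
print(valid_input_size(960, 544))   # True
print(valid_input_size(640, 360))   # False: 360 is not a multiple of 16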

Also, could you fine-tune the batch size? You can run different experiments with it.

And is “person” a small object in your images? If yes, see Frequently Asked Questions — Transfer Learning Toolkit 3.0 documentation.

The following parameters can help you improve AP on smaller objects (an illustrative spec change is sketched after the list):

  • Increase num_layers of resnet
  • class_weight for small objects
  • Increase the coverage_radius_x and coverage_radius_y parameters of the bbox_rasterizer_config section for the small objects class
  • Decrease minimum_detection_ground_truth_overlap
  • Lower minimum_height to cover more small objects for evaluation.
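For example, the corresponding changes in your specs might look like this (illustrative values only, other fields unchanged; tune them on your data):

bbox_rasterizer_config {
  target_class_config {
    key: "person"
    value {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 1.0   # increased from 0.4
      cov_radius_y: 1.0   # increased from 0.4
      bbox_min_radius: 1.0
    }
  }
  deadzone_radius: 0.67
}
evaluation_config {
  minimum_detection_ground_truth_overlap {
    key: "person"
    value: 0.4            # decreased from 0.5
  }
  evaluation_box_config {
    key: "person"
    value {
      minimum_height: 2   # lowered from 4
      maximum_height: 9999
      minimum_width: 2
      maximum_width: 9999
    }
  }
}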