Training Custom Object detector with 6 classes

Hi Guys,

I am training a custom object detection model (ResNet-10 backbone with detectnet_v2) for 6 classes using the VOC and COCO datasets. I have converted these datasets to the KITTI data format, created the tfrecords, and edited the spec file for a multi-class detector. However, when I evaluate the trained model after 50 epochs, I do not get reliable average precision figures.

I am getting the following mAP results:

class name      average precision (in %)
------------  --------------------------
bicycle                         0
bus                             2.48739
car                             0
motorbike                       0.42388
person                          6.92905
truck                           0.442265

During training and evaluation, I got the following message:

target/truncation is not updated to match the crop area if the dataset contains target/truncation.

During evaluation, I also got the following message:

One or more metadata field(s) are missing from ground_truth batch_data, and will be replaced with defaults: ['frame/camera_location']

Following is the statistic for number of data samples:

Number of images in the trainval set. 319492
Number of labels in the trainval set. 319492
Number of images in the test set. 7518

Kindly check the sample label format used for the tfrecords conversion below:

car 0.0 0 -1 141 50 500 330 -1 -1 -1 -1 -1 -1 -1
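
For reference, a label line like the one above could be produced from a VOC/COCO-style 2D box with a small helper. This is only a sketch: `to_kitti_line` is a hypothetical name (not part of TLT), and it assumes the 15-field KITTI layout (class, truncation, occlusion, alpha, four box corners, then seven 3D fields), with the unused fields filled with the same placeholder values as in the sample.

```python
def to_kitti_line(class_name, xmin, ymin, xmax, ymax):
    """Format one 2D box as a 15-field KITTI label line (hypothetical helper)."""
    if xmax <= xmin or ymax <= ymin:
        raise ValueError("box must have positive width and height")
    fields = [
        class_name,  # object class, e.g. "car"
        "0.0",       # truncation (placeholder, as in the sample above)
        "0",         # occlusion (placeholder, as in the sample above)
        "-1",        # alpha (unused for 2D detection)
        str(xmin), str(ymin), str(xmax), str(ymax),  # box corners in pixels
    ]
    fields += ["-1"] * 7  # 3D dimensions, location and rotation_y (unused)
    return " ".join(fields)


line = to_kitti_line("car", 141, 50, 500, 330)
print(line)
```

With the box from the sample, the helper reproduces the same line, which makes it easy to spot a missing or extra field before the tfrecords conversion.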

The spec file for training is as follows:

random_seed: 42
dataset_config {
  data_sources {
    tfrecords_path: "/workspace/tlt-experiments/tfrecords/kitti_trainval/*"
    image_directory_path: "/workspace/tlt-experiments/data/training"
  }
  image_extension: "jpg"
  
  target_class_mapping {
    key: "person"
    value: "person"
  }

  target_class_mapping {
    key: "bicycle"
    value: "bicycle"
  }
  target_class_mapping {
    key: "car"
    value: "car"
  }
  target_class_mapping {
    key: "motorbike"
    value: "motorbike"
  }
  target_class_mapping {
    key: "bus"
    value: "bus"
  }
  target_class_mapping {
    key: "truck"
    value: "truck"
  }
  validation_fold: 0
}
augmentation_config {
  preprocessing {
    output_image_width: 1248
    output_image_height: 384
    min_bbox_width: 1.0
    min_bbox_height: 1.0
    output_image_channel: 3
  }
  spatial_augmentation {
    hflip_probability: 0.5
    zoom_min: 1.0
    zoom_max: 1.0
    translate_max_x: 8.0
    translate_max_y: 8.0
  }
  color_augmentation {
    hue_rotation_max: 25.0
    saturation_shift_max: 0.20000000298
    contrast_scale_max: 0.10000000149
    contrast_center: 0.5
  }
}
postprocessing_config {
  target_class_config {
    key: "person"
    value {
      clustering_config {
        coverage_threshold:0.00749999983236
        dbscan_eps: 0.230000004172
        dbscan_min_samples: 0.0500000007451
        minimum_bounding_box_height: 20
      }
    }
  }
  target_class_config {
    key: "bicycle"
    value {
      clustering_config {
        coverage_threshold: 0.00499999988824
        dbscan_eps: 0.15000000596
        dbscan_min_samples: 0.0500000007451
        minimum_bounding_box_height: 20
      }
    }
  }
  target_class_config {
    key: "car"
    value {
      clustering_config {
        coverage_threshold: 0.00499999988824
        dbscan_eps:  0.20000000298
        dbscan_min_samples: 0.0500000007451
        minimum_bounding_box_height: 20
      }
    }
  }

  target_class_config {
    key: "motorbike"
    value {
      clustering_config {
        coverage_threshold: 0.00749999983236
        dbscan_eps: 0.230000004172
        dbscan_min_samples: 0.0500000007451
        minimum_bounding_box_height: 20
      }
    }
  }
  target_class_config {
    key: "bus"
    value {
      clustering_config {
        coverage_threshold: 0.00749999983236
        dbscan_eps: 0.230000004172
        dbscan_min_samples: 0.0500000007451
        minimum_bounding_box_height: 20
      }
    }
  }

  target_class_config {
    key: "truck"
    value {
      clustering_config {
        coverage_threshold: 0.00749999983236
        dbscan_eps: 0.230000004172
        dbscan_min_samples: 0.0500000007451
        minimum_bounding_box_height: 20
      }
    }
  }


}
model_config {
  pretrained_model_file: "/workspace/tlt-experiments/pretrained_resnet10/tlt_resnet10_detectnet_v2_v1/resnet10.hdf5"
  num_layers: 10
  use_batch_norm: true
  activation {
    activation_type: "relu"
  }
  objective_set {
    bbox {
      scale: 35.0
      offset: 0.5
    }
    cov {
    }
  }
  training_precision {
    backend_floatx: FLOAT32
  }
  arch: "resnet"
}
evaluation_config {
  validation_period_during_training: 10
  first_validation_epoch: 1
  minimum_detection_ground_truth_overlap {
    key: "person"
    value: 0.5
  }
  minimum_detection_ground_truth_overlap {
    key: "bicycle"
    value: 0.5
  }
  minimum_detection_ground_truth_overlap {
    key: "car"
    value: 0.699999988079
  }
  minimum_detection_ground_truth_overlap {
    key: "motorbike"
    value: 0.5
  }
  minimum_detection_ground_truth_overlap {
    key: "bus"
    value: 0.5
  }
  minimum_detection_ground_truth_overlap {
    key: "truck"
    value: 0.5
  }

  evaluation_box_config {
    key: "person"
    value {
      minimum_height: 20
      maximum_height: 9999
      minimum_width: 10
      maximum_width: 9999
    }
  }
  evaluation_box_config {
    key: "bicycle"
    value {
      minimum_height: 20
      maximum_height: 9999
      minimum_width: 10
      maximum_width: 9999
    }
  }
  evaluation_box_config {
    key: "car"
    value {
      minimum_height: 20
      maximum_height: 9999
      minimum_width: 10
      maximum_width: 9999
    }
  }
  evaluation_box_config {
    key: "motorbike"
    value {
      minimum_height: 20
      maximum_height: 9999
      minimum_width: 10
      maximum_width: 9999
    }
  }

  evaluation_box_config {
    key: "bus"
    value {
      minimum_height: 20
      maximum_height: 9999
      minimum_width: 10
      maximum_width: 9999
    }
  }

  evaluation_box_config {
    key: "truck"
    value {
      minimum_height: 20
      maximum_height: 9999
      minimum_width: 10
      maximum_width: 9999
    }
  } 
    

  average_precision_mode: INTEGRATE
}
cost_function_config {
  target_classes {
    name: "person"
    class_weight: 4.0
    coverage_foreground_weight: 0.0500000007451
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 10.0
    }
  }
  target_classes {
    name: "bicycle"
    class_weight: 8.0
    coverage_foreground_weight: 0.0500000007451
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 1.0
    }
  }
  target_classes {
    name: "car"
    class_weight: 1.0
    coverage_foreground_weight: 0.0500000007451
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 10.0
    }
  }
  target_classes {
    name: "motorbike"
    class_weight: 8.0
    coverage_foreground_weight: 0.0500000007451
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 1.0
    }
  }

  target_classes {
    name: "bus"
    class_weight: 8.0
    coverage_foreground_weight: 0.0500000007451
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 1.0
    }
  }

  target_classes {
    name: "truck"
    class_weight: 8.0
    coverage_foreground_weight: 0.0500000007451
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 1.0
    }
  }

  enable_autoweighting: true
  max_objective_weight: 0.999899983406
  min_objective_weight: 9.99999974738e-05
}
training_config {
  batch_size_per_gpu: 16
  num_epochs: 250
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 5e-06
      max_learning_rate: 5e-04
      soft_start: 0.10000000149
      annealing: 0.699999988079
    }
  }
  regularizer {
    type: L1
    weight: 3.00000002618e-09
  }
  optimizer {
    adam {
      epsilon: 9.99999993923e-09
      beta1: 0.899999976158
      beta2: 0.999000012875
    }
  }
  cost_scaling {
    initial_exponent: 20.0
    increment: 0.005
    decrement: 1.0
  }
  checkpoint_interval: 10
}
bbox_rasterizer_config {
  target_class_config {
    key: "person"
    value {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 1.0
      cov_radius_y: 1.0
      bbox_min_radius: 1.0
    }
  }
  target_class_config {
    key: "bicycle"
    value {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 1.0
      cov_radius_y: 1.0
      bbox_min_radius: 1.0
    }
  }
  target_class_config {
    key: "car"
    value {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 0.40000000596
      cov_radius_y: 0.40000000596
      bbox_min_radius: 1.0
    }
  }

  target_class_config {
    key: "motorbike"
    value {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 1.0
      cov_radius_y: 1.0
      bbox_min_radius: 1.0
    }
  }

  target_class_config {
    key: "bus"
    value {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 1.0
      cov_radius_y: 1.0
      bbox_min_radius: 1.0
    }
  }

  target_class_config {
    key: "truck"
    value {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 1.0
      cov_radius_y: 1.0
      bbox_min_radius: 1.0
    }
  }

  deadzone_radius: 0.400000154972
}

Kindly let me know where I am going wrong. Please help me out.

Thanks.

Hi,

One or more metadata field(s) are missing from ground_truth batch_data, and will be replaced with defaults: ['frame/camera_location']

It looks like the TLT toolkit cannot find the corresponding ground truth for your evaluation dataset.
Could you check it first?

Thanks.

Hi AastaLLL,

I have replaced all the parameters with the value -1. From the documentation, I could understand that only the class name and bounding box corners (xmin , ymin, xmax , ymax) need to be provided.

Also, the training and validation data are combined into a single training set. Could that be causing issues during training as well?

Kindly help me out if other ground truth data is required.

Thanks.

Hi neophyte1,
Could you please check “Model Requirements” at https://docs.nvidia.com/metropolis/TLT/tlt-getting-started-guide/index.html#requirements ?

For DetectNet_v2, the tlt-train tool does not support training on images of multiple resolutions, or resizing images during training. All of the images must be resized offline to the final training size, and the corresponding bounding boxes must be scaled accordingly.

Hi Morganh,

Thanks for your help. I checked the requirement for the same image size in the sample dataset pointed to for running the DetectNet_v2 sample. The images do not seem to be of the same size. For example,

training/image_2/000000.png - 1224x370
training/image_2/000001.png - 1242x375

How has the resizing been done, if at all? Since we provide these images directly for tfrecords generation, it implies that images of different resolutions are being passed to the training tool.

Can you please help me understand?

Thanks.

Hi neophyte1,
The KITTI dataset images (1242x375, 1238x374, 1224x370, 1241x376) almost match the spec (1248x384), but not exactly. During training, there is a crop step that brings them to the same size. If the original image is smaller than the model input size, the crop becomes padding.

But you mentioned that your dataset is VOC and COCO (640x480), which is far from (1248,384).
So for detectnet_v2, please resize the images offline to the final training size.
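
A minimal sketch of the label side of that offline resize is shown below. It assumes boxes are stored as (xmin, ymin, xmax, ymax) in pixels and that each image is stretched independently in x and y to the target size; `scale_box` is a hypothetical helper name, and the image itself would still need to be resized with an image library of your choice.

```python
def scale_box(box, src_size, dst_size):
    """Scale (xmin, ymin, xmax, ymax) from a src (w, h) image to a dst (w, h) image."""
    sx = dst_size[0] / src_size[0]  # horizontal scale factor
    sy = dst_size[1] / src_size[1]  # vertical scale factor
    xmin, ymin, xmax, ymax = box
    return (xmin * sx, ymin * sy, xmax * sx, ymax * sy)


# Example: a 640x480 image resized to the 1248x384 size used in the sample spec.
scaled = scale_box((141, 50, 500, 330), (640, 480), (1248, 384))
print([round(v, 2) for v in scaled])
```

Note that the two scale factors differ here (1.95 horizontally, 0.8 vertically), so objects are distorted. Picking a training size closer to the raw aspect ratio keeps the factors similar.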

Hi Morganh,

Where do we specify the final training size? Can we change (1248,384) to (480,480), for example? As advised, I am resizing all my images to 480x480 and then feeding them to training.

I made the change in the train config file as below :

augmentation_config {
  preprocessing {
    output_image_width: 480
    output_image_height: 480
    min_bbox_width: 1.0
    min_bbox_height: 1.0
    output_image_channel: 3
  }
  spatial_augmentation {
    hflip_probability: 0.5
    zoom_min: 1.0
    zoom_max: 1.0
    translate_max_x: 8.0
    translate_max_y: 8.0
  }
  color_augmentation {
    hue_rotation_max: 25.0
    saturation_shift_max: 0.20000000298
    contrast_scale_max: 0.10000000149
    contrast_center: 0.5
  }
}

Is this correct?

Kindly let me know.

Thanks.

Yes, you can. It looks OK. The requirements are W >= 480, H >= 272, and both W and H must be multiples of 16.

Also, as the TLT doc mentions, the corresponding bounding boxes need to be scaled accordingly.
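
The size constraints stated in this reply can be checked before editing the spec file. The sketch below only encodes those three rules (minimum width 480, minimum height 272, both multiples of 16); `check_training_size` is a hypothetical helper, not a TLT function.

```python
def check_training_size(width, height, min_width=480, min_height=272, multiple=16):
    """Return a list of rule violations for a candidate training resolution."""
    problems = []
    if width < min_width:
        problems.append(f"width {width} is below the minimum of {min_width}")
    if height < min_height:
        problems.append(f"height {height} is below the minimum of {min_height}")
    if width % multiple != 0:
        problems.append(f"width {width} is not a multiple of {multiple}")
    if height % multiple != 0:
        problems.append(f"height {height} is not a multiple of {multiple}")
    return problems


for size in [(480, 480), (1248, 384), (640, 470)]:
    issues = check_training_size(*size)
    print(size, "OK" if not issues else issues)
```

An empty list means the resolution satisfies the rules above. It says nothing about whether the resolution suits the dataset.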

Hi Morganh,

Thanks for your help. After training the model for 50 epochs, I get strange results: only the car class has a significantly lower precision than the others. Please find the evaluation results below:

Validation cost: 0.000277
Mean average_precision (in %): 27.2255

class name      average precision (in %)
------------  --------------------------
bicycle                         18.8004
bus                             45.8717
car                              6.98904
motorbike                       28.2466
person                          44.9052
truck                           18.54

Median Inference Time: 0.004990

Please find my training spec file below:

random_seed: 42
dataset_config {
  data_sources {
    tfrecords_path: "/workspace/tlt-experiments/tfrecords/kitti_trainval/*"
    image_directory_path: "/workspace/tlt-experiments/data/training"
  }
  image_extension: "jpg"
  
  target_class_mapping {
    key: "person"
    value: "person"
  }

  target_class_mapping {
    key: "bicycle"
    value: "bicycle"
  }
  target_class_mapping {
    key: "car"
    value: "car"
  }
  target_class_mapping {
    key: "motorbike"
    value: "motorbike"
  }
  target_class_mapping {
    key: "bus"
    value: "bus"
  }
  target_class_mapping {
    key: "truck"
    value: "truck"
  }
  validation_fold: 0
}
augmentation_config {
  preprocessing {
    output_image_width: 480
    output_image_height: 480
    min_bbox_width: 1.0
    min_bbox_height: 1.0
    output_image_channel: 3
  }
  spatial_augmentation {
    hflip_probability: 0.5
    zoom_min: 1.0
    zoom_max: 1.0
    translate_max_x: 8.0
    translate_max_y: 8.0
  }
  color_augmentation {
    hue_rotation_max: 25.0
    saturation_shift_max: 0.20000000298
    contrast_scale_max: 0.10000000149
    contrast_center: 0.5
  }
}
postprocessing_config {
  target_class_config {
    key: "person"
    value {
      clustering_config {
        coverage_threshold:0.00749999983236
        dbscan_eps: 0.230000004172
        dbscan_min_samples: 0.0500000007451
        minimum_bounding_box_height: 20
      }
    }
  }
  target_class_config {
    key: "bicycle"
    value {
      clustering_config {
        coverage_threshold: 0.00499999988824
        dbscan_eps: 0.15000000596
        dbscan_min_samples: 0.0500000007451
        minimum_bounding_box_height: 20
      }
    }
  }
  target_class_config {
    key: "car"
    value {
      clustering_config {
        coverage_threshold: 0.00499999988824
        dbscan_eps:  0.20000000298
        dbscan_min_samples: 0.0500000007451
        minimum_bounding_box_height: 20
      }
    }
  }

  target_class_config {
    key: "motorbike"
    value {
      clustering_config {
        coverage_threshold: 0.00749999983236
        dbscan_eps: 0.230000004172
        dbscan_min_samples: 0.0500000007451
        minimum_bounding_box_height: 20
      }
    }
  }
  target_class_config {
    key: "bus"
    value {
      clustering_config {
        coverage_threshold: 0.00749999983236
        dbscan_eps: 0.230000004172
        dbscan_min_samples: 0.0500000007451
        minimum_bounding_box_height: 20
      }
    }
  }

  target_class_config {
    key: "truck"
    value {
      clustering_config {
        coverage_threshold: 0.00749999983236
        dbscan_eps: 0.230000004172
        dbscan_min_samples: 0.0500000007451
        minimum_bounding_box_height: 20
      }
    }
  }


}
model_config {
  pretrained_model_file: "/workspace/tlt-experiments/pretrained_resnet10/tlt_resnet10_detectnet_v2_v1/resnet10.hdf5"
  num_layers: 10
  use_batch_norm: true
  activation {
    activation_type: "relu"
  }
  objective_set {
    bbox {
      scale: 35.0
      offset: 0.5
    }
    cov {
    }
  }
  training_precision {
    backend_floatx: FLOAT32
  }
  arch: "resnet"
}
evaluation_config {
  validation_period_during_training: 10
  first_validation_epoch: 1
  minimum_detection_ground_truth_overlap {
    key: "person"
    value: 0.5
  }
  minimum_detection_ground_truth_overlap {
    key: "bicycle"
    value: 0.5
  }
  minimum_detection_ground_truth_overlap {
    key: "car"
    value: 0.699999988079
  }
  minimum_detection_ground_truth_overlap {
    key: "motorbike"
    value: 0.5
  }
  minimum_detection_ground_truth_overlap {
    key: "bus"
    value: 0.5
  }
  minimum_detection_ground_truth_overlap {
    key: "truck"
    value: 0.5
  }

  evaluation_box_config {
    key: "person"
    value {
      minimum_height: 20
      maximum_height: 9999
      minimum_width: 10
      maximum_width: 9999
    }
  }
  evaluation_box_config {
    key: "bicycle"
    value {
      minimum_height: 20
      maximum_height: 9999
      minimum_width: 10
      maximum_width: 9999
    }
  }
  evaluation_box_config {
    key: "car"
    value {
      minimum_height: 20
      maximum_height: 9999
      minimum_width: 10
      maximum_width: 9999
    }
  }
  evaluation_box_config {
    key: "motorbike"
    value {
      minimum_height: 20
      maximum_height: 9999
      minimum_width: 10
      maximum_width: 9999
    }
  }

  evaluation_box_config {
    key: "bus"
    value {
      minimum_height: 20
      maximum_height: 9999
      minimum_width: 10
      maximum_width: 9999
    }
  }

  evaluation_box_config {
    key: "truck"
    value {
      minimum_height: 20
      maximum_height: 9999
      minimum_width: 10
      maximum_width: 9999
    }
  } 
    

  average_precision_mode: INTEGRATE
}
cost_function_config {
  target_classes {
    name: "person"
    class_weight: 4.0
    coverage_foreground_weight: 0.0500000007451
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 10.0
    }
  }
  target_classes {
    name: "bicycle"
    class_weight: 8.0
    coverage_foreground_weight: 0.0500000007451
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 1.0
    }
  }
  target_classes {
    name: "car"
    class_weight: 1.0
    coverage_foreground_weight: 0.0500000007451
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 10.0
    }
  }
  target_classes {
    name: "motorbike"
    class_weight: 8.0
    coverage_foreground_weight: 0.0500000007451
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 1.0
    }
  }

  target_classes {
    name: "bus"
    class_weight: 8.0
    coverage_foreground_weight: 0.0500000007451
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 1.0
    }
  }

  target_classes {
    name: "truck"
    class_weight: 8.0
    coverage_foreground_weight: 0.0500000007451
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 1.0
    }
  }

  enable_autoweighting: true
  max_objective_weight: 0.999899983406
  min_objective_weight: 9.99999974738e-05
}
training_config {
  batch_size_per_gpu: 16
  num_epochs: 50
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 5e-06
      max_learning_rate: 5e-04
      soft_start: 0.10000000149
      annealing: 0.699999988079
    }
  }
  regularizer {
    type: L1
    weight: 3.00000002618e-09
  }
  optimizer {
    adam {
      epsilon: 9.99999993923e-09
      beta1: 0.899999976158
      beta2: 0.999000012875
    }
  }
  cost_scaling {
    initial_exponent: 20.0
    increment: 0.005
    decrement: 1.0
  }
  checkpoint_interval: 10
}
bbox_rasterizer_config {
  target_class_config {
    key: "person"
    value {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 1.0
      cov_radius_y: 1.0
      bbox_min_radius: 1.0
    }
  }
  target_class_config {
    key: "bicycle"
    value {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 1.0
      cov_radius_y: 1.0
      bbox_min_radius: 1.0
    }
  }
  target_class_config {
    key: "car"
    value {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 1.0
      cov_radius_y: 1.0
      bbox_min_radius: 1.0
    }
  }

  target_class_config {
    key: "motorbike"
    value {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 1.0
      cov_radius_y: 1.0
      bbox_min_radius: 1.0
    }
  }

  target_class_config {
    key: "bus"
    value {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 1.0
      cov_radius_y: 1.0
      bbox_min_radius: 1.0
    }
  }

  target_class_config {
    key: "truck"
    value {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 1.0
      cov_radius_y: 1.0
      bbox_min_radius: 1.0
    }
  }

  deadzone_radius: 0.400000154972
}

What do you suspect could be the issue?

Thanks.

Hi Morganh,

How should I change the parameters of the training config file based on the training input image size? Is there any documentation explaining how to customize the parameters? That is probably the reason for the terrible performance of the detector on some classes.

Thanks.

Hi neophyte1,
What is the ratio of car objects in your dataset?

Hi Morganh,

Following are the statistics as reported while creating tfrecords. The total number is also in the same ratio:

bicycle: 4305
motorbike: 5417
car: 26842
person: 154889
truck: 5931
bus: 3699
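
As a quick illustration, the share of each class can be computed directly from the counts listed above. The numbers are copied from this post; nothing else is assumed.

```python
# Object counts per class, as reported during tfrecords creation.
counts = {
    "bicycle": 4305,
    "motorbike": 5417,
    "car": 26842,
    "person": 154889,
    "truck": 5931,
    "bus": 3699,
}

total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}

# Print classes from most to least frequent.
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<10} {counts[name]:>7} {share:6.1%}")
```

Person accounts for roughly three quarters of all objects and car for about 13%, so the dataset is strongly imbalanced towards person.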

Thanks

Hi neophyte1,
Thanks for the info. Is it possible to narrow down the low mAP for car via more experiments?

  1. Retrain with batch size 4 and 120 epochs (your batch size is currently 16), or
  2. Train only 3 classes: person/bicycle/car, or
  3. Change (480,480) to another resolution.

Also, could you check the correctness of all the labels, and make sure that the images and labels match?
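
One way to automate part of that label check is sketched below. It assumes KITTI-style lines with 15 whitespace-separated fields and the box corners in fields 5 to 8; `check_kitti_line` is a hypothetical helper, and it only catches structural problems, not wrong class assignments.

```python
def check_kitti_line(line, image_width, image_height, classes):
    """Return a list of problems found in one KITTI label line."""
    fields = line.split()
    if len(fields) != 15:
        return [f"expected 15 fields, found {len(fields)}"]
    problems = []
    if fields[0] not in classes:
        problems.append(f"unknown class {fields[0]!r}")
    try:
        xmin, ymin, xmax, ymax = (float(v) for v in fields[4:8])
    except ValueError:
        return problems + ["box corners are not numeric"]
    if xmin >= xmax or ymin >= ymax:
        problems.append("box has non-positive width or height")
    if xmin < 0 or ymin < 0 or xmax > image_width or ymax > image_height:
        problems.append("box extends outside the image")
    return problems


classes = {"person", "bicycle", "car", "motorbike", "bus", "truck"}
sample = "car 0.0 0 -1 141 50 500 330 -1 -1 -1 -1 -1 -1 -1"
print(check_kitti_line(sample, 640, 480, classes))  # label against a 640x480 image
print(check_kitti_line(sample, 480, 480, classes))  # same label after a resize to 480x480
```

The second call flags the box because xmax (500) exceeds the 480-pixel width, which is the kind of mismatch that appears when images are resized but labels are not rescaled.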

Hi Morganh,

Thanks for the pointers.

  1. I am currently using 4 GPUs with a batch size of 16 per GPU. I have tried a batch size of 6 per GPU as well, but the results did not improve. Should I try a batch size of 4 per GPU with 4 GPUs, or an overall batch size of 4?

  2. Yesterday, I clubbed motorbike and bicycle class to cyclist as given in the example config and car, bus and truck to car using class mapping in the config. I used the same config parameters as given in the sample. However, the performance deteriorated. Please note that with sample KITTI dataset, the mean average precision is quite high for all 3 classes. I will try just using bicycle, person and car without clubbing and will let you know the results.

  3. Should I try (640,480) or (720,480), given that the sample uses (1248,384)? Maybe I should not use a square input size?

I will recheck correctness of all labels. However, I have visualized multiple times to make sure the data and labels are matched. I can upload some sample images and labels if you wish to cross check. Please let me know.

Thanks.

Hi Morganh,

Should the “load_graph” parameter be set to true or false in the training config file?

Please let me know.

Thanks.

Hi neophyte1,
More pointers are below. You can try one or several of them in further experiments.

  1. Could you check the raw image sizes in your VOC/COCO dataset and calculate the raw aspect ratio? If the width/height in the training spec differs too much from the raw size and aspect ratio, the result will not be good.

  2. Does the dataset have a lot of small cars and trucks? If the targets are small, a low AP can be expected.

  3. In your spec, the car class_weight is too small. Please increase it.
    Person class_weight 4 , bbox weight 10
    Bicycle class_weight 8 , bbox weight 1
    Car class_weight 1 , bbox weight 10
    Motorbike class_weight 8 , bbox weight 1
    Bus class_weight 8 , bbox weight 1
    Truck class_weight 8 , bbox weight 1

  4. minimum_bounding_box_height: 20
    Could you reduce it to 10? If there are a lot of small targets, this setting filters them out.

  5. minimum_detection_ground_truth_overlap {
    key: "car"
    value: 0.699999988079
    }
    The car IoU threshold is 0.7 during evaluation, while it is 0.5 for all other classes.

Also, we have no explicit guidance in the doc about how to tune hyper-parameters, so more experiments are expected. The “load_graph” parameter is set to false by default in the training config file. But for a pruned model, please remember to set this parameter to true. See the TLT doc for details.
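
To illustrate why the stricter car threshold matters, the sketch below computes the standard intersection-over-union of two axis-aligned boxes. The two boxes are made-up values chosen for illustration, not data from this thread.

```python
def iou(a, b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


ground_truth = (100, 100, 200, 200)  # illustrative 100x100 box
detection = (110, 110, 210, 210)     # same size, shifted by 10 pixels
overlap = iou(ground_truth, detection)

print(f"IoU = {overlap:.3f}")
print("counts as a match at 0.5:", overlap >= 0.5)
print("counts as a match at 0.7:", overlap >= 0.7)
```

A detection that is off by only 10 pixels on a 100-pixel object is accepted at the 0.5 threshold but rejected at 0.7, so the car AP is measured more strictly than the AP of the other classes.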

Hi Morganh,

Thanks for the pointers. Many of them worked. However, I still have doubts regarding the batch size parameter. Following are the results of some experiments I ran on just the VOC dataset, for 3 classes and 22 epochs, with the default configuration:

No. of GPUs : 4
Batch Size per GPU : 4
Average Precision (%):

bicycle : 3.61833
car : 0
person : 12.6573

No. of GPUs : 1
Batch Size per GPU : 4
Average Precision (%):

bicycle : 30
car : 0.5
person : 28

Please do not focus on precision of car class as I did not tweak the parameters as suggested by you for “car” class in this experiment. I have fixed the precision for car class by adding more data from coco dataset and tweaking the parameters as suggested by you.

Kindly let me know if these observations seem correct. If the results are to be believed, then I have the following queries:

  1. How can I make the training work with multiple GPUs?
  2. How can I make the training work with a larger batch size per GPU?

Please let me know and thanks for the pointers again.

Thanks.

Hi neophyte1,
1) See https://docs.nvidia.com/metropolis/TLT/tlt-getting-started-guide/index.html#training_models
The tlt-train command supports multi-GPU training. You can invoke a multi-GPU training session by using the --gpus N option, where N is the number of GPUs you want to use. N must be less than the number of GPUs available in the given node for training.
2) batch_size_per_gpu can be configured in the spec file.
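
When comparing the runs discussed in this thread, it may help to look at the overall batch size rather than the per-GPU value alone. The sketch below assumes plain data-parallel training, where each GPU processes its own batch of batch_size_per_gpu images per step; that is an assumption about how the multi-GPU session behaves, and `effective_batch_size` is a hypothetical helper.

```python
def effective_batch_size(num_gpus, batch_size_per_gpu):
    """Images processed per training step, assuming data-parallel training."""
    return num_gpus * batch_size_per_gpu


# (number of GPUs, batch_size_per_gpu) combinations mentioned in the thread.
runs = [(4, 16), (4, 6), (4, 4), (1, 4)]
for gpus, per_gpu in runs:
    total = effective_batch_size(gpus, per_gpu)
    print(f"{gpus} GPU(s) x {per_gpu} per GPU = {total} images per step")
```

Under that assumption, the 4-GPU run with batch size 4 sees 16 images per step while the single-GPU run sees 4, so the two runs are not directly comparable with identical learning-rate settings.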

Hi Morganh,

Let me update you on my progress. I am first trying to achieve good accuracy, so I opted for a ResNet-18 backbone. After training, I got really impressive results by following your guidelines. However, I do not understand how to prune and retrain the ResNet-18 model for my dataset. Pruning and retraining worked with the ResNet-10 backbone using the “prune threshold” value set in the example, but when I use the same value for the ResNet-18 model I get terrible results. Following are the results of pruning and retraining using pth = 5.2e-6.

Results of training :

Validation cost: 0.002584
Mean average_precision (in %): 30.4075

class name      average precision (in %)
------------  --------------------------
bicycle                          10.2874
car                              32.4107
person                           48.5244

Median Inference Time: 0.007108

Results after pruning and retraining:

class name      average precision (in %)
------------  --------------------------
bicycle                          0
car                              1.67
person                           25.30

Please help me out to set the right parameters.

Thanks.