Help with Detectnet_V2 train config file (TAO)

I have been using TAO to train custom, single-class detectnet_v2 networks with a resnet18 backbone on 1080p RGB images. This is the object/target that I am training on:

While the networks are not perfect, I have had great success deploying them for our use case. However, there are a few issues/cases I am running into that I would like to fix.

When the object/target is far away/small, the network renders a near-perfect bounding box encapsulating the target:

However, as the target gets closer, the neural network loses detection completely or begins to “split” the target:

Current Improvement:
Over the last couple of days, I have been trying to learn about all the different parameters in the Detectnet_V2 training config, with some success. My training config file now looks like this:

random_seed: 42
dataset_config {
  data_sources {
    tfrecords_path: "/workspace/tlt-experiments/data/tfrecords_target/kitti_train/*"
    image_directory_path: "/workspace/tlt-experiments/data/Set_target/training"
  }
  image_extension: "png"
  target_class_mapping {
    key: "target"
    value: "target"
  }
  validation_data_source: {
    tfrecords_path: "/workspace/tlt-experiments/data/tfrecords_target/kitti_val/*"
    image_directory_path: "/workspace/tlt-experiments/data/Set_target/val"
  }
}
augmentation_config {
  preprocessing {
    output_image_width: 1920
    output_image_height: 1088
    min_bbox_width: 8.0
    min_bbox_height: 8.0
    output_image_channel: 3
  }
  spatial_augmentation {
    hflip_probability: 0.5
    vflip_probability: 0.5
    zoom_min: 1.0
    zoom_max: 1.0
    translate_max_x: 32.0
    translate_max_y: 32.0
    rotate_rad_max: 0.69
  }
  color_augmentation {
    hue_rotation_max: 25.0
    saturation_shift_max: 0.25
    contrast_scale_max: 0.1
    contrast_center: 0.5
  }
}
postprocessing_config {
  target_class_config {
    key: "target"
    value {
      clustering_config {
        clustering_algorithm: DBSCAN
        dbscan_confidence_threshold: 0.5
        coverage_threshold: 0.005
        dbscan_eps: 0.7
        dbscan_min_samples: 0.05
        minimum_bounding_box_height: 8
      }
    }
  }
}
model_config {
  pretrained_model_file: "/workspace/tlt-experiments/detectnet_v2/pretrained_resnet18/resnet18.hdf5"
  freeze_blocks: 0
  freeze_blocks: 1
  num_layers: 18
  use_pooling: false
  use_batch_norm: true
  dropout_rate: 0.5
  objective_set {
    bbox {
      scale: 35.0
      offset: 0.5
    }
    cov {
    }
  }
  arch: "resnet"
}
evaluation_config {
  validation_period_during_training: 5
  first_validation_epoch: 30
  minimum_detection_ground_truth_overlap {
    key: "target"
    value: 0.6
  }
  evaluation_box_config {
    key: "target"
    value {
      minimum_height: 8
      maximum_height: 1088
      minimum_width: 8
      maximum_width: 1920
    }
  }
  average_precision_mode: INTEGRATE
}
cost_function_config {
  target_classes {
    name: "target"
    class_weight: 1.0
    coverage_foreground_weight: 0.05
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 10.0
    }
  }
  enable_autoweighting: true
  max_objective_weight: 0.9999
  min_objective_weight: 0.0001
}
training_config {
  batch_size_per_gpu: 4
  num_epochs: 40
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 2e-06
      max_learning_rate: 2e-05
      soft_start: 0.1
      annealing: 0.6
    }
  }
  regularizer {
    type: L1
    weight: 3e-9
  }
  optimizer {
    adam {
      epsilon: 1e-08
      beta1: 0.9
      beta2: 0.999
    }
  }
  cost_scaling {
    initial_exponent: 20.0
    increment: 0.005
    decrement: 1.0
  }
  checkpoint_interval: 25
}
bbox_rasterizer_config {
  target_class_config {
    key: "target"
    value {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 1.0
      cov_radius_y: 1.0
      bbox_min_radius: 1.0
    }
  }
  deadzone_radius: 0.2
}
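
As a sanity check on the training_config above, here is my understanding of the soft_start_annealing_schedule as a small sketch (based on the documented behavior; the exact ramp shape in the real implementation may differ): the learning rate climbs exponentially from min_learning_rate to max_learning_rate over the first soft_start fraction of training, holds at the maximum, then decays exponentially back after the annealing point.

```python
def soft_start_annealing_lr(progress, min_lr=2e-6, max_lr=2e-5,
                            soft_start=0.1, annealing=0.6):
    """Sketch of the LR schedule; progress is the fraction of training done (0..1)."""
    if progress < soft_start:
        # exponential ramp-up from min_lr to max_lr
        return min_lr * (max_lr / min_lr) ** (progress / soft_start)
    elif progress < annealing:
        # hold at the maximum learning rate
        return max_lr
    else:
        # exponential decay back toward min_lr
        frac = (progress - annealing) / (1.0 - annealing)
        return max_lr * (min_lr / max_lr) ** frac
```

With the values in my config, the LR peaks at 2e-05 between 10% and 60% of training and is back at 2e-06 by the final epoch.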

Changes from previous config file to this one:

  • Epochs: 60 → 40
    I was worried about the model overfitting on a dataset of only 60k images.

  • Freeze blocks: 0,1,2 → 0,1

  • dbscan_eps: 0.3 → 0.7
    Since the network seemed to split detections, I suspected that nearby detections were not being clustered together properly, so I increased this per the description here (DetectNet_v2 — TAO Toolkit 3.22.05 documentation).

  • deadzone_radius: 0.6 → 0.2
    Since the target is a circle and the bounding box should ideally circumscribe it, I calculated deadzone_radius as 1 - (circle_area_of_radius_r / square_area_of_side_2r) = 1 - π/4 ≈ 0.2, i.e. the fraction of the bounding box that is not the target.

  • cov_radius_x: 0.5 → 1.0

  • cov_radius_y: 0.5 → 1.0
    Since the bounding box should ideally circumscribe the target, the coverage radius for x and y should be 1.0

  • vflip_probability: 0.0 → 0.5
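
The deadzone_radius arithmetic above can be checked in a couple of lines:

```python
import math

# A circle of radius r inscribed in its bounding square of side 2r covers
# pi*r^2 / (2r)^2 = pi/4 of the box; the rest is background ("deadzone").
circle_fraction = math.pi / 4
deadzone = 1 - circle_fraction  # ~0.215, rounded down to 0.2 in the config
```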

If my reasoning for changing any of these parameters is wrong, please correct me. I have also been trying to look into coverage_foreground_weight, but the explanation (Tlt spec file - cost function - #4 by Morganh) leaves me confused about what coverage_foreground_weight is supposed to represent.

The neural network trained on this config file (using the same dataset as before) was able to track the target when it was larger/closer and fixed some of the “splitting”. Here are some outputs:



Image (1) shows improvement in the “splitting” but still does not encompass the entire target.

Image (2) shows that the new/improved network is able to detect a larger/closer target, but it exhibits the same issues as (1), only worse: the splitting worsens as the target gets closer and larger.

Image (3) is of a sub-class of the target that the network has also been trained to detect, and it re-demonstrates what (1) shows on a different target. The red bounding box is the output from the previous network; the green bounding box is the output from the current network.

Questions and Help:
Could you provide any guidance on, or critique of, the training config file or other parts of the training process to help remedy any of the following issues:

  1. Detections splitting when the target is too close
  2. Wrongly sized detections when the target is too close
  3. No detections at all when the target is too close

Additional Info:
All example images of network output have been cropped from their original 1080p frames for internal reasons. If desired, I can provide the full images in a private context.

Our dataset is roughly 60k 1080p RGB images hand-labeled in the KITTI format, with just the class name and bounding box fields being non-zero. While the dataset does not include many close-up/large images of the target, I would still expect the network to be able to detect them. Here are some statistics on the distribution of target bounding boxes in the dataset:

Width: Mean=103.318 px, Min=14, Max=1006
Height: Mean=75.932 px, Min=6, Max=1076
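
For reference, statistics like these can be gathered from KITTI label files with something along these lines (a hypothetical helper, not the exact script I used; it assumes the standard KITTI layout with the bbox in fields 5-8):

```python
import glob

def bbox_stats(label_dir):
    """Collect bbox width/height stats from KITTI .txt labels in label_dir."""
    widths, heights = [], []
    for path in glob.glob(f"{label_dir}/*.txt"):
        with open(path) as f:
            for line in f:
                fields = line.split()
                if len(fields) < 8:
                    continue
                # KITTI: class truncated occluded alpha x1 y1 x2 y2 ...
                x1, y1, x2, y2 = map(float, fields[4:8])
                widths.append(x2 - x1)
                heights.append(y2 - y1)
    summarize = lambda v: {"mean": sum(v) / len(v), "min": min(v), "max": max(v)}
    return summarize(widths), summarize(heights)
```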

Width/Height Distribution:
Bounding Box Width/Height Distribution Histogram
Bounding Box Positional Heatmap Distribution

Does the tendency for the bounding boxes to sit in the middle of the image and/or be small (100-200 px wide) affect training? If so, can this be addressed with the augmentation_config's zoom_min/zoom_max and translate_max_x/translate_max_y properties?
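
If zoom/translate augmentation is the right lever here, I imagine the spatial_augmentation block would change along these lines (the values below are purely illustrative guesses, not tuned):

  spatial_augmentation {
    hflip_probability: 0.5
    vflip_probability: 0.5
    zoom_min: 0.7          # allow zooming out, producing smaller targets
    zoom_max: 1.5          # allow zooming in, synthesizing closer/larger targets
    translate_max_x: 96.0  # wider translation to move targets off-center
    translate_max_y: 96.0
    rotate_rad_max: 0.69
  }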

(While gathering these statistics, I did discover some wrong bounding boxes (fewer than 25 in a dataset of 60k), so I will be retraining this weekend just to be sure.)