Overview:
I have been using TAO to train custom, single-class detectnet_v2 networks with a resnet18 backbone on 1080p RGB images. This is the object/target that I am training on:
While the networks are not perfect, I have had great success deploying them for our use case. However, there are a few issues/cases I am running into that I would like to fix.
Behavior:
When the object/target is far away/small, the network renders a near-perfect bounding box encapsulating the target:
However, as the target gets closer, the neural network loses detection completely or begins to “split” the target:
Current Improvement:
Over the last couple of days, I have been trying to learn about the different parameters in the DetectNet_v2 training config, with some success. My training config file now looks like this:
random_seed: 42
dataset_config {
data_sources {
tfrecords_path: "/workspace/tlt-experiments/data/tfrecords_target/kitti_train/*"
image_directory_path: "/workspace/tlt-experiments/data/Set_target/training"
}
image_extension: "png"
target_class_mapping {
key: "target"
value: "target"
}
validation_data_source: {
tfrecords_path: "/workspace/tlt-experiments/data/tfrecords_target/kitti_val/*"
image_directory_path: "/workspace/tlt-experiments/data/Set_target/val"
}
}
augmentation_config {
preprocessing {
output_image_width: 1920
output_image_height: 1088
min_bbox_width: 8.0
min_bbox_height: 8.0
output_image_channel: 3
}
spatial_augmentation {
hflip_probability: 0.5
vflip_probability: 0.5
zoom_min: 1.0
zoom_max: 1.0
translate_max_x: 32.0
translate_max_y: 32.0
rotate_rad_max: 0.69
}
color_augmentation {
hue_rotation_max: 25.0
saturation_shift_max: 0.25
contrast_scale_max: 0.1
contrast_center: 0.5
}
}
postprocessing_config {
target_class_config {
key: "target"
value {
clustering_config {
clustering_algorithm: DBSCAN
dbscan_confidence_threshold: 0.5
coverage_threshold: 0.005
dbscan_eps: 0.7
dbscan_min_samples: 0.05
minimum_bounding_box_height: 8
}
}
}
}
model_config {
pretrained_model_file: "/workspace/tlt-experiments/detectnet_v2/pretrained_resnet18/resnet18.hdf5"
freeze_blocks: 0
freeze_blocks: 1
num_layers: 18
use_pooling: false
use_batch_norm: true
dropout_rate: 0.5
objective_set {
bbox {
scale: 35.0
offset: 0.5
}
cov {
}
}
arch: "resnet"
}
evaluation_config {
validation_period_during_training: 5
first_validation_epoch: 30
minimum_detection_ground_truth_overlap {
key: "target"
value: 0.6
}
evaluation_box_config {
key: "target"
value {
minimum_height: 8
maximum_height: 1088
minimum_width: 8
maximum_width: 1920
}
}
average_precision_mode: INTEGRATE
}
cost_function_config {
target_classes {
name: "target"
class_weight: 1.0
coverage_foreground_weight: 0.05
objectives {
name: "cov"
initial_weight: 1.0
weight_target: 1.0
}
objectives {
name: "bbox"
initial_weight: 10.0
weight_target: 10.0
}
}
enable_autoweighting: true
max_objective_weight: 0.9999
min_objective_weight: 0.0001
}
training_config {
batch_size_per_gpu: 4
num_epochs: 40
learning_rate {
soft_start_annealing_schedule {
min_learning_rate: 2e-06
max_learning_rate: 2e-05
soft_start: 0.1
annealing: 0.6
}
}
regularizer {
type: L1
weight: 3e-9
}
optimizer {
adam {
epsilon: 1e-08
beta1: 0.9
beta2: 0.999
}
}
cost_scaling {
initial_exponent: 20.0
increment: 0.005
decrement: 1.0
}
checkpoint_interval: 25
}
bbox_rasterizer_config {
target_class_config {
key: "target"
value {
cov_center_x: 0.5
cov_center_y: 0.5
cov_radius_x: 1.0
cov_radius_y: 1.0
bbox_min_radius: 1.0
}
}
deadzone_radius: 0.2
}
Changes from the previous config file to this one:
- Epochs: 60 → 40
  I was worried about the model overfitting on a dataset of only 60k images.
- Freeze blocks: 0,1,2 → 0,1
- dbscan_eps: 0.3 → 0.7
  Since the network seemed to split detections, I suspected they were not being clustered together properly, so I increased this value per the description here (DetectNet_v2 — TAO Toolkit 3.22.05 documentation).
- deadzone_radius: 0.6 → 0.2
  Since the target is a circle and the bounding box should ideally circumscribe the target/circle, I calculated the deadzone_radius as 1 - (circle area of radius r) / (square area of side 2r) = 1 - π/4 ≈ 0.2, i.e. the fraction of the bounding box that is not covered by the target (worked out just after this list).
- cov_radius_x: 0.5 → 1.0
- cov_radius_y: 0.5 → 1.0
  Since the bounding box should ideally circumscribe the target, the coverage radius for x and y should be 1.0.
- vflip_probability: 0.0 → 0.5
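For reference, the arithmetic behind the 0.2 value (a standalone sketch, not TAO code; the exact figure is 1 - π/4 ≈ 0.215):

import math

# Fraction of a square bounding box of side 2r that is NOT covered by an
# inscribed circular target of radius r (the ratio is scale-free).
r = 1.0
circle_area = math.pi * r ** 2          # area of the circular target
bbox_area = (2 * r) ** 2                # area of the circumscribing square box
deadzone = 1 - circle_area / bbox_area  # portion of the box outside the circle
print(round(deadzone, 3))               # prints 0.215, which I rounded to 0.2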
If my reasoning for changing any of these parameters is wrong, please correct me. Additionally, I have been trying to look into coverage_foreground_weight, but the explanation (Tlt spec file - cost function - #4 by Morganh) leaves me confused as to what coverage_foreground_weight is supposed to represent.
The neural network trained on this config file (using the same dataset as before) was able to track the target when it was larger/closer and fixed some of the “splitting”. Here are some outputs:
(1)
(2)
(3)
Image (1) shows improvement in the “splitting”, but the box still does not encompass the entire target.
Image (2) shows that the new network is able to detect a larger/closer target, but it exhibits the same issue as (1), only worse; the splitting worsens as the target gets closer/larger.
Image (3) is of a sub-class of the target that the network has also been trained to detect and demonstrates the same behavior as (1) on a different target. The red bounding box is the output of the previous network and the green bounding box is the output of the current network.
Questions and Help:
I would appreciate any guidance on, or critique of, the training config file or other parts of the training process to help remedy any of the following issues:
- Detection splitting when too close
- Wrong dimensions detection when too close
- No detections when too close
Additional Info:
All example images of the network output have been cropped from their original 1080p images for internal reasons. If desired, I can provide the full images in a private context.
Our dataset is roughly 60k 1080p RGB images hand-labeled in the KITTI format, with only the class name and bounding box fields being non-zero. While the dataset does not include many close-up/large images of the target, I would still expect the network to be able to detect them. Here are some data on the distribution of target bounding boxes in the dataset:
Width: mean = 103.318 px, min = 14 px, max = 1006 px
Height: mean = 75.932 px, min = 6 px, max = 1076 px
Width/Height Distribution:
Does the tendency for the bounding boxes to sit in the middle of the image and/or to be on the smaller side (100-200 px wide) have an effect on training? If so, can this be addressed with the augmentation_config's zoom_min/zoom_max and translate_max_x/translate_max_y properties? (See the sketch below for what I have in mind.)
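If zoom/translate augmentation is the right lever, something like the spatial_augmentation block below is what I have in mind. The exact values are guesses, and I am assuming zoom ratios above 1.0 magnify the image (making the target appear closer/larger); please correct me if the convention is the other way around.

spatial_augmentation {
  hflip_probability: 0.5
  vflip_probability: 0.5
  # Assumed convention: ratios > 1.0 zoom in, so the target appears closer/larger.
  zoom_min: 0.8
  zoom_max: 1.5
  # Wider translation range to move targets away from the image center.
  translate_max_x: 96.0
  translate_max_y: 96.0
  rotate_rad_max: 0.69
}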
(While grabbing these statistics, I did discover that there were some wrong bounding boxes (fewer than 25 in a dataset of 60k), so I will be retraining this weekend just to be sure.)
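In case anyone wants to reproduce the numbers above, a sketch along these lines over the KITTI label files is roughly what is involved; the label directory path and image bounds are placeholders, and only the bbox fields (columns 5-8) are read since everything else is zero in our labels.

import glob
import os

LABEL_DIR = "/path/to/kitti/labels"  # placeholder for the actual label directory
IMG_W, IMG_H = 1920, 1080            # original image size before padding to 1088

widths, heights, suspect = [], [], []
for label_file in glob.glob(os.path.join(LABEL_DIR, "*.txt")):
    with open(label_file) as f:
        for line in f:
            fields = line.split()
            if len(fields) < 8:
                continue
            # KITTI columns: class, truncated, occluded, alpha, xmin, ymin, xmax, ymax, ...
            xmin, ymin, xmax, ymax = map(float, fields[4:8])
            w, h = xmax - xmin, ymax - ymin
            widths.append(w)
            heights.append(h)
            # Flag inverted boxes or boxes falling outside the image.
            if w <= 0 or h <= 0 or xmin < 0 or ymin < 0 or xmax > IMG_W or ymax > IMG_H:
                suspect.append((label_file, line.strip()))

if widths:
    print(f"Width:  mean={sum(widths) / len(widths):.3f} px, min={min(widths)}, max={max(widths)}")
    print(f"Height: mean={sum(heights) / len(heights):.3f} px, min={min(heights)}, max={max(heights)}")
print(f"Suspect boxes: {len(suspect)}")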