Relationship between training dataset size and inference data size

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc)
Ubuntu, x86, RTX3090
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc)
Detectnet_v2
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
• Training spec file(If have, please share here)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

I'm using TAO to retrain my custom model based on DetectNet_v2 (ResNet-18).
Context 1:
The original images in my private dataset vary in aspect ratio and resolution. I resized them all to 800x608 with an image resize tool so they are compatible with the TAO training requirements.
Question 1:
The resize tool stretches by ratio rather than cropping, so an image (and the objects in it) can be distorted. Does this affect inference later on, or am I misunderstanding something?
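To illustrate what I mean, here is a minimal Python sketch (assuming OpenCV; the helper names are just for illustration, not part of TAO) contrasting a stretch resize, which distorts objects, with a letterbox resize, which pads instead:

import cv2

def stretch_resize(img, width=800, height=608):
    # Scales both axes independently; objects are distorted whenever the
    # source aspect ratio differs from 800:608.
    return cv2.resize(img, (width, height))

def letterbox_resize(img, width=800, height=608):
    # Scales by a single factor and pads the remainder, so object
    # aspect ratio is preserved (labels must be scaled/offset the same way).
    h, w = img.shape[:2]
    scale = min(width / w, height / h)
    new_w, new_h = int(w * scale), int(h * scale)
    resized = cv2.resize(img, (new_w, new_h))
    pad_w, pad_h = width - new_w, height - new_h
    return cv2.copyMakeBorder(resized, 0, pad_h, 0, pad_w,
                              cv2.BORDER_CONSTANT, value=(0, 0, 0))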

Context 2:
After exporting the model to run inference in DeepStream 6, I prepared a local 1920x1080 video file. I also noticed the parameter input-dims (channel; height; width; input-order, all integers ≥ 0) in the pgie config file. With the value 3;1080;1920;0 the accuracy is visibly poor, with many false-positive bounding boxes (boxes reporting the target object in an actually empty area); but with 3;608;800;0 the accuracy is much better.
Question 2:
When and to what should I change the value of the input-dims parameter, given that the inference source resolution can vary (different cameras)?
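For reference, the relevant part of my pgie config looks roughly like this (a sketch; the exact property key can vary between DeepStream versions):

[property]
# network input dimensions of the trained model: channels;height;width;input-order
input-dims=3;608;800;0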

Question 3:
I also noticed that for the same inference source (e.g. an RTSP stream), entering a different width and height in input-dims, even when keeping the same aspect ratio, can cause a huge difference in detection accuracy.

This may affect the training result and inference. How about the mAP after training?

Can you share your training spec? What is the model width and height? The input dims depend on it.

The input-dims should not change.

  1. The original images of varying resolution are all first resized (keeping the ratio) to 800x608 and then placed into image_2 and label_2, and training validation runs against these resized images as well, correct? I can see the mAP is good in both the training and retraining (pruned) stages; below is the training mAP:

Validation cost: 0.000132
Mean average_precision (in %): 85.1225

class name           average precision (in %)
door_warning_sign    80.8844
electric_bicycle     80.9867
people               93.4964

detectnet_v2_tfrecords_kitti_trainval.txt

TFrecords conversion spec file for kitti training
kitti_config {
root_directory_path: "/workspace/tao-experiments/data/training"
image_dir_name: "image_2"
label_dir_name: "label_2"
image_extension: ".jpg"
partition_mode: "random"
num_partitions: 2
val_split: 6
num_shards: 10
}
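For context, this spec is the one passed to the TFRecords conversion step, roughly like the following (the spec and output paths are placeholders for my setup):

tao detectnet_v2 dataset_convert \
  -d /workspace/tao-experiments/detectnet_v2/specs/detectnet_v2_tfrecords_kitti_trainval.txt \
  -o /workspace/tao-experiments/data/tfrecords/kitti_trainval/kitti_trainval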

detectnet_v2_train_resnet18_kitti.txt:

random_seed: 42
dataset_config {
data_sources {
tfrecords_path: "/workspace/tao-experiments/data/tfrecords/kitti_trainval/*"
image_directory_path: "/workspace/tao-experiments/data/training"
}
image_extension: "jpg"
target_class_mapping {
key: "door_warning_sign"
value: "door_warning_sign"
}
target_class_mapping {
key: "people"
value: "people"
}
target_class_mapping {
key: "electric_bicycle"
value: "electric_bicycle"
}
validation_fold: 0
}
augmentation_config {
preprocessing {
output_image_width: 800
output_image_height: 608
min_bbox_width: 1.0
min_bbox_height: 1.0
output_image_channel: 3
}
spatial_augmentation {
hflip_probability: 0.5
zoom_min: 1.0
zoom_max: 1.0
translate_max_x: 8.0
translate_max_y: 8.0
}
color_augmentation {
hue_rotation_max: 25.0
saturation_shift_max: 0.20000000298
contrast_scale_max: 0.10000000149
contrast_center: 0.5
}
}
postprocessing_config {
target_class_config {
key: "door_warning_sign"
value {
clustering_config {
clustering_algorithm: DBSCAN
dbscan_confidence_threshold: 0.9
coverage_threshold: 0.00499999988824
dbscan_eps: 0.20000000298
dbscan_min_samples: 0.0500000007451
minimum_bounding_box_height: 10
}
}
}
target_class_config {
key: "people"
value {
clustering_config {
clustering_algorithm: DBSCAN
dbscan_confidence_threshold: 0.9
coverage_threshold: 0.00499999988824
dbscan_eps: 0.15000000596
dbscan_min_samples: 0.0500000007451
minimum_bounding_box_height: 20
}
}
}
target_class_config {
key: "electric_bicycle"
value {
clustering_config {
clustering_algorithm: DBSCAN
dbscan_confidence_threshold: 0.9
coverage_threshold: 0.00749999983236
dbscan_eps: 0.230000004172
dbscan_min_samples: 0.0500000007451
minimum_bounding_box_height: 20
}
}
}
}
model_config {
pretrained_model_file: "/workspace/tao-experiments/detectnet_v2/pretrained_resnet18/pretrained_detectnet_v2_vresnet18/resnet18.hdf5"
num_layers: 18
use_batch_norm: true
objective_set {
bbox {
scale: 35.0
offset: 0.5
}
cov {
}
}
arch: "resnet"
}
evaluation_config {
validation_period_during_training: 10
first_validation_epoch: 20
minimum_detection_ground_truth_overlap {
key: "door_warning_sign"
value: 0.4
}
minimum_detection_ground_truth_overlap {
key: "people"
value: 0.5
}
minimum_detection_ground_truth_overlap {
key: "electric_bicycle"
value: 0.5
}
evaluation_box_config {
key: "door_warning_sign"
value {
minimum_height: 10
maximum_height: 9999
minimum_width: 14
maximum_width: 9999
}
}
evaluation_box_config {
key: "people"
value {
minimum_height: 20
maximum_height: 9999
minimum_width: 10
maximum_width: 9999
}
}
evaluation_box_config {
key: "electric_bicycle"
value {
minimum_height: 20
maximum_height: 9999
minimum_width: 10
maximum_width: 9999
}
}
average_precision_mode: INTEGRATE
}
cost_function_config {
target_classes {
name: "door_warning_sign"
class_weight: 10.0
coverage_foreground_weight: 0.0500000007451
objectives {
name: “cov”
initial_weight: 1.0
weight_target: 1.0
}
objectives {
name: “bbox”
initial_weight: 10.0
weight_target: 10.0
}
}
target_classes {
name: "people"
class_weight: 5.0
coverage_foreground_weight: 0.0500000007451
objectives {
name: “cov”
initial_weight: 1.0
weight_target: 1.0
}
objectives {
name: “bbox”
initial_weight: 10.0
weight_target: 1.0
}
}
target_classes {
name: "electric_bicycle"
class_weight: 5.0
coverage_foreground_weight: 0.0500000007451
objectives {
name: “cov”
initial_weight: 1.0
weight_target: 1.0
}
objectives {
name: “bbox”
initial_weight: 10.0
weight_target: 10.0
}
}
enable_autoweighting: true
max_objective_weight: 0.999899983406
min_objective_weight: 9.99999974738e-05
}
training_config {
batch_size_per_gpu: 4
num_epochs: 80
learning_rate {
soft_start_annealing_schedule {
min_learning_rate: 5e-06
max_learning_rate: 5e-04
soft_start: 0.10000000149
annealing: 0.699999988079
}
}
regularizer {
type: L1
weight: 3.00000002618e-09
}
optimizer {
adam {
epsilon: 9.99999993923e-09
beta1: 0.899999976158
beta2: 0.999000012875
}
}
cost_scaling {
initial_exponent: 20.0
increment: 0.005
decrement: 1.0
}
checkpoint_interval: 10
}
bbox_rasterizer_config {
target_class_config {
key: "door_warning_sign"
value {
cov_center_x: 0.5
cov_center_y: 0.5
cov_radius_x: 0.40000000596
cov_radius_y: 0.40000000596
bbox_min_radius: 1.0
}
}
target_class_config {
key: "people"
value {
cov_center_x: 0.5
cov_center_y: 0.5
cov_radius_x: 1.0
cov_radius_y: 1.0
bbox_min_radius: 1.0
}
}
target_class_config {
key: "electric_bicycle"
value {
cov_center_x: 0.5
cov_center_y: 0.5
cov_radius_x: 1.0
cov_radius_y: 1.0
bbox_min_radius: 1.0
}
}
deadzone_radius: 0.400000154972
}
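For completeness, training is launched against this spec with something like the following (the results directory and encryption key are placeholders):

tao detectnet_v2 train \
  -e /workspace/tao-experiments/detectnet_v2/specs/detectnet_v2_train_resnet18_kitti.txt \
  -r /workspace/tao-experiments/detectnet_v2/experiment_dir_unpruned \
  -k <encryption_key>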

Since training and validation are all based on the ratio-resized images, does this mean the model may have learned the distorted objects, correct?

  1. For my scenario, the video sources at inference time may have different resolutions; the camera at hand has a resolution of 1280x960. What are the recommended input-dims values?

Yes.

Yes.

Just set 3;608;800;0

The aspect ratio of the target objects in the training dataset should stay as close as possible to that in the inference video source, correct?

Which way does input-dims resize: keeping the ratio, or cropping (padding)? This should align with the training dataset's resize algorithm, correct?

I think you are running DeepStream, so you need not care about resizing the test video.
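To be precise, nvinfer scales each decoded frame to the network input resolution (800x608 for your model) by default; if you want it to keep the aspect ratio and pad instead, there is a maintain-aspect-ratio property (a sketch; check the nvinfer documentation for your DeepStream version):

[property]
# 0 (default): stretch frames to the network resolution; 1: scale keeping aspect ratio, with padding
maintain-aspect-ratio=1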

But different values in input-dims greatly impact the detection result; I'm testing with my own video.

The input-dims cannot be changed; it is tied to your model.
You can only set it to 3;608;800;0, according to your training spec.

Thanks, Morgan.

Since my inference camera's video resolution is a fixed value (currently 1280x960), does this imply I can resize my entire training dataset to 1280x960 as well, and could that help improve inference detection accuracy?

Yes, you can train a new model.
Just enable enable_auto_resize and change output_image_width and output_image_height in the training spec:

enable_auto_resize: true
output_image_width: 1280
output_image_height: 960

Refer to DetectNet_v2 — TAO Toolkit 3.22.05 documentation
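Applied to your spec, the change would sit in the preprocessing block, roughly like this (verify against the linked documentation for your TAO version):

augmentation_config {
preprocessing {
output_image_width: 1280
output_image_height: 960
enable_auto_resize: true
min_bbox_width: 1.0
min_bbox_height: 1.0
output_image_channel: 3
}
...
}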

Thanks, Morgan.
So the key point is that the training dataset's image size should match the inference source as closely as possible, correct?

Usually it is better to run inference against a test dataset that is similar to the training dataset.

