Troubles Replicating TLT Model Training Experiment with TAO

Hello,
I previously trained a model in this Docker container: nvcr.io/nvidia/tlt-streamanalytics:v2.0_py3. I used the following command and spec file:

tlt-train detectnet_v2 -e tlt_experiment_spec.txt -r <output_dir> -k <key_to_load_the_model>

dataset_config {
data_sources {
tfrecords_path: "/workspace/data/tf_records/*"
image_directory_path: "/workspace/linked_data/"
}
image_extension: "jpg"
target_class_mapping {
key: "motorcycle"
value: "motorcycle"
}
target_class_mapping {
key: "vehicle"
value: "vehicle"
}
validation_fold: 0
}
augmentation_config {
preprocessing {
output_image_width: 960
output_image_height: 544
min_bbox_width: 1.0
min_bbox_height: 25.0
output_image_channel: 3
}
spatial_augmentation {
zoom_min: 1.0
zoom_max: 1.0
}
color_augmentation {
hue_rotation_max: 25.0
saturation_shift_max: 0.2199999988079071
contrast_scale_max: 0.1599999964237213
contrast_center: 0.5
}
}
postprocessing_config {
target_class_config {
key: "motorcycle"
value {
clustering_config {
coverage_threshold: 0.20000000298023224
minimum_bounding_box_height: 20
dbscan_eps: 0.5
dbscan_min_samples: 1
}
}
}
target_class_config {
key: "vehicle"
value {
clustering_config {
coverage_threshold: 0.20000000298023224
minimum_bounding_box_height: 20
dbscan_eps: 0.5
dbscan_min_samples: 1
}
}
}
}
model_config {
pretrained_model_file: "path_to_resnet34_peoplenet.tlt"
num_layers: 34
use_batch_norm: true
objective_set {
bbox {
scale: 35.0
offset: 0.5
}
cov {
}
}
training_precision {
backend_floatx: FLOAT32
}
freeze_blocks: 0.0
arch: "resnet"
all_projections: true
}
evaluation_config {
validation_period_during_training: 10
first_validation_epoch: 10
minimum_detection_ground_truth_overlap {
key: "motorcycle"
value: 0.800000011920929
}
minimum_detection_ground_truth_overlap {
key: "vehicle"
value: 0.800000011920929
}
evaluation_box_config {
key: "motorcycle"
value {
minimum_height: 4
maximum_height: 9999
minimum_width: 4
maximum_width: 9999
}
}
evaluation_box_config {
key: "vehicle"
value {
minimum_height: 4
maximum_height: 9999
minimum_width: 4
maximum_width: 9999
}
}
average_precision_mode: INTEGRATE
}
cost_function_config {
target_classes {
name: "vehicle"
class_weight: 1.0
coverage_foreground_weight: 0.05000000074505806
objectives {
name: "cov"
initial_weight: 1.0
weight_target: 1.0
}
objectives {
name: "bbox"
initial_weight: 10.0
weight_target: 10.0
}
}
target_classes {
name: "motorcycle"
class_weight: 1.0
coverage_foreground_weight: 0.05000000074505806
objectives {
name: "cov"
initial_weight: 1.0
weight_target: 1.0
}
objectives {
name: "bbox"
initial_weight: 10.0
weight_target: 10.0
}
}
enable_autoweighting: true
max_objective_weight: 0.9998999834060669
min_objective_weight: 9.999999747378752e-05
}
training_config {
batch_size_per_gpu: 16
num_epochs: 1100
learning_rate {
soft_start_annealing_schedule {
min_learning_rate: 4.999999873689376e-06
max_learning_rate: 0.0005000000237487257
soft_start: 0.20000000298023224
annealing: 0.800000011920929
}
}
regularizer {
weight: 3.000000026176508e-09
}
optimizer {
adam {
epsilon: 9.99999993922529e-09
beta1: 0.8999999761581421
beta2: 0.9990000128746033
}
}
cost_scaling {
initial_exponent: 20.0
increment: 0.005
decrement: 1.0
}
checkpoint_interval: 5
}
bbox_rasterizer_config {
target_class_config {
key: "motorcycle"
value {
cov_center_x: 0.5
cov_center_y: 0.5
cov_radius_x: 0.4000000059604645
cov_radius_y: 0.4000000059604645
bbox_min_radius: 1.0
}
}
target_class_config {
key: "vehicle"
value {
cov_center_x: 0.5
cov_center_y: 0.5
cov_radius_x: 0.4000000059604645
cov_radius_y: 0.4000000059604645
bbox_min_radius: 1.0
}
}
deadzone_radius: 0.6700000166893005
}

This spec file and training command produced a great model with Vehicle AP ~90%. I was hoping to move to a newer version of TLT (TAO), so I took the data, spec file, and base model and tried to repeat the same experiment using the newest Docker image.

The image I used for my second experiment is nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5. I regenerated the tf_records and then ran the experiment with the exact same spec file shown above. The command I used to start training is below.

detectnet_v2 train -e tlt_experiment_spec.txt -r <output_dir> -k <key_to_load_the_model>

The results of this experiment were not nearly as good: Vehicle AP only reached ~15%. I tried adjusting some of the parameters, such as the learning rate and cost function, but the best Vehicle AP I have gotten so far is 40%.

The TLT experiment was run on a machine with a Quadro RTX 5000 GPU. I have run the TAO experiments on that machine and on another with an NVIDIA GeForce RTX 4080.

I am happy to continue tuning the model to raise the performance, but I would expect the same results after switching to the newer framework. I have read a lot of documentation looking for a configuration option I am missing or a step I have skipped for the new framework, but I haven’t found anything.

My main goal is to replicate the results that I got with the old version of TLT. Any information or documentation references on why the results would be so different would be appreciated!

Could you use the 1st experiment’s tf_records files to run the 2nd experiment?

To help with troubleshooting, I have included some graphs showing the behavior we would expect. (Sorry that they aren’t all plotted with the same utility; for this project we are jumping between machines, and the visualization tools are not consistent across them.)

Here is the vehicle AP of the original experiment (Trained using TLT with TLT generated records):


(The graph says mAP, but it is actually vehicle AP.)
The precision follows a nice logarithmic curve, getting as high as 96% vehicle AP. These are the results that we hope to reproduce with TAO.

Here is the vehicle AP of the second experiment. (Trained using TAO with the exact same config as before):
[Image: Screenshot 2023-11-20 at 11.35.44 AM]
As you can see, the model barely gets to 30%.

Over the weekend, I trained a new model using the new image (TAO) but with the tf_records generated from the old image (TLT). This experiment was conducted on the machine with a Quadro RTX 5000 GPU.
The initial results of the experiment were promising, with the Vehicle AP growing quickly and surpassing the best performance achieved with TAO tf_records. But after peaking at ~46%, the model's performance dropped off and ended up at 0.

Here is the vehicle AP of the model that was trained using TAO but with TLT generated records:

While these aren’t the best results, there is some improvement, which is promising!

The goal is still to replicate the TLT results in TAO, so suggestions for other experiments would be helpful. Some new questions have also arisen:
What is the difference between the TLT and TAO tf_record generation?
Are there different versions of the base model for TLT and TAO?
Are there different default values for the training_configs between TLT and TAO?
Are there enough differences between TLT and TAO that it is not possible to replicate experiments across versions, as I am trying to do?

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.

For tfrecord generation in TAO, some differences are noted in
https://github.com/NVIDIA/tao_tensorflow1_backend/blob/c7a3926ddddf3911842e057620bceb45bb5303cc/nvidia_tao_tf1/cv/detectnet_v2/dataloader/build_dataloader.py#L251-L276.

Do you mean the pretrained model? For the PeopleNet model you are using, see
PeopleNet | NVIDIA NGC; there are different versions of the unpruned model.

Actually, for the detectnet_v2 network, there are not many changes in the training config. In TAO, there is “enable_auto_resize”. You can set it to true. More info can be found in DetectNet_v2 - NVIDIA Docs.
It is a flag to enable automatic resize during training. When it is set to true, offline resizing before training is no longer required. Enabling it can increase the training time.
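
As a rough sketch of what that change might look like, the fragment below reuses the preprocessing values from the spec posted earlier in this thread and adds the flag. Placing enable_auto_resize inside the preprocessing block of augmentation_config is an assumption here; please verify the exact location against the DetectNet_v2 docs for your TAO version.

```
augmentation_config {
  preprocessing {
    output_image_width: 960
    output_image_height: 544
    min_bbox_width: 1.0
    min_bbox_height: 25.0
    output_image_channel: 3
    # Assumed placement: lets training resize inputs to the
    # output size above instead of requiring offline resizing.
    enable_auto_resize: true
  }
  # spatial_augmentation and color_augmentation stay as in the original spec
}
```

If the training images are not already 960x544, this is one of the first things worth checking, since images that were not resized offline would otherwise not match the output size in the spec.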

To narrow this down, I suggest running with the KITTI dataset mentioned in the detectnet_v2 notebook to check whether the same behavior occurs. For the KITTI dataset, all the images are 1248x384.
For TAO, the spec is https://github.com/NVIDIA/tao_tutorials/blob/95aca39c79cb9068593a6a9c3dcc7a509f4ad786/notebooks/tao_launcher_starter_kit/detectnet_v2/specs/detectnet_v2_train_resnet18_kitti.txt.