Training: Nan or 0 mAP, Detectnet_v2, Training on Tutorial Spec

I am working on Amazon AWS in an EC2 instance. My dataset is Caltech-Birds-201, padded to 512x512. My training loss collapses to below 0.00001, my mAP starts out as NaN and eventually becomes 0, and I get output like this:

Validation cost: -0.000010
Mean average_precision (in %): 0.0000

class name      average precision (in %)
------------  --------------------------
bird                                   0

Median Inference Time: 0.014849
2021-01-16 13:01:12,535 [INFO] /usr/local/lib/python3.6/dist-packages/modulus/hooks/task_progress_monitor_hook.pyc: Epoch 71/80: loss: 0.00001 Time taken: 0:05:19.982588 ETA: 0:47:59.843290
2021-01-16 13:01:24,731 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 4.704

Epoch 79/80:

2021-01-16 13:44:47,171 [INFO] /usr/local/lib/python3.6/dist-packages/modulus/hooks/task_progress_monitor_hook.pyc: Epoch 79/80: loss: 0.00001 Time taken: 0:04:11.581900 ETA: 0:04:11.581900

KITTI Config:

kitti_config {
  root_directory_path: "/data"
  image_dir_name: "images"
  label_dir_name: "labels"
  image_extension: ".jpeg"
  partition_mode: "random"
  num_partitions: 2
  val_split: 20
  num_shards: 10
}
image_directory_path: "/data/images"

Example KITTI label:

bird 0.0 0 0.0 60.0 27.0 385.0 331.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
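As a sanity check on the labels, here is a minimal sketch (the parsing helper is my own, not a TLT utility) that splits a KITTI label line into its 15 fields and verifies the bounding box fits inside the padded 512x512 image:

```python
# Minimal KITTI label sanity check (sketch; the helper name is mine).
# KITTI field order: class, truncation, occlusion, alpha,
# bbox left/top/right/bottom, then 3D dimensions/location/rotation.
def parse_kitti_line(line):
    """Split one KITTI label line and return (class_name, bbox)."""
    fields = line.split()
    assert len(fields) == 15, f"expected 15 fields, got {len(fields)}"
    cls = fields[0]
    x1, y1, x2, y2 = map(float, fields[4:8])
    return cls, (x1, y1, x2, y2)

line = "bird 0.0 0 0.0 60.0 27.0 385.0 331.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0"
cls, (x1, y1, x2, y2) = parse_kitti_line(line)
# Box must be non-degenerate and inside the 512x512 padded frame.
assert cls == "bird"
assert 0.0 <= x1 < x2 <= 512.0 and 0.0 <= y1 < y2 <= 512.0
print(cls, x2 - x1, y2 - y1)  # class, box width, box height
```

Running this over every label file would quickly surface malformed lines or out-of-bounds boxes.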

Spec File:

model_config {
  arch: "resnet"
  pretrained_model_file: "/data/tlt_resnet50_detectnetv2_v1/resnet50.hdf5"
  freeze_blocks: 0
  freeze_blocks: 1
  all_projections: True
  num_layers: 18
  use_pooling: False
  use_batch_norm: True
  dropout_rate: 0.0
  training_precision: {
    backend_floatx: FLOAT32
  }
  objective_set: {
    cov {}
    bbox {
      scale: 35.0
      offset: 0.5
    }
  }
}

bbox_rasterizer_config {
  target_class_config {
    key: "bird"
    value: {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 0.4
      cov_radius_y: 0.4
      bbox_min_radius: 1.0
    }
  }
  deadzone_radius: 0.67
}

postprocessing_config {
  target_class_config {
    key: "bird"
    value: {
      clustering_config {
        coverage_threshold: 0.005
        dbscan_eps: 0.15
        dbscan_min_samples: 0.05
        minimum_bounding_box_height: 20
      }
    }
  }
}

cost_function_config {
  target_classes {
    name: "bird"
    class_weight: 1.0
    coverage_foreground_weight: 0.05
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 10.0
    }
  }
  enable_autoweighting: True
  max_objective_weight: 0.9999
  min_objective_weight: 0.0001
}

training_config {
  batch_size_per_gpu: 16
  num_epochs: 80
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 5e-6
      max_learning_rate: 5e-4
      soft_start: 0.1
      annealing: 0.7
    }
  }
  regularizer {
    type: L1
    weight: 3e-9
  }
  optimizer {
    adam {
      epsilon: 1e-08
      beta1: 0.9
      beta2: 0.999
    }
  }
  cost_scaling {
    enabled: False
    initial_exponent: 20.0
    increment: 0.005
    decrement: 1.0
  }
}

augmentation_config {
  preprocessing {
    output_image_width: 960
    output_image_height: 544
    output_image_channel: 3
    min_bbox_width: 1.0
    min_bbox_height: 1.0
  }
  spatial_augmentation {
    hflip_probability: 0.5
    vflip_probability: 0.0
    zoom_min: 1.0
    zoom_max: 1.0
    translate_max_x: 8.0
    translate_max_y: 8.0
  }
  color_augmentation {
    color_shift_stddev: 0.0
    hue_rotation_max: 25.0
    saturation_shift_max: 0.2
    contrast_scale_max: 0.1
    contrast_center: 0.5
  }
}

evaluation_config {
  average_precision_mode: INTEGRATE
  validation_period_during_training: 10
  first_validation_epoch: 1
  minimum_detection_ground_truth_overlap {
    key: "bird"
    value: 0.7
  }
  evaluation_box_config {
    key: "bird"
    value {
      minimum_height: 4
      maximum_height: 9999
      minimum_width: 4
      maximum_width: 9999
    }
  }
}

dataset_config {
  data_sources: {
    tfrecords_path: "/data/tfrecords/*"
    image_directory_path: "/data/images/"
  }
  image_extension: "jpg"
  target_class_mapping {
    key: "bird"
    value: "bird"
  }
  validation_fold: 0
}
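One thing I want to rule out myself: my kitti_config sets image_extension ".jpeg" while dataset_config sets "jpg". A small sketch to count how many files each extension actually matches (the helper is mine; the directory path is the one from my spec):

```python
# Count image files per extension to spot a config mismatch
# (helper name is my own; run with the path from the spec, e.g. "/data/images").
import glob
import os

def count_images(image_dir, exts=(".jpeg", ".jpg")):
    """Return {extension: number of matching files} for image_dir."""
    return {ext: len(glob.glob(os.path.join(image_dir, "*" + ext)))
            for ext in exts}

# Example on the training machine:
# print(count_images("/data/images"))
```

If one of the two counts is zero, the corresponding config block is pointing at images that do not exist under that extension.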

I suspect the problem is in my spec file. I am rather new to the TLT and am trying to learn by using my own single-class dataset. Could someone have a look and advise me on the best course of action?

Hi @Sneaky_Turtle,

This issue is not related to TensorRT. Please post your concern in the TLT forum.

Thank you.