0 mAP over 50 Epochs while training TLT DetectNet_v2 MobileNet_v2

Hi, I am trying to train the MobileNet_v2 pretrained model for DetectNet_v2 (mobilenet_v2.hdf5) using tlt-train. After running the tlt-train command, the mAP stays at 0 over 120 epochs. I created the config file by following the guide here (https://docs.nvidia.com/metropolis/TLT/tlt-getting-started-guide/index.html#spec_file_gridbox_topic). For reference, I have already trained and exported TrafficCamNet successfully and deployed it on a Jetson AGX Xavier. I am not sure whether something is wrong with MobileNet itself or with my config. I am attaching the command line, the config file, and the log below. I appreciate your help. Thank you.

Command line

tlt-train detectnet_v2 -e /workspace/models/mobile_v2_config.txt -r /workspace/result_mobile_lr -k sycho --gpus 1

Config file

model_config {
arch: "mobilenet_v2"
pretrained_model_file: "/workspace/models/mobilenet_v2.hdf5"
freeze_blocks: 0
freeze_blocks: 1
all_projections: True
use_pooling: False
use_batch_norm: True
dropout_rate: 0.0
training_precision: {
backend_floatx: FLOAT32
}
objective_set: {
cov {}
bbox {
scale: 35.0
offset: 0.5
}
}
}

training_config {
batch_size_per_gpu: 8
num_epochs: 120
enable_qat: true
learning_rate {
soft_start_annealing_schedule {
min_learning_rate: 5e-6
max_learning_rate: 5e-4
soft_start: 0.1
annealing: 0.7
}
}
regularizer {
type: L1
weight: 3e-9
}
optimizer {
adam {
epsilon: 1e-08
beta1: 0.9
beta2: 0.999
}
}
cost_scaling {
enabled: False
initial_exponent: 20.0
increment: 0.005
decrement: 1.0
}
}

bbox_rasterizer_config {
target_class_config {
key: "car"
value: {
cov_center_x: 0.5
cov_center_y: 0.5
cov_radius_x: 0.4
cov_radius_y: 0.4
bbox_min_radius: 1.0
}
}
deadzone_radius: 0.67
}

augmentation_config {
preprocessing {
output_image_width: 960
output_image_height: 544
output_image_channel: 3
min_bbox_width: 1.0
min_bbox_height: 1.0
}
spatial_augmentation {

hflip_probability: 0.5
vflip_probability: 0.0
zoom_min: 1.0
zoom_max: 1.0
translate_max_x: 8.0
translate_max_y: 8.0

}
color_augmentation {
color_shift_stddev: 0.0
hue_rotation_max: 25.0
saturation_shift_max: 0.2
contrast_scale_max: 0.1
contrast_center: 0.5
}
}

evaluation_config {
average_precision_mode: INTEGRATE
validation_period_during_training: 10
first_validation_epoch: 1
minimum_detection_ground_truth_overlap {
key: "car"
value: 0.7
}

evaluation_box_config {
key: "car"
value {
minimum_height: 4
maximum_height: 9999
minimum_width: 4
maximum_width: 9999
}
}
}

postprocessing_config {
target_class_config {
key: "car"
value: {
clustering_config {
coverage_threshold: 0.005
dbscan_eps: 0.15
dbscan_min_samples: 0.05
minimum_bounding_box_height: 20
}
}
}
}

cost_function_config {
target_classes {
name: "car"
class_weight: 1.0
coverage_foreground_weight: 0.05
objectives {
name: "cov"
initial_weight: 1.0
weight_target: 1.0
}
objectives {
name: "bbox"
initial_weight: 10.0
weight_target: 1.0
}
}
enable_autoweighting: True
max_objective_weight: 0.9999
min_objective_weight: 0.0001
}

dataset_config {
data_sources: {
tfrecords_path: "/workspace/data/tf_records/*"
image_directory_path: "/workspace/data/day_all"
}
image_extension: "jpg"
target_class_mapping {
key: "car"
value: "car"
}
validation_fold: 0
}
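
As a sanity check on the learning_rate settings above, the soft_start_annealing_schedule can be sketched roughly as follows. This is only an illustration: it assumes the rate ramps from min_learning_rate to max_learning_rate over the first soft_start fraction of training, holds until the annealing fraction, then ramps back down, and it assumes the interpolation is done in log space. The function name is hypothetical and the actual TLT implementation may differ.

```python
import math


def soft_start_annealing_lr(progress, min_lr=5e-6, max_lr=5e-4,
                            soft_start=0.1, annealing=0.7):
    """Learning rate at a training progress in [0, 1].

    Hypothetical helper; log-space interpolation is an assumption,
    not a confirmed detail of tlt-train.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be in [0, 1]")
    if progress < soft_start:
        # Ramp up from min_lr towards max_lr.
        alpha = progress / soft_start
    elif progress < annealing:
        # Hold at max_lr.
        alpha = 1.0
    else:
        # Ramp back down to min_lr by the end of training.
        alpha = (1.0 - progress) / (1.0 - annealing)
    log_min, log_max = math.log(min_lr), math.log(max_lr)
    return math.exp(log_min + alpha * (log_max - log_min))


if __name__ == "__main__":
    for p in (0.0, 0.05, 0.1, 0.5, 0.7, 0.85, 1.0):
        print(f"progress={p:.2f} lr={soft_start_annealing_lr(p):.2e}")
```

Under these assumptions the rate sits at max_learning_rate between 10% and 70% of training, so most epochs run at 5e-4.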

Log

Using TensorFlow backend.
2020-09-23 09:43:34,925 [INFO] iva.detectnet_v2.scripts.train: Loading experiment spec at /workspace/models/mobile_v2_config.txt.
2020-09-23 09:43:34,927 [INFO] iva.detectnet_v2.spec_handler.spec_loader: Merging specification from /workspace/models/mobile_v2_config.txt
Traceback (most recent call last):
File "/usr/local/bin/tlt-train-g1", line 8, in
sys.exit(main())
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infr$
/iva/common/magnet_train.py", line 55, in main
File "", line 2, in main
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infr$
/iva/detectnet_v2/utilities/timer.py", line 46, in wrapped_fn
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infr$
/iva/detectnet_v2/scripts/train.py", line 773, in main
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infr$
/iva/detectnet_v2/scripts/train.py", line 680, in run_experiment
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infr$
/iva/detectnet_v2/model/utilities.py", line 81, in get_pretrained_model_path
AssertionError: Pretrained model file not found: /workspace/models/mobilenet_v2.tlt
root@7293f29a2091:/workspace# tlt-train detectnet_v2 -e /workspace/models/mobile_v2_config.txt -r /workspace/result_mobile -k sycho --gpus 1
Using TensorFlow backend.
2020-09-23 09:43:55,108 [INFO] iva.detectnet_v2.scripts.train: Loading experiment spec at /workspace/models/mobile_v2_config.txt.
2020-09-23 09:43:55,111 [INFO] iva.detectnet_v2.spec_handler.spec_loader: Merging specification from /workspace/models/mobile_v2_config.txt
2020-09-23 09:43:55,211 [INFO] iva.detectnet_v2.scripts.train: Cannot iterate over exactly 630 samples with a batch size of 8; each epoch will therefore take one extra step.


Layer (type) Output Shape Param # Connected to

input_1 (InputLayer) (None, 3, 544, 960) 0


conv1_pad (ZeroPadding2D) (None, 3, 546, 962) 0 input_1[0][0]


conv1 (Conv2D) (None, 32, 272, 480) 864 conv1_pad[0][0]


bn_conv1 (BatchNormalization) (None, 32, 272, 480) 128 conv1[0][0]


re_lu_1 (ReLU) (None, 32, 272, 480) 0 bn_conv1[0][0]


expanded_conv_depthwise_pad (Ze (None, 32, 274, 482) 0 re_lu_1[0][0]


expanded_conv_depthwise (Depthw (None, 32, 272, 480) 288 expanded_conv_depthwise_pad[0][0]


expanded_conv_depthwise_bn (Bat (None, 32, 272, 480) 128 expanded_conv_depthwise[0][0]


expanded_conv_relu (ReLU) (None, 32, 272, 480) 0 expanded_conv_depthwise_bn[0][0]


expanded_conv_project (Conv2D) (None, 16, 272, 480) 512 expanded_conv_relu[0][0]


expanded_conv_project_bn (Batch (None, 16, 272, 480) 64 expanded_conv_project[0][0]


block_1_expand (Conv2D) (None, 96, 272, 480) 1536 expanded_conv_project_bn[0][0]


block_1_expand_bn (BatchNormali (None, 96, 272, 480) 384 block_1_expand[0][0]


re_lu_2 (ReLU) (None, 96, 272, 480) 0 block_1_expand_bn[0][0]


block_1_depthwise_pad (ZeroPadd (None, 96, 274, 482) 0 re_lu_2[0][0]


block_1_depthwise (DepthwiseCon (None, 96, 136, 240) 864 block_1_depthwise_pad[0][0]


block_1_depthwise_bn (BatchNorm (None, 96, 136, 240) 384 block_1_depthwise[0][0]


block_1_relu (ReLU) (None, 96, 136, 240) 0 block_1_depthwise_bn[0][0]


block_1_project (Conv2D) (None, 24, 136, 240) 2304 block_1_relu[0][0]


block_1_project_bn (BatchNormal (None, 24, 136, 240) 96 block_1_project[0][0]


block_2_expand (Conv2D) (None, 144, 136, 240 3456 block_1_project_bn[0][0]


block_2_expand_bn (BatchNormali (None, 144, 136, 240 576 block_2_expand[0][0]


re_lu_3 (ReLU) (None, 144, 136, 240 0 block_2_expand_bn[0][0]


block_2_depthwise_pad (ZeroPadd (None, 144, 138, 242 0 re_lu_3[0][0]


block_2_depthwise (DepthwiseCon (None, 144, 136, 240 1296 block_2_depthwise_pad[0][0]


block_2_depthwise_bn (BatchNorm (None, 144, 136, 240 576 block_2_depthwise[0][0]


block_2_relu (ReLU) (None, 144, 136, 240 0 block_2_depthwise_bn[0][0]


block_2_project (Conv2D) (None, 24, 136, 240) 3456 block_2_relu[0][0]


block_2_projected_inputs (Conv2 (None, 24, 136, 240) 576 block_1_project_bn[0][0]


block_2_project_bn (BatchNormal (None, 24, 136, 240) 96 block_2_project[0][0]


block_2_add (Add) (None, 24, 136, 240) 0 block_2_projected_inputs[0][0]
block_2_project_bn[0][0]
block_3_expand (Conv2D) (None, 144, 136, 240 3456 block_2_add[0][0]


block_3_expand_bn (BatchNormali (None, 144, 136, 240 576 block_3_expand[0][0]


re_lu_4 (ReLU) (None, 144, 136, 240 0 block_3_expand_bn[0][0]


block_3_depthwise_pad (ZeroPadd (None, 144, 138, 242 0 re_lu_4[0][0]


block_3_depthwise (DepthwiseCon (None, 144, 68, 120) 1296 block_3_depthwise_pad[0][0]


block_3_depthwise_bn (BatchNorm (None, 144, 68, 120) 576 block_3_depthwise[0][0]


block_3_relu (ReLU) (None, 144, 68, 120) 0 block_3_depthwise_bn[0][0]


block_3_project (Conv2D) (None, 32, 68, 120) 4608 block_3_relu[0][0]


block_3_project_bn (BatchNormal (None, 32, 68, 120) 128 block_3_project[0][0]


block_4_expand (Conv2D) (None, 192, 68, 120) 6144 block_3_project_bn[0][0]


block_4_expand_bn (BatchNormali (None, 192, 68, 120) 768 block_4_expand[0][0]


re_lu_5 (ReLU) (None, 192, 68, 120) 0 block_4_expand_bn[0][0]


block_4_depthwise_pad (ZeroPadd (None, 192, 70, 122) 0 re_lu_5[0][0]


block_4_depthwise (DepthwiseCon (None, 192, 68, 120) 1728 block_4_depthwise_pad[0][0]


block_4_depthwise_bn (BatchNorm (None, 192, 68, 120) 768 block_4_depthwise[0][0]


block_4_relu (ReLU) (None, 192, 68, 120) 0 block_4_depthwise_bn[0][0]


block_4_project (Conv2D) (None, 32, 68, 120) 6144 block_4_relu[0][0]


block_4_projected_inputs (Conv2 (None, 32, 68, 120) 1024 block_3_project_bn[0][0]


block_4_project_bn (BatchNormal (None, 32, 68, 120) 128 block_4_project[0][0]


block_4_add (Add) (None, 32, 68, 120) 0 block_4_projected_inputs[0][0]
block_4_project_bn[0][0]


block_5_expand (Conv2D) (None, 192, 68, 120) 6144 block_4_add[0][0]


block_5_expand_bn (BatchNormali (None, 192, 68, 120) 768 block_5_expand[0][0]


re_lu_6 (ReLU) (None, 192, 68, 120) 0 block_5_expand_bn[0][0]


block_5_depthwise_pad (ZeroPadd (None, 192, 70, 122) 0 re_lu_6[0][0]


block_5_depthwise (DepthwiseCon (None, 192, 68, 120) 1728 block_5_depthwise_pad[0][0]


block_5_depthwise_bn (BatchNorm (None, 192, 68, 120) 768 block_5_depthwise[0][0]


block_5_relu (ReLU) (None, 192, 68, 120) 0 block_5_depthwise_bn[0][0]


block_5_project (Conv2D) (None, 32, 68, 120) 6144 block_5_relu[0][0]
block_5_projected_inputs (Conv2 (None, 32, 68, 120) 1024 block_4_add[0][0]


block_5_project_bn (BatchNormal (None, 32, 68, 120) 128 block_5_project[0][0]


block_5_add (Add) (None, 32, 68, 120) 0 block_5_projected_inputs[0][0]
block_5_project_bn[0][0]


block_6_expand (Conv2D) (None, 192, 68, 120) 6144 block_5_add[0][0]


block_6_expand_bn (BatchNormali (None, 192, 68, 120) 768 block_6_expand[0][0]


re_lu_7 (ReLU) (None, 192, 68, 120) 0 block_6_expand_bn[0][0]


block_6_depthwise_pad (ZeroPadd (None, 192, 70, 122) 0 re_lu_7[0][0]


block_6_depthwise (DepthwiseCon (None, 192, 34, 60) 1728 block_6_depthwise_pad[0][0]


block_6_depthwise_bn (BatchNorm (None, 192, 34, 60) 768 block_6_depthwise[0][0]


block_6_relu (ReLU) (None, 192, 34, 60) 0 block_6_depthwise_bn[0][0]


block_6_project (Conv2D) (None, 64, 34, 60) 12288 block_6_relu[0][0]


block_6_project_bn (BatchNormal (None, 64, 34, 60) 256 block_6_project[0][0]


block_7_expand (Conv2D) (None, 384, 34, 60) 24576 block_6_project_bn[0][0]


block_7_expand_bn (BatchNormali (None, 384, 34, 60) 1536 block_7_expand[0][0]


re_lu_8 (ReLU) (None, 384, 34, 60) 0 block_7_expand_bn[0][0]


block_7_depthwise_pad (ZeroPadd (None, 384, 36, 62) 0 re_lu_8[0][0]


block_7_depthwise (DepthwiseCon (None, 384, 34, 60) 3456 block_7_depthwise_pad[0][0]


block_7_depthwise_bn (BatchNorm (None, 384, 34, 60) 1536 block_7_depthwise[0][0]


block_7_relu (ReLU) (None, 384, 34, 60) 0 block_7_depthwise_bn[0][0]


block_7_project (Conv2D) (None, 64, 34, 60) 24576 block_7_relu[0][0]


block_7_projected_inputs (Conv2 (None, 64, 34, 60) 4096 block_6_project_bn[0][0]


block_7_project_bn (BatchNormal (None, 64, 34, 60) 256 block_7_project[0][0]


block_7_add (Add) (None, 64, 34, 60) 0 block_7_projected_inputs[0][0]
block_7_project_bn[0][0]


block_8_expand (Conv2D) (None, 384, 34, 60) 24576 block_7_add[0][0]


block_8_expand_bn (BatchNormali (None, 384, 34, 60) 1536 block_8_expand[0][0]


re_lu_9 (ReLU) (None, 384, 34, 60) 0 block_8_expand_bn[0][0]


block_8_depthwise_pad (ZeroPadd (None, 384, 36, 62) 0 re_lu_9[0][0]


block_8_depthwise (DepthwiseCon (None, 384, 34, 60) 3456 block_8_depthwise_pad[0][0]
block_8_relu (ReLU) (None, 384, 34, 60) 0 block_8_depthwise_bn[0][0]


block_8_project (Conv2D) (None, 64, 34, 60) 24576 block_8_relu[0][0]


block_8_projected_inputs (Conv2 (None, 64, 34, 60) 4096 block_7_add[0][0]


block_8_project_bn (BatchNormal (None, 64, 34, 60) 256 block_8_project[0][0]


block_8_add (Add) (None, 64, 34, 60) 0 block_8_projected_inputs[0][0]
block_8_project_bn[0][0]


block_9_expand (Conv2D) (None, 384, 34, 60) 24576 block_8_add[0][0]


block_9_expand_bn (BatchNormali (None, 384, 34, 60) 1536 block_9_expand[0][0]


re_lu_10 (ReLU) (None, 384, 34, 60) 0 block_9_expand_bn[0][0]


block_9_depthwise_pad (ZeroPadd (None, 384, 36, 62) 0 re_lu_10[0][0]


block_9_depthwise (DepthwiseCon (None, 384, 34, 60) 3456 block_9_depthwise_pad[0][0]


block_9_depthwise_bn (BatchNorm (None, 384, 34, 60) 1536 block_9_depthwise[0][0]


block_9_relu (ReLU) (None, 384, 34, 60) 0 block_9_depthwise_bn[0][0]


block_9_project (Conv2D) (None, 64, 34, 60) 24576 block_9_relu[0][0]


block_9_projected_inputs (Conv2 (None, 64, 34, 60) 4096 block_8_add[0][0]


block_9_project_bn (BatchNormal (None, 64, 34, 60) 256 block_9_project[0][0]


block_9_add (Add) (None, 64, 34, 60) 0 block_9_projected_inputs[0][0]
block_9_project_bn[0][0]


block_10_expand (Conv2D) (None, 384, 34, 60) 24576 block_9_add[0][0]


block_10_expand_bn (BatchNormal (None, 384, 34, 60) 1536 block_10_expand[0][0]


re_lu_11 (ReLU) (None, 384, 34, 60) 0 block_10_expand_bn[0][0]


block_10_depthwise_pad (ZeroPad (None, 384, 36, 62) 0 re_lu_11[0][0]


block_10_depthwise (DepthwiseCo (None, 384, 34, 60) 3456 block_10_depthwise_pad[0][0]


block_10_depthwise_bn (BatchNor (None, 384, 34, 60) 1536 block_10_depthwise[0][0]


block_10_relu (ReLU) (None, 384, 34, 60) 0 block_10_depthwise_bn[0][0]


block_10_project (Conv2D) (None, 96, 34, 60) 36864 block_10_relu[0][0]


block_10_project_bn (BatchNorma (None, 96, 34, 60) 384 block_10_project[0][0]


block_11_expand (Conv2D) (None, 576, 34, 60) 55296 block_10_project_bn[0][0]


block_11_expand_bn (BatchNormal (None, 576, 34, 60) 2304 block_11_expand[0][0]


re_lu_12 (ReLU) (None, 576, 34, 60) 0 block_11_expand_bn[0][0]


block_11_depthwise_pad (ZeroPad (None, 576, 36, 62) 0 re_lu_12[0][0]


block_11_depthwise (DepthwiseCo (None, 576, 34, 60) 5184 block_11_depthwise_pad[0][0]


block_11_depthwise_bn (BatchNor (None, 576, 34, 60) 2304 block_11_depthwise[0][0]


block_11_relu (ReLU) (None, 576, 34, 60) 0 block_11_depthwise_bn[0][0]


block_11_project (Conv2D) (None, 96, 34, 60) 55296 block_11_relu[0][0]


block_11_projected_inputs (Conv (None, 96, 34, 60) 9216 block_10_project_bn[0][0]


block_11_project_bn (BatchNorma (None, 96, 34, 60) 384 block_11_project[0][0]


block_11_add (Add) (None, 96, 34, 60) 0 block_11_projected_inputs[0][0]
block_11_project_bn[0][0]


block_12_expand (Conv2D) (None, 576, 34, 60) 55296 block_11_add[0][0]


block_12_expand_bn (BatchNormal (None, 576, 34, 60) 2304 block_12_expand[0][0]


re_lu_13 (ReLU) (None, 576, 34, 60) 0 block_12_expand_bn[0][0]


block_12_depthwise_pad (ZeroPad (None, 576, 36, 62) 0 re_lu_13[0][0]


block_12_depthwise (DepthwiseCo (None, 576, 34, 60) 5184 block_12_depthwise_pad[0][0]


block_12_depthwise_bn (BatchNor (None, 576, 34, 60) 2304 block_12_depthwise[0][0]


block_12_relu (ReLU) (None, 576, 34, 60) 0 block_12_depthwise_bn[0][0]


block_12_project (Conv2D) (None, 96, 34, 60) 55296 block_12_relu[0][0]


block_12_projected_inputs (Conv (None, 96, 34, 60) 9216 block_11_add[0][0]


block_12_project_bn (BatchNorma (None, 96, 34, 60) 384 block_12_project[0][0]


block_12_add (Add) (None, 96, 34, 60) 0 block_12_projected_inputs[0][0]
block_12_project_bn[0][0]


output_bbox (Conv2D) (None, 4, 34, 60) 388 block_12_add[0][0]


output_cov (Conv2D) (None, 1, 34, 60) 97 block_12_add[0][0]

Total params: 592,485
Trainable params: 574,693
Non-trainable params: 17,792


2020-09-23 09:44:19,254 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Serial augmentation enabled = False
2020-09-23 09:44:19,254 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Pseudo sharding enabled = False
2020-09-23 09:44:19,254 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Max Image Dimensions (all sources): (0, 0)
2020-09-23 09:44:19,254 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: number of cpus: 4, io threads: 8, compute threads: 4, buffered batches: 4
2020-09-23 09:44:19,254 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: total dataset size 630, number of sources: 1, batch size per gpu: 8, steps: 79
d to polygon coordinates.
2020-09-23 09:44:19,778 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: shuffle: True - shard 0 of 1
2020-09-23 09:44:19,787 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: sampling 1 datasets with weights:
2020-09-23 09:44:19,787 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: source: 0 weight: 1.000000
2020-09-23 09:44:20,474 [INFO] iva.detectnet_v2.scripts.train: Found 630 samples in training set
2020-09-23 09:44:24,252 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Serial augmentation enabled = False
2020-09-23 09:44:24,252 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Pseudo sharding enabled = False
2020-09-23 09:44:24,253 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Max Image Dimensions (all sources): (0, 0)
2020-09-23 09:44:24,253 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: number of cpus: 4, io threads: 8, compute threads: 4, buffered batches: 4
2020-09-23 09:44:24,253 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: total dataset size 270, number of sources: 1, batch size per gpu: 8, steps: 34
2020-09-23 09:44:24,302 [INFO] iva.detectnet_v2.dataloader.default_dataloader: Bounding box coordinates were detected in the input specification! Bboxes will be automatically converted to polygon coordinates.
2020-09-23 09:44:24,633 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: shuffle: False - shard 0 of 1
2020-09-23 09:44:24,641 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: sampling 1 datasets with weights:
2020-09-23 09:44:24,641 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: source: 0 weight: 1.000000
2020-09-23 09:44:25,099 [INFO] iva.detectnet_v2.scripts.train: Found 270 samples in validation set
2020-09-23 09:45:30,538 [INFO] /usr/local/lib/python3.6/dist-packages/modulus/hooks/task_progress_monitor_hook.pyc: Epoch 0/80: loss: 0.18326 Time taken: 0:00:00 ETA: 0:00:00
2020-09-23 09:45:30,539 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 1.172
2020-09-23 09:45:47,247 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.499
2020-09-23 09:45:58,143 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 18.357
2020-09-23 09:46:08,984 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 18.448
2020-09-23 09:46:11,116 [INFO] iva.detectnet_v2.evaluation.evaluation: step 0 / 33, 0.00s/step
2020-09-23 09:46:40,000 [INFO] iva.detectnet_v2.evaluation.evaluation: step 10 / 33, 2.89s/step
2020-09-23 09:47:07,359 [INFO] iva.detectnet_v2.evaluation.evaluation: step 20 / 33, 2.74s/step
2020-09-23 09:47:34,436 [INFO] iva.detectnet_v2.evaluation.evaluation: step 30 / 33, 2.71s/step
Matching predictions to ground truth, class 1/1.: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 6864/6864 [00:00<00:00, 10921.24it/s]
Epoch 1/80

Validation cost: 0.009540
Mean average_precision (in %): 0.0000

class name average precision (in %)


car 0

Median Inference Time: 0.017651
2020-09-23 09:47:43,804 [INFO] /usr/local/lib/python3.6/dist-packages/modulus/hooks/task_progress_monitor_hook.pyc: Epoch 1/80: loss: 0.01008 Time taken: 0:02:19.580743 ETA: 3:03:46.878684
2020-09-23 09:47:52,528 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 1.932
2020-09-23 09:48:03,706 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 17.892
2020-09-23 09:48:15,014 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 17.688
2020-09-23 09:48:18,988 [INFO] /usr/local/lib/python3.6/dist-packages/modulus/hooks/task_progress_monitor_hook.pyc: Epoch 2/80: loss: 0.01175 Time taken: 0:00:35.246779 ETA: 0:45:49.248741
2020-09-23 09:48:26,218 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 17.851
2020-09-23 09:48:37,359 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 17.953
2020-09-23 09:48:48,362 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 18.177
2020-09-23 09:48:53,997 [INFO] /usr/local/lib/python3.6/dist-packages/modulus/hooks/task_progress_monitor_hook.pyc: Epoch 3/80: loss: 0.01263 Time taken: 0:00:35.033285 ETA: 0:44:57.562956
2020-09-23 09:48:59,190 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 18.472
2020-09-23 09:49:10,280 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 18.036
2020-09-23 09:49:21,425 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 17.945
2020-09-23 09:49:29,018 [INFO] /usr/local/lib/python3.6/dist-packages/modulus/hooks/task_progress_monitor_hook.pyc: Epoch 4/80: loss: 0.00813 Time taken: 0:00:34.996720 ETA: 0:44:19.750726
2020-09-23 09:49:32,591 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 17.913
2020-09-23 09:49:43,910 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 17.670
2020-09-23 09:49:55,128 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 17.829
2020-09-23 09:50:04,484 [INFO] /usr/local/lib/python3.6/dist-packages/modulus/hooks/task_progress_monitor_hook.pyc: Epoch 5/80: loss: 0.00712 Time taken: 0:00:35.447932 ETA: 0:44:18.594865
2020-09-23 09:50:06,310 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 17.886
2020-09-23 09:50:17,359 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 18.104
2020-09-23 09:50:28,399 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 18.116
2020-09-23 09:50:39,556 [INFO] /usr/local/lib/python3.6/dist-packages/modulus/hooks/task_progress_monitor_hook.pyc: Epoch 6/80: loss: 0.00435 Time taken: 0:00:35.067025 ETA: 0:43:14.959864
2020-09-23 09:50:39,556 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 17.926
2020-09-23 09:50:50,602 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 18.108
2020-09-23 09:51:01,731 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 17.971
2020-09-23 09:51:12,790 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 18.085
2020-09-23 09:51:14,609 [INFO] /usr/local/lib/python3.6/dist-packages/modulus/hooks/task_progress_monitor_hook.pyc: Epoch 7/80: loss: 0.00237 Time taken: 0:00:35.048755 ETA: 0:42:38.559093
2020-09-23 09:51:23,846 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 18.091
2020-09-23 09:51:34,991 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 17.946
2020-09-23 09:51:46,040 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 18.103
2020-09-23 09:51:49,611 [INFO] /usr/local/lib/python3.6/dist-packages/modulus/hooks/task_progress_monitor_hook.pyc: Epoch 8/80: loss: 0.00084 Time taken: 0:00:35.014211 ETA: 0:42:01.023222
2020-09-23 09:51:57,094 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 18.093
2020-09-23 09:52:08,209 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 17.995
2020-09-23 09:52:19,296 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 18.040
2020-09-23 09:52:24,605 [INFO] /usr/local/lib/python3.6/dist-packages/modulus/hooks/task_progress_monitor_hook.pyc: Epoch 9/80: loss: 0.00085 Time taken: 0:00:35.023525 ETA: 0:41:26.670258
2020-09-23 09:52:30,518 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 17.823
2020-09-23 09:52:41,556 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 18.120
2020-09-23 09:52:52,559 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 18.177
2020-09-23 09:53:01,307 [INFO] /usr/local/lib/python3.6/dist-packages/modulus/hooks/task_progress_monitor_hook.pyc: Epoch 10/80: loss: 0.00082 Time taken: 0:00:36.697000 ETA: 0:42:48.789985
2020-09-23 09:53:05,336 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 15.653
2020-09-23 09:53:16,557 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 17.826
2020-09-23 09:53:27,764 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 17.847
2020-09-23 09:53:36,188 [INFO] iva.detectnet_v2.evaluation.evaluation: step 0 / 33, 0.00s/step
2020-09-23 09:53:51,062 [INFO] iva.detectnet_v2.evaluation.evaluation: step 10 / 33, 1.49s/step
2020-09-23 09:54:05,819 [INFO] iva.detectnet_v2.evaluation.evaluation: step 20 / 33, 1.48s/step
2020-09-23 09:54:20,485 [INFO] iva.detectnet_v2.evaluation.evaluation: step 30 / 33, 1.47s/step
Epoch 11/80

Validation cost: 0.002639
Mean average_precision (in %): 0.0000

class name average precision (in %)


car 0
Median Inference Time: 0.016214
2020-09-23 09:54:25,404 [INFO] /usr/local/lib/python3.6/dist-packages/modulus/hooks/task_progress_monitor_hook.pyc: Epoch 11/80: loss: 0.00093 Time taken: 0:01:24.045210 ETA: 1:36:39.119466
2020-09-23 09:54:27,651 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 3.340
2020-09-23 09:54:38,731 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 18.051
2020-09-23 09:54:50,061 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 17.653
2020-09-23 09:55:00,918 [INFO] /usr/local/lib/python3.6/dist-packages/modulus/hooks/task_progress_monitor_hook.pyc: Epoch 12/80: loss: 0.00059 Time taken: 0:00:35.541067 ETA: 0:40:16.792548
2020-09-23 09:55:01,347 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 17.722
2020-09-23 09:55:12,603 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 17.769
2020-09-23 09:55:23,766 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 17.917
2020-09-23 09:55:34,679 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 18.329
2020-09-23 09:55:35,990 [INFO] /usr/local/lib/python3.6/dist-packages/modulus/hooks/task_progress_monitor_hook.pyc: Epoch 13/80: loss: 0.00071 Time taken: 0:00:35.101220 ETA: 0:39:11.781749
2020-09-23 09:55:45,698 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 18.150
2020-09-23 09:55:56,674 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 18.223
2020-09-23 09:56:07,776 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 18.016
2020-09-23 09:56:10,847 [INFO] /usr/local/lib/python3.6/dist-packages/modulus/hooks/task_progress_monitor_hook.pyc: Epoch 14/80: loss: 0.00061 Time taken: 0:00:34.828262 ETA: 0:38:18.665282
2020-09-23 09:56:18,880 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 18.012
2020-09-23 09:56:29,919 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 18.118
2020-09-23 09:56:41,122 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 17.853
2020-09-23 09:56:46,071 [INFO] /usr/local/lib/python3.6/dist-packages/modulus/hooks/task_progress_monitor_hook.pyc: Epoch 15/80: loss: 0.00078 Time taken: 0:00:35.244284 ETA: 0:38:10.878485
2020-09-23 09:56:52,284 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 17.918
2020-09-23 09:57:03,374 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 18.035
2020-09-23 09:57:14,467 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 18.030
2020-09-23 09:57:21,046 [INFO] /usr/local/lib/python3.6/dist-packages/modulus/hooks/task_progress_monitor_hook.pyc: Epoch 16/80: loss: 0.00089 Time taken: 0:00:34.942158 ETA: 0:37:16.298096
2020-09-23 09:57:25,410 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 18.277
2020-09-23 09:57:36,475 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 18.076
2020-09-23 09:57:47,634 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 17.925
2020-09-23 09:57:56,111 [INFO] /usr/local/lib/python3.6/dist-packages/modulus/hooks/task_progress_monitor_hook.pyc: Epoch 17/80: loss: 0.00069 Time taken: 0:00:35.072267 ETA: 0:36:49.552840
2020-09-23 09:57:58,730 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 18.025
2020-09-23 09:58:09,920 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 17.874
2020-09-23 09:58:21,016 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 18.024
2020-09-23 09:58:31,135 [INFO] /usr/local/lib/python3.6/dist-packages/modulus/hooks/task_progress_monitor_hook.pyc: Epoch 18/80: loss: 0.00067 Time taken: 0:00:35.048826 ETA: 0:36:13.027240
2020-09-23 09:58:31,998 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 18.212
2020-09-23 09:58:43,121 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 17.982
2020-09-23 09:58:54,269 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 17.942
2020-09-23 09:59:05,328 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 18.085
2020-09-23 09:59:06,199 [INFO] /usr/local/lib/python3.6/dist-packages/modulus/hooks/task_progress_monitor_hook.pyc: Epoch 19/80: loss: 0.00066 Time taken: 0:00:35.064729 ETA: 0:35:38.948453
2020-09-23 09:59:16,484 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 17.928
2020-09-23 09:59:27,493 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 18.167
2020-09-23 09:59:38,485 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 18.197
2020-09-23 09:59:42,666 [INFO] /usr/local/lib/python3.6/dist-packages/modulus/hooks/task_progress_monitor_hook.pyc: Epoch 20/80: loss: 0.00072 Time taken: 0:00:36.439418 ETA: 0:36:26.365070
2020-09-23 09:59:51,065 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 15.899
2020-09-23 10:00:02,255 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 17.874
2020-09-23 10:00:13,371 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 17.993
2020-09-23 10:00:17,297 [INFO] iva.detectnet_v2.evaluation.evaluation: step 0 / 33, 0.00s/step
2020-09-23 10:00:31,883 [INFO] iva.detectnet_v2.evaluation.evaluation: step 10 / 33, 1.46s/step
2020-09-23 10:00:46,720 [INFO] iva.detectnet_v2.evaluation.evaluation: step 20 / 33, 1.48s/step
2020-09-23 10:01:01,567 [INFO] iva.detectnet_v2.evaluation.evaluation: step 30 / 33, 1.48s/step
Epoch 21/80

Validation cost: 0.001487
Mean average_precision (in %): 0.0000

class name average precision (in %)


car 0

Median Inference Time: 0.016736
2020-09-23 10:01:06,552 [INFO] /usr/local/lib/python3.6/dist-packages/modulus/hooks/task_progress_monitor_hook.pyc: Epoch 21/80: loss: 0.00067 Time taken: 0:01:23.846728 ETA: 1:22:26.956971
2020-09-23 10:01:13,175 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 3.344
2020-09-23 10:01:24,340 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 17.913
2020-09-23 10:01:35,585 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 17.787
2020-09-23 10:01:41,952 [INFO] /usr/local/lib/python3.6/dist-packages/modulus/hooks/task_progress_monitor_hook.pyc: Epoch 22/80: loss: 0.00047 Time taken: 0:00:35.437569 ETA: 0:34:15.379010
2020-09-23 10:01:46,820 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 17.802
2020-09-23 10:01:57,917 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 18.023
2020-09-23 10:02:09,101 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 17.885
2020-09-23 10:02:17,082 [INFO] /usr/local/lib/python3.6/dist-packages/modulus/hooks/task_progress_monitor_hook.pyc: Epoch 23/80: loss: 0.00052 Time taken: 0:00:35.134397 ETA: 0:33:22.660644
2020-09-23 10:02:20,128 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 18.138
2020-09-23 10:02:31,116 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 18.203
2020-09-23 10:02:42,140 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 18.142
2020-09-23 10:02:51,820 [INFO] /usr/local/lib/python3.6/dist-packages/modulus/hooks/task_progress_monitor_hook.pyc: Epoch 24/80: loss: 0.00065 Time taken: 0:00:34.753177 ETA: 0:32:26.177935
2020-09-23 10:02:53,184 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 18.110
2020-09-23 10:03:04,347 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 17.917
2020-09-23 10:03:15,399 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 18.097
2020-09-23 10:03:26,390 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 18.198
2020-09-23 10:03:26,852 [INFO] /usr/local/lib/python3.6/dist-packages/modulus/hooks/task_progress_monitor_hook.pyc: Epoch 25/80: loss: 0.00050 Time taken: 0:00:34.998227 ETA: 0:32:04.902478
2020-09-23 10:03:37,513 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 17.982
2020-09-23 10:03:48,404 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 18.364
2020-09-23 10:03:59,345 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 18.281
2020-09-23 10:04:01,559 [INFO] /usr/local/lib/python3.6/dist-packages/modulus/hooks/task_progress_monitor_hook.pyc: Epoch 26/80: loss: 0.00032 Time taken: 0:00:34.747539 ETA: 0:31:16.367121
2020-09-23 10:04:10,564 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 17.827
2020-09-23 10:04:21,535 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 18.231
2020-09-23 10:04:32,776 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 17.793
2020-09-23 10:04:36,708 [INFO] /usr/local/lib/python3.6/dist-packages/modulus/hooks/task_progress_monitor_hook.pyc: Epoch 27/80: loss: 0.00056 Time taken: 0:00:35.141604 ETA: 0:31:02.504997
2020-09-23 10:04:43,786 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 18.166
2020-09-23 10:04:54,851 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 18.076
2020-09-23 10:05:06,069 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 17.829
2020-09-23 10:05:11,801 [INFO] /usr/local/lib/python3.6/dist-packages/modulus/hooks/task_progress_monitor_hook.pyc: Epoch 28/80: loss: 0.00045 Time taken: 0:00:35.056081 ETA: 0:30:22.916227
2020-09-23 10:05:17,194 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 17.978
2020-09-23 10:05:28,259 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 18.076
2020-09-23 10:05:39,322 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 18.079
2020-09-23 10:05:46,813 [INFO] /usr/local/lib/python3.6/dist-packages/modulus/hooks/task_progress_monitor_hook.pyc: Epoch 29/80: loss: 0.00081 Time taken: 0:00:35.031645 ETA: 0:29:46.613910
2020-09-23 10:05:50,350 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 18.135
2020-09-23 10:06:01,449 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 18.021

Moving this topic from Xavier forum into TLT forum.

@sukyoung.cho
Could you paste one label file here?
Also, please paste the full log from generating the tfrecords with tlt-dataset-convert.

Did you resize your images/labels to 960x544?

Yes, I resized the images and labels to 960x544. I used almost the same config file as the one I used to train TrafficCamNet (the model I successfully trained and exported).

A new update: I was able to train MobileNet_v2 using the config file uploaded in another thread (Retraining with pretrained tlt models). However, I wonder why the mAP after 200 epochs is only about 69%. TrafficCamNet gave a similar mAP, which is much lower than the benchmark you provided and than I expected. I would appreciate it if you could tell me what might be causing the low mAP.

Not sure if this is what you mean by one label file.

Label File 1

car 0 3 -1 46 358 195 449 -1 -1 -1 -1 -1 -1 -1
car 0 3 -1 366 291 426 335 -1 -1 -1 -1 -1 -1 -1
car 0 3 -1 418 238 451 262 -1 -1 -1 -1 -1 -1 -1
car 0 3 -1 478 238 511 258 -1 -1 -1 -1 -1 -1 -1
car 0 3 -1 477 216 499 233 -1 -1 -1 -1 -1 -1 -1
car 0 3 -1 379 214 423 245 -1 -1 -1 -1 -1 -1 -1

Label File 2

kitti_config {
root_directory_path: "/workspace/data/day_all"
image_dir_name: "day_all_jpg"
label_dir_name: "day_all_txt"
image_extension: ".jpg"
partition_mode: "random"
num_partitions: 2
# split percentage of train_validation
val_split: 30
# how many make tfrecord shards?
num_shards: 1
}

Your label is not correct. See https://docs.nvidia.com/metropolis/TLT/tlt-getting-started-guide/index.html#label_file. The label above should be changed to

car 0.0 0 0.0 46 358 195 449 0.00 0.00 0.00 0.00 0.00 0.00 0.00
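As a rough illustration of the suggested change, the sketch below rewrites a label line into the form shown above. The function name `normalize_kitti_line` is hypothetical, and resetting truncation, occlusion, alpha, and the seven 3D fields to zero is an assumption taken only from the example line above, not from the TLT tooling.

```python
def normalize_kitti_line(line: str) -> str:
    """Rewrite one 15-field KITTI label line into the form suggested above."""
    fields = line.split()
    if len(fields) != 15:
        raise ValueError(f"expected 15 fields, got {len(fields)}: {line!r}")
    class_name = fields[0]
    bbox = fields[4:8]  # xmin ymin xmax ymax, kept unchanged
    # Assumption: truncation, occlusion, and alpha are reset to 0,
    # and the seven unused 3D fields are written as 0.00.
    return " ".join([class_name, "0.0", "0", "0.0", *bbox, *["0.00"] * 7])


print(normalize_kitti_line("car 0 3 -1 46 358 195 449 -1 -1 -1 -1 -1 -1 -1"))
```

Only the class name and the four bounding-box coordinates are carried over; everything else is overwritten, so this would not be suitable for labels whose truncation or occlusion values matter.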

Also, regarding your question “why the mAP after 200 epoch is only about 69%”: where does the 69% appear? I did not find any information about it.

  1. I will try training with the new label format you suggested.

  2. It is not uploaded here because that run was done this morning.

Thank you! I will post an update after trying the new label files.

Once that is done, please upload your full log as an attachment instead of pasting it directly into the forum. Thanks.


Do you mean the full training log? Thanks.

Yes.

Also, since your training dataset has only 630 images, I suggest trying different batch sizes too.

Problem Solved

After a lot of experiments, I found that changing the value of {minimum_bounding_box_height} under {postprocessing_config} solved the problem.

Changing minimum_bounding_box_height from 20 to 4 dramatically increased the mAP, even after only 10 epochs.
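One possible reading of why this helped: some of the cars in the label file posted earlier are only about 17 to 24 pixels tall, so a 20-pixel minimum height sits right at the size of the smaller objects. The sketch below checks the six labeled boxes from that file against both thresholds. The helper `boxes_kept` is hypothetical, and keeping boxes whose height is greater than or equal to the threshold is an assumption about how the filter behaves, not confirmed TLT behavior.

```python
# (xmin, ymin, xmax, ymax) copied from "Label File 1" earlier in the thread
LABEL_BOXES = [
    (46, 358, 195, 449),
    (366, 291, 426, 335),
    (418, 238, 451, 262),
    (478, 238, 511, 258),
    (477, 216, 499, 233),
    (379, 214, 423, 245),
]


def boxes_kept(boxes, min_height):
    """Return the boxes whose pixel height is at least min_height (assumed rule)."""
    return [box for box in boxes if (box[3] - box[1]) >= min_height]


for threshold in (20, 4):
    kept = boxes_kept(LABEL_BOXES, threshold)
    print(f"min height {threshold}: {len(kept)} of {len(LABEL_BOXES)} boxes kept")
```

This single file is only a small sample and does not by itself account for an mAP of 0, but it shows that the dataset contains objects near the original threshold.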

Thanks for the info. I appreciate your work.