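I am training a YOLO model with a DarkNet53 backbone in TLT.
Training went well at first: at epoch 29 the loss was about 0.11 and the mAP was 0.992. Then the loss suddenly jumped to 1.69 at epoch 32 and 3.14 at epoch 33, and the mAP has been 0 at every evaluation since epoch 34.
Why is this? My training spec, the training log, and the per-epoch results are below.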
random_seed: 42
yolo_config {
big_anchor_shape: "[(116,90), (156,198), (373,326)]"
mid_anchor_shape: "[(30,61), (62,45), (59,119)]"
small_anchor_shape: "[(10,13), (16,30), (33,23)]"
matching_neutral_box_iou: 0.5
arch: "darknet"
nlayers: 53
arch_conv_blocks: 2
loss_loc_weight: 5.0
loss_neg_obj_weights: 50.0
loss_class_weights: 1.0
freeze_bn: True
}
training_config {
batch_size_per_gpu: 4
num_epochs: 80
learning_rate {
soft_start_annealing_schedule {
min_learning_rate: 1e-6
max_learning_rate: 1e-4
soft_start: 0.1
annealing: 0.5
}
}
regularizer {
type: L1
weight: 3.0e-06
}
}
eval_config {
validation_period_during_training: 5
average_precision_mode: SAMPLE
batch_size: 32
matching_iou_threshold: 0.5
}
nms_config {
confidence_threshold: 0.05
clustering_iou_threshold: 0.5
top_k: 200
}
augmentation_config {
preprocessing {
output_image_width: 960
output_image_height: 544
output_image_channel: 3
crop_right: 960
crop_bottom: 544
min_bbox_width: 1.0
min_bbox_height: 1.0
}
spatial_augmentation {
hflip_probability: 0.5
vflip_probability: 0.0
zoom_min: 0.7
zoom_max: 1.8
translate_max_x: 8.0
translate_max_y: 8.0
}
color_augmentation {
hue_rotation_max: 25.0
saturation_shift_max: 0.20000000298
contrast_scale_max: 0.10000000149
contrast_center: 0.5
}
}
dataset_config {
data_sources: {
tfrecords_path: "/workspace/tlt-experiments/tlt-experiments/tfrecords/yolo*"
image_directory_path: "/workspace/tlt-experiments/dataset/"
}
image_extension: "jpg"
target_class_mapping {
key: "a"
value: "a"
}
target_class_mapping {
key: "b"
value: "b"
}
target_class_mapping {
key: "c"
value: "c"
}
target_class_mapping {
key: "ca"
value: "ca"
}
target_class_mapping {
key: "p"
value: "p"
}
target_class_mapping {
key: "f"
value: "f"
}
target_class_mapping {
key: "ch"
value: "ch"
}
validation_fold: 0
}
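As a rough check of where the learning-rate schedule stood when the loss jumped, here is a minimal sketch of a soft-start annealing schedule using the values from the spec. It assumes the rate is interpolated in log space during both the ramp-up and the decay; that interpolation rule is an assumption, not something confirmed by the log, and the function name `soft_start_annealing_lr` is hypothetical rather than part of TLT.

```python
import math


def soft_start_annealing_lr(progress, min_lr=1e-6, max_lr=1e-4,
                            soft_start=0.1, annealing=0.5):
    """Learning rate at a given training progress in [0, 1].

    Assumption: log-space interpolation during ramp-up and decay.
    Defaults mirror the learning_rate block of the spec.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be in [0, 1]")
    if not 0.0 <= soft_start <= annealing < 1.0:
        raise ValueError("expected 0 <= soft_start <= annealing < 1")
    log_min, log_max = math.log(min_lr), math.log(max_lr)
    if progress < soft_start:
        # Ramp up from min_lr to max_lr.
        t = progress / soft_start
        return math.exp(log_min + t * (log_max - log_min))
    if progress < annealing:
        # Plateau at the maximum rate.
        return max_lr
    # Decay from max_lr back to min_lr.
    t = (progress - annealing) / (1.0 - annealing)
    return math.exp(log_max + t * (log_min - log_max))


NUM_EPOCHS = 80  # num_epochs from the spec
for epoch in (0, 8, 32, 40, 79):
    print(epoch, soft_start_annealing_lr(epoch / NUM_EPOCHS))
```

Under this assumption the ramp-up ends around epoch 8 and the decay only starts around epoch 40, so the loss jump at epoch 32 happened while the rate was still at max_learning_rate (1e-4).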
To resume from checkpoint, please uncomment and run this instead. Change last two arguments accordingly.
Using TensorFlow backend.
[[61000,1],6]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:
Module: OpenFabrics (openib)
Host: b87a823f08e9
Another transport will be used instead, although this may result in
lower performance.
NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
[b87a823f08e9:75994] 7 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
[b87a823f08e9:75994] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
2020-06-15 08:22:11,257 [INFO] iva.yolo.scripts.train: Loading experiment spec at /workspace/tlt-experiments/tlt-experiments/specs/yolo_train_resnet18_kitti_1st.txt.
2020-06-15 08:22:11,260 [INFO] /usr/local/lib/python2.7/dist-packages/iva/yolo/utils/spec_loader.pyc: Merging specification from /workspace/tlt-experiments/tlt-experiments/specs/yolo_train_resnet18_kitti_1st.txt
2020-06-15 08:22:11,269 [INFO] iva.yolo.scripts.train: Loading experiment spec at /workspace/tlt-experiments/tlt-experiments/specs/yolo_train_resnet18_kitti_1st.txt.
2020-06-15 08:22:11,272 [INFO] /usr/local/lib/python2.7/dist-packages/iva/yolo/utils/spec_loader.pyc: Merging specification from /workspace/tlt-experiments/tlt-experiments/specs/yolo_train_resnet18_kitti_1st.txt
2020-06-15 08:22:11,271 [INFO] iva.yolo.scripts.train: Loading experiment spec at /workspace/tlt-experiments/tlt-experiments/specs/yolo_train_resnet18_kitti_1st.txt.
2020-06-15 08:22:11,273 [INFO] /usr/local/lib/python2.7/dist-packages/iva/yolo/utils/spec_loader.pyc: Merging specification from /workspace/tlt-experiments/tlt-experiments/specs/yolo_train_resnet18_kitti_1st.txt
2020-06-15 08:22:11,332 [INFO] iva.yolo.scripts.train: Loading experiment spec at /workspace/tlt-experiments/tlt-experiments/specs/yolo_train_resnet18_kitti_1st.txt.
2020-06-15 08:22:11,335 [INFO] /usr/local/lib/python2.7/dist-packages/iva/yolo/utils/spec_loader.pyc: Merging specification from /workspace/tlt-experiments/tlt-experiments/specs/yolo_train_resnet18_kitti_1st.txt
2020-06-15 08:22:11,361 [INFO] iva.yolo.scripts.train: Loading experiment spec at /workspace/tlt-experiments/tlt-experiments/specs/yolo_train_resnet18_kitti_1st.txt.
2020-06-15 08:22:11,365 [INFO] /usr/local/lib/python2.7/dist-packages/iva/yolo/utils/spec_loader.pyc: Merging specification from /workspace/tlt-experiments/tlt-experiments/specs/yolo_train_resnet18_kitti_1st.txt
2020-06-15 08:22:11,367 [INFO] iva.yolo.scripts.train: Loading experiment spec at /workspace/tlt-experiments/tlt-experiments/specs/yolo_train_resnet18_kitti_1st.txt.
2020-06-15 08:22:11,370 [INFO] /usr/local/lib/python2.7/dist-packages/iva/yolo/utils/spec_loader.pyc: Merging specification from /workspace/tlt-experiments/tlt-experiments/specs/yolo_train_resnet18_kitti_1st.txt
2020-06-15 08:22:11,370 [INFO] iva.yolo.scripts.train: Loading experiment spec at /workspace/tlt-experiments/tlt-experiments/specs/yolo_train_resnet18_kitti_1st.txt.
2020-06-15 08:22:11,373 [INFO] /usr/local/lib/python2.7/dist-packages/iva/yolo/utils/spec_loader.pyc: Merging specification from /workspace/tlt-experiments/tlt-experiments/specs/yolo_train_resnet18_kitti_1st.txt
2020-06-15 08:22:11,373 [INFO] iva.yolo.scripts.train: Loading experiment spec at /workspace/tlt-experiments/tlt-experiments/specs/yolo_train_resnet18_kitti_1st.txt.
2020-06-15 08:22:11,376 [INFO] /usr/local/lib/python2.7/dist-packages/iva/yolo/utils/spec_loader.pyc: Merging specification from /workspace/tlt-experiments/tlt-experiments/specs/yolo_train_resnet18_kitti_1st.txt
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
2020-06-15 08:22:35,205 [INFO] iva.yolo.scripts.train: Loading pretrained weights. This may take a while…
Total params: 61,608,652
Trainable params: 61,556,044
Non-trainable params: 52,608
2020-06-15 08:32:37,805 [INFO] iva.yolo.scripts.train: Number of images in the training dataset: 170667
[2020-06-15 08:33:48.740614: W horovod/common/operations.cc:588] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
Stalled ops:HorovodBroadcast_yolo_conv1_1_bn_moving_variance_0 [missing ranks: 1, 2]
[2020-06-15 08:33:48.740707: W horovod/common/operations.cc:588] HorovodBroadcast_yolo_conv1_2_kernel_0 [missing ranks: 1, 2]
[2020-06-15 08:33:48.740724: W horovod/common/operations.cc:588] HorovodBroadcast_yolo_conv1_2_bn_gamma_0 [missing ranks: 1, 2]
[2020-06-15 08:33:48.740742: W horovod/common/operations.cc:588] HorovodBroadcast_yolo_conv1_2_bn_moving_mean_0 [missing ranks: 1, 2]
[2020-06-15 08:33:48.740758: W horovod/common/operations.cc:588] HorovodBroadcast_yolo_conv1_2_bn_moving_variance_0 [missing ranks: 1, 2]
[2020-06-15 08:33:48.740774: W horovod/common/operations.cc:588] HorovodBroadcast_yolo_conv1_3_bn_moving_mean_0 [missing ranks: 1, 2]
[2020-06-15 08:33:48.740788: W horovod/common/operations.cc:588] HorovodBroadcast_yolo_conv1_3_bn_gamma_0 [missing ranks: 1, 2]
[2020-06-15 08:33:48.740805: W horovod/common/operations.cc:588] HorovodBroadcast_yolo_conv1_3_bn_beta_0 [missing ranks: 1, 2]
[2020-06-15 08:33:48.740820: W horovod/common/operations.cc:588] HorovodBroadcast_yolo_conv1_4_bn_beta_0 [missing ranks: 1, 2]
[2020-06-15 08:33:48.740836: W horovod/common/operations.cc:588] HorovodBroadcast_yolo_conv1_4_bn_gamma_0 [missing ranks: 1, 2]
[2020-06-15 08:33:48.740850: W horovod/common/operations.cc:588] HorovodBroadcast_yolo_conv1_5_kernel_0 [missing ranks: 1, 2]
Epoch 00033: saving model to /workspace/tlt-experiments/tlt-experiments/yolo_air_dir_unpruned/weights/yolo_darknet53_epoch_033.tlt
Epoch 34/80
5333/5333 [==============================] - 3195s 599ms/step - loss: 3.1525
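The step count in the progress bar is consistent with 8 training processes. The sketch below shows the arithmetic. The count of 8 is inferred from the "7 more processes" line in the log, and it is an assumption that the last partial batch is dropped; `steps_per_epoch` is a hypothetical helper, not a TLT function.

```python
def steps_per_epoch(num_images, batch_size_per_gpu, num_gpus):
    """Steps per epoch, assuming the final partial batch is dropped."""
    global_batch = batch_size_per_gpu * num_gpus  # images consumed per step
    return num_images // global_batch


# 170667 training images (from the log), batch_size_per_gpu: 4 (from the spec),
# 8 processes (assumed from the Open MPI help-message line).
print(steps_per_epoch(170667, 4, 8))  # 5333
```

With these numbers the effective batch size is 32 images per step.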
epoch | AP_a | AP_b | AP_c | AP_ca | AP_ch | AP_f | AP_p | loss | mAP |
---|---|---|---|---|---|---|---|---|---|
29 | 1.000000000 | 0.998750474 | 0.990289074 | 0.999720648 | 0.992470422 | 0.972258886 | 0.992377230 | 0.111586675 | 0.992266676 |
30 | nan | nan | nan | nan | nan | nan | nan | 0.109178776 | nan |
31 | nan | nan | nan | nan | nan | nan | nan | 0.107399691 | nan |
32 | nan | nan | nan | nan | nan | nan | nan | 1.687666342 | nan |
33 | nan | nan | nan | nan | nan | nan | nan | 3.143895326 | nan |
34 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.979488941 | 0 |
35 | nan | nan | nan | nan | nan | nan | nan | 2.924651953 | nan |
36 | nan | nan | nan | nan | nan | nan | nan | 2.92550009 | nan |
37 | nan | nan | nan | nan | nan | nan | nan | 2.923504608 | nan |
38 | nan | nan | nan | nan | nan | nan | nan | 2.900475653 | nan |
39 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.883515123 | 0 |
40 | nan | nan | nan | nan | nan | nan | nan | 2.882481353 | nan |
41 | nan | nan | nan | nan | nan | nan | nan | 2.883869723 | nan |
42 | nan | nan | nan | nan | nan | nan | nan | 2.869718783 | nan |
43 | nan | nan | nan | nan | nan | nan | nan | 2.85622538 | nan |
44 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.845303312 | 0 |
45 | nan | nan | nan | nan | nan | nan | nan | 2.83496629 | nan |
46 | nan | nan | nan | nan | nan | nan | nan | 2.822654668 | nan |
47 | nan | nan | nan | nan | nan | nan | nan | 2.820003481 | nan |
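To pin down the epoch where training diverged, a small sketch like the following can scan the loss column. The threshold (a loss more than 5 times the lowest loss seen so far) is an arbitrary assumption chosen for illustration, and `first_divergence` is a hypothetical helper.

```python
def first_divergence(losses, factor=5.0):
    """Return the first epoch whose loss exceeds `factor` times the
    lowest loss seen so far, or None if no such epoch exists."""
    best = float("inf")
    for epoch, loss in sorted(losses.items()):
        if loss > factor * best:
            return epoch
        best = min(best, loss)
    return None


# Loss values copied from the results table above.
LOSSES = {
    29: 0.111586675,
    30: 0.109178776,
    31: 0.107399691,
    32: 1.687666342,
    33: 3.143895326,
    34: 2.979488941,
}

print(first_divergence(LOSSES))  # 32
```

By this measure the divergence starts at epoch 32, two epochs before the first evaluation that reports an mAP of 0.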