During training, the mAP value becomes 0

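I am training a YOLO model with the darknet53 backbone in TLT.
At first the training went well (mAP 0.992 at epoch 29), then the loss suddenly jumped from about 0.107 at epoch 31 to 1.69 at epoch 32, and the mAP has been 0 at every validation since epoch 34.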
Why is this?

random_seed: 42
yolo_config {
  big_anchor_shape: "[(116,90), (156,198), (373,326)]"
  mid_anchor_shape: "[(30,61), (62,45), (59,119)]"
  small_anchor_shape: "[(10,13), (16,30), (33,23)]"
  matching_neutral_box_iou: 0.5

  arch: "darknet"
  nlayers: 53
  arch_conv_blocks: 2

  loss_loc_weight: 5.0
  loss_neg_obj_weights: 50.0
  loss_class_weights: 1.0

  freeze_bn: True
}
training_config {
  batch_size_per_gpu: 4
  num_epochs: 80
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 1e-6
      max_learning_rate: 1e-4
      soft_start: 0.1
      annealing: 0.5
    }
  }
  regularizer {
    type: L1
    weight: 3.0e-06
  }
}
eval_config {
  validation_period_during_training: 5
  average_precision_mode: SAMPLE
  batch_size: 32
  matching_iou_threshold: 0.5
}
nms_config {
  confidence_threshold: 0.05
  clustering_iou_threshold: 0.5
  top_k: 200
}
augmentation_config {
  preprocessing {
    output_image_width: 960
    output_image_height: 544
    output_image_channel: 3
    crop_right: 960
    crop_bottom: 544
    min_bbox_width: 1.0
    min_bbox_height: 1.0
  }
  spatial_augmentation {
    hflip_probability: 0.5
    vflip_probability: 0.0
    zoom_min: 0.7
    zoom_max: 1.8
    translate_max_x: 8.0
    translate_max_y: 8.0
  }
  color_augmentation {
    hue_rotation_max: 25.0
    saturation_shift_max: 0.20000000298
    contrast_scale_max: 0.10000000149
    contrast_center: 0.5
  }
}
dataset_config {
  data_sources: {
    tfrecords_path: "/workspace/tlt-experiments/tlt-experiments/tfrecords/yolo*"
    image_directory_path: "/workspace/tlt-experiments/dataset/"
  }
  image_extension: "jpg"
  target_class_mapping {
    key: "a"
    value: "a"
  }
  target_class_mapping {
    key: "b"
    value: "b"
  }
  target_class_mapping {
    key: "c"
    value: "c"
  }
  target_class_mapping {
    key: "ca"
    value: "ca"
  }
  target_class_mapping {
    key: "p"
    value: "p"
  }
  target_class_mapping {
    key: "f"
    value: "f"
  }
  target_class_mapping {
    key: "ch"
    value: "ch"
  }
  validation_fold: 0
}

Using TensorFlow backend.
Using TensorFlow backend.
Using TensorFlow backend.
Using TensorFlow backend.
Using TensorFlow backend.
Using TensorFlow backend.
Using TensorFlow backend.
Using TensorFlow backend.

[[61000,1],6]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
Host: b87a823f08e9

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.

[b87a823f08e9:75994] 7 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
[b87a823f08e9:75994] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
2020-06-15 08:22:11,257 [INFO] iva.yolo.scripts.train: Loading experiment spec at /workspace/tlt-experiments/tlt-experiments/specs/yolo_train_resnet18_kitti_1st.txt.
2020-06-15 08:22:11,260 [INFO] /usr/local/lib/python2.7/dist-packages/iva/yolo/utils/spec_loader.pyc: Merging specification from /workspace/tlt-experiments/tlt-experiments/specs/yolo_train_resnet18_kitti_1st.txt
2020-06-15 08:22:11,269 [INFO] iva.yolo.scripts.train: Loading experiment spec at /workspace/tlt-experiments/tlt-experiments/specs/yolo_train_resnet18_kitti_1st.txt.
2020-06-15 08:22:11,272 [INFO] /usr/local/lib/python2.7/dist-packages/iva/yolo/utils/spec_loader.pyc: Merging specification from /workspace/tlt-experiments/tlt-experiments/specs/yolo_train_resnet18_kitti_1st.txt
2020-06-15 08:22:11,271 [INFO] iva.yolo.scripts.train: Loading experiment spec at /workspace/tlt-experiments/tlt-experiments/specs/yolo_train_resnet18_kitti_1st.txt.
2020-06-15 08:22:11,273 [INFO] /usr/local/lib/python2.7/dist-packages/iva/yolo/utils/spec_loader.pyc: Merging specification from /workspace/tlt-experiments/tlt-experiments/specs/yolo_train_resnet18_kitti_1st.txt
2020-06-15 08:22:11,332 [INFO] iva.yolo.scripts.train: Loading experiment spec at /workspace/tlt-experiments/tlt-experiments/specs/yolo_train_resnet18_kitti_1st.txt.
2020-06-15 08:22:11,335 [INFO] /usr/local/lib/python2.7/dist-packages/iva/yolo/utils/spec_loader.pyc: Merging specification from /workspace/tlt-experiments/tlt-experiments/specs/yolo_train_resnet18_kitti_1st.txt
2020-06-15 08:22:11,361 [INFO] iva.yolo.scripts.train: Loading experiment spec at /workspace/tlt-experiments/tlt-experiments/specs/yolo_train_resnet18_kitti_1st.txt.
2020-06-15 08:22:11,365 [INFO] /usr/local/lib/python2.7/dist-packages/iva/yolo/utils/spec_loader.pyc: Merging specification from /workspace/tlt-experiments/tlt-experiments/specs/yolo_train_resnet18_kitti_1st.txt
2020-06-15 08:22:11,367 [INFO] iva.yolo.scripts.train: Loading experiment spec at /workspace/tlt-experiments/tlt-experiments/specs/yolo_train_resnet18_kitti_1st.txt.
2020-06-15 08:22:11,370 [INFO] /usr/local/lib/python2.7/dist-packages/iva/yolo/utils/spec_loader.pyc: Merging specification from /workspace/tlt-experiments/tlt-experiments/specs/yolo_train_resnet18_kitti_1st.txt
2020-06-15 08:22:11,370 [INFO] iva.yolo.scripts.train: Loading experiment spec at /workspace/tlt-experiments/tlt-experiments/specs/yolo_train_resnet18_kitti_1st.txt.
2020-06-15 08:22:11,373 [INFO] /usr/local/lib/python2.7/dist-packages/iva/yolo/utils/spec_loader.pyc: Merging specification from /workspace/tlt-experiments/tlt-experiments/specs/yolo_train_resnet18_kitti_1st.txt
2020-06-15 08:22:11,373 [INFO] iva.yolo.scripts.train: Loading experiment spec at /workspace/tlt-experiments/tlt-experiments/specs/yolo_train_resnet18_kitti_1st.txt.
2020-06-15 08:22:11,376 [INFO] /usr/local/lib/python2.7/dist-packages/iva/yolo/utils/spec_loader.pyc: Merging specification from /workspace/tlt-experiments/tlt-experiments/specs/yolo_train_resnet18_kitti_1st.txt
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
2020-06-15 08:22:35,205 [INFO] iva.yolo.scripts.train: Loading pretrained weights. This may take a while…

Total params: 61,608,652
Trainable params: 61,556,044
Non-trainable params: 52,608


2020-06-15 08:32:37,805 [INFO] iva.yolo.scripts.train: Number of images in the training dataset: 170667
[2020-06-15 08:33:48.740614: W horovod/common/operations.cc:588] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
Stalled ops:HorovodBroadcast_yolo_conv1_1_bn_moving_variance_0 [missing ranks: 1, 2]
[2020-06-15 08:33:48.740707: W horovod/common/operations.cc:588] HorovodBroadcast_yolo_conv1_2_kernel_0 [missing ranks: 1, 2]
[2020-06-15 08:33:48.740724: W horovod/common/operations.cc:588] HorovodBroadcast_yolo_conv1_2_bn_gamma_0 [missing ranks: 1, 2]
[2020-06-15 08:33:48.740742: W horovod/common/operations.cc:588] HorovodBroadcast_yolo_conv1_2_bn_moving_mean_0 [missing ranks: 1, 2]
[2020-06-15 08:33:48.740758: W horovod/common/operations.cc:588] HorovodBroadcast_yolo_conv1_2_bn_moving_variance_0 [missing ranks: 1, 2]
[2020-06-15 08:33:48.740774: W horovod/common/operations.cc:588] HorovodBroadcast_yolo_conv1_3_bn_moving_mean_0 [missing ranks: 1, 2]
[2020-06-15 08:33:48.740788: W horovod/common/operations.cc:588] HorovodBroadcast_yolo_conv1_3_bn_gamma_0 [missing ranks: 1, 2]
[2020-06-15 08:33:48.740805: W horovod/common/operations.cc:588] HorovodBroadcast_yolo_conv1_3_bn_beta_0 [missing ranks: 1, 2]
[2020-06-15 08:33:48.740820: W horovod/common/operations.cc:588] HorovodBroadcast_yolo_conv1_4_bn_beta_0 [missing ranks: 1, 2]
[2020-06-15 08:33:48.740836: W horovod/common/operations.cc:588] HorovodBroadcast_yolo_conv1_4_bn_gamma_0 [missing ranks: 1, 2]
[2020-06-15 08:33:48.740850: W horovod/common/operations.cc:588] HorovodBroadcast_yolo_conv1_5_kernel_0 [missing ranks: 1, 2]

Epoch 00033: saving model to /workspace/tlt-experiments/tlt-experiments/yolo_air_dir_unpruned/weights/yolo_darknet53_epoch_033.tlt
Epoch 34/80
5333/5333 [==============================] - 3195s 599ms/step - loss: 3.1525
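
As a quick sanity check on the setup, the progress bar above shows 5333 steps per epoch, while the log reports 170667 training images and the spec sets batch_size_per_gpu: 4. The sketch below assumes one Horovod rank per GPU and that the trailing partial batch is dropped; neither is stated in the log, and the function names are made up for illustration. Under those assumptions, 8 GPUs is the only count that reproduces the step count.

```python
import math

NUM_IMAGES = 170667       # "Number of images in the training dataset" log line
BATCH_SIZE_PER_GPU = 4    # training_config.batch_size_per_gpu
OBSERVED_STEPS = 5333     # from the "5333/5333" progress bar


def steps_per_epoch(num_images, batch_size_per_gpu, num_gpus, drop_remainder=True):
    """Steps per epoch for a data-parallel run (hypothetical helper).

    Assumes every GPU processes batch_size_per_gpu images per step.
    Whether TLT drops or keeps the last partial batch is an assumption.
    """
    effective_batch = batch_size_per_gpu * num_gpus
    if drop_remainder:
        return num_images // effective_batch
    return math.ceil(num_images / effective_batch)


def gpu_counts_matching(observed_steps, max_gpus=16):
    """Return every GPU count in 1..max_gpus that reproduces observed_steps."""
    return [
        n for n in range(1, max_gpus + 1)
        if steps_per_epoch(NUM_IMAGES, BATCH_SIZE_PER_GPU, n) == observed_steps
    ]


if __name__ == "__main__":
    print(steps_per_epoch(NUM_IMAGES, BATCH_SIZE_PER_GPU, 8))
    print(gpu_counts_matching(OBSERVED_STEPS))
```

So the effective batch size in this run would be 4 x 8 = 32 images per step.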

epoch  AP_a         AP_b         AP_c         AP_ca        AP_ch        AP_f         AP_p         loss         mAP
29     1.000000000  0.998750474  0.990289074  0.999720648  0.992470422  0.972258886  0.992377230  0.111586675  0.992266676
30     nan          nan          nan          nan          nan          nan          nan          0.109178776  nan
31     nan          nan          nan          nan          nan          nan          nan          0.107399691  nan
32     nan          nan          nan          nan          nan          nan          nan          1.687666342  nan
33     nan          nan          nan          nan          nan          nan          nan          3.143895326  nan
34     0            0            0            0            0            0            0            2.979488941  0
35     nan          nan          nan          nan          nan          nan          nan          2.924651953  nan
36     nan          nan          nan          nan          nan          nan          nan          2.92550009   nan
37     nan          nan          nan          nan          nan          nan          nan          2.923504608  nan
38     nan          nan          nan          nan          nan          nan          nan          2.900475653  nan
39     0            0            0            0            0            0            0            2.883515123  0
40     nan          nan          nan          nan          nan          nan          nan          2.882481353  nan
41     nan          nan          nan          nan          nan          nan          nan          2.883869723  nan
42     nan          nan          nan          nan          nan          nan          nan          2.869718783  nan
43     nan          nan          nan          nan          nan          nan          nan          2.85622538   nan
44     0            0            0            0            0            0            0            2.845303312  0
45     nan          nan          nan          nan          nan          nan          nan          2.83496629   nan
46     nan          nan          nan          nan          nan          nan          nan          2.822654668  nan
47     nan          nan          nan          nan          nan          nan          nan          2.820003481  nan
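
To pin down where the run went wrong, the loss column of the table above can be scanned for the first large jump. This is only a sketch: the helper name and the 5x threshold are arbitrary choices for illustration, and only the first few epochs are copied in. The nan rows appear to be epochs with no validation run, which would match validation_period_during_training: 5.

```python
# Loss values copied from the table above (epochs 29 to 35).
LOSS_BY_EPOCH = {
    29: 0.111586675,
    30: 0.109178776,
    31: 0.107399691,
    32: 1.687666342,
    33: 3.143895326,
    34: 2.979488941,
    35: 2.924651953,
}


def first_divergence(loss_by_epoch, factor=5.0):
    """Return the first epoch whose loss exceeds `factor` times the
    previous epoch's loss, or None if no such jump exists.

    `factor` is a hypothetical threshold, not a TLT setting.
    """
    epochs = sorted(loss_by_epoch)
    for prev, cur in zip(epochs, epochs[1:]):
        if loss_by_epoch[cur] > factor * loss_by_epoch[prev]:
            return cur
    return None


if __name__ == "__main__":
    print(first_divergence(LOSS_BY_EPOCH))
```

On this data the jump is flagged at epoch 32, two epochs before the first validation that reports an mAP of 0.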

It seems that you are training on 8 GPUs. I suggest tuning the max learning rate first.
Consider increasing it and retrying.
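
For reference when tuning, here is a rough sketch of what the soft_start_annealing_schedule in the spec does over the 80 epochs. It assumes a log-linear (exponential) ramp up during the first soft_start fraction of training and a log-linear decay after the annealing fraction; the exact interpolation TLT uses may differ, and the function name is made up. What does not depend on that assumption is the plateau: between progress 0.1 and 0.5 (epochs 8 to 40 of 80) the rate sits at max_learning_rate.

```python
def soft_start_annealing_lr(progress, min_lr=1e-6, max_lr=1e-4,
                            soft_start=0.1, annealing=0.5):
    """Learning rate at training progress in [0, 1] (hypothetical helper).

    Defaults mirror the spec above. The exponential interpolation in the
    ramp and decay phases is an assumption, not confirmed TLT behaviour.
    """
    if progress < soft_start:
        t = progress / soft_start              # 0 -> 1 over the warm-up
        return min_lr * (max_lr / min_lr) ** t
    if progress < annealing:
        return max_lr                          # plateau at the max rate
    t = (progress - annealing) / (1.0 - annealing)  # 0 -> 1 over the decay
    return max_lr * (min_lr / max_lr) ** t


if __name__ == "__main__":
    num_epochs = 80
    for epoch in (0, 4, 8, 32, 40, 60, 80):
        lr = soft_start_annealing_lr(epoch / num_epochs)
        print(f"epoch {epoch:2d}: lr = {lr:.2e}")
```

Under this reading, the loss jump at epoch 32 happened while the schedule was still at max_learning_rate, so that is the value changing it would affect at that point in training.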