TLT with YOLOv3 achieved 0 mAP after 120 epochs

Hi,

Please have a look at the results of training my dataset on YOLOv3 with the Transfer Learning Toolkit. I achieved a reasonably good mAP when I trained the dataset with detectnet_v2 and was able to deploy that model to DeepStream, but the same does not work with YOLOv3.

tlt-streamanalytics:v2.0_py3
Ubuntu 18.04
GPU Geforce 1650 4GB

Dataset :
Image : JPG ( 480*288)
Label : KITTI.

 car 0.0 0 0.0 65.8 103.8 146.8 152.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
 person 0.0 0 0.0 70.7 80.3 81.3 102.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
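A quick sanity check of the label files can rule out malformed boxes before training. The sketch below parses the two sample lines above, assuming the standard 15-field KITTI layout with the box stored as xmin, ymin, xmax, ymax in fields 5 to 8. The helper names (`parse_kitti_line`, `box_problems`) are made up for illustration and are not part of TLT.

```python
from typing import List, NamedTuple


class KittiBox(NamedTuple):
    label: str
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def parse_kitti_line(line: str) -> KittiBox:
    """Parse one KITTI label line (15 whitespace-separated fields)."""
    fields = line.split()
    if len(fields) != 15:
        raise ValueError(f"expected 15 fields, got {len(fields)}: {line!r}")
    xmin, ymin, xmax, ymax = (float(v) for v in fields[4:8])
    return KittiBox(fields[0], xmin, ymin, xmax, ymax)


def box_problems(box: KittiBox, width: int = 480, height: int = 288) -> List[str]:
    """Return a list of human-readable problems; empty means the box looks fine."""
    problems = []
    if box.xmax <= box.xmin or box.ymax <= box.ymin:
        problems.append("empty or inverted box")
    if box.xmin < 0 or box.ymin < 0 or box.xmax > width or box.ymax > height:
        problems.append("box extends outside the image")
    return problems


# The two label lines quoted in the post (leading space included).
sample = [
    " car 0.0 0 0.0 65.8 103.8 146.8 152.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0",
    " person 0.0 0 0.0 70.7 80.3 81.3 102.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0",
]
for line in sample:
    box = parse_kitti_line(line)
    size = (round(box.xmax - box.xmin, 1), round(box.ymax - box.ymin, 1))
    print(box.label, "width x height:", size, "problems:", box_problems(box))
```

Running the same check over every file in label_2 would show whether any box is inverted or falls outside the 480*288 frame.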

TF record config file
kitti_config {
root_directory_path: "/workspace/dataset/training"
image_dir_name: "image_2"
label_dir_name: "label_2"
image_extension: ".jpg"
partition_mode: "random"
num_partitions: 2
val_split: 20
num_shards: 10
}
image_directory_path: "/workspace/dataset/training"

Train config file
random_seed: 42
yolo_config {
big_anchor_shape: "[(84.60, 40.60), (97.70, 61.50), (131.90, 102.10)]"
mid_anchor_shape: "[(63.00, 25.80), (44.20, 39.30), (69.00, 31.50)]"
small_anchor_shape: "[(10.60, 21.00), (15.70, 28.30), (36.00, 26.70)]"
matching_neutral_box_iou: 0.5
arch: "resnet"
nlayers: 18
arch_conv_blocks: 0
loss_loc_weight: 5.0
loss_neg_obj_weights: 50.0
loss_class_weights: 1.0
freeze_bn: True
freeze_blocks: 0
freeze_blocks: 1
}
training_config {
batch_size_per_gpu: 5
num_epochs: 10
enable_qat: false
learning_rate {
soft_start_annealing_schedule {
min_learning_rate: 5e-5
max_learning_rate: 2e-2
soft_start: 0.15
annealing: 0.8
}
}
regularizer {
type: L1
weight: 3e-5
}
}
eval_config {
validation_period_during_training: 1
average_precision_mode: INTEGRATE
batch_size: 5
matching_iou_threshold: 0.3
}
nms_config {
confidence_threshold: 0.01
clustering_iou_threshold: 0.6
top_k: 200
}
augmentation_config {
preprocessing {
output_image_width: 480
output_image_height: 288
output_image_channel: 3
min_bbox_width: 1.0
min_bbox_height: 1.0
}
spatial_augmentation {
hflip_probability: 0.5
vflip_probability: 0.0
zoom_min: 0.7
zoom_max: 1.8
translate_max_x: 8.0
translate_max_y: 8.0
}
color_augmentation {
hue_rotation_max: 25.0
saturation_shift_max: 0.20000000298
contrast_scale_max: 0.10000000149
contrast_center: 0.5
}
}
dataset_config {
data_sources: {
tfrecords_path: "/workspace/tf_records/*"
image_directory_path: "/workspace/dataset/training"
}
image_extension: "jpg"
target_class_mapping {
key: "person"
value: "person"
}
target_class_mapping {
key: "car"
value: "car"
}
target_class_mapping {
key: "bus"
value: "bus"
}
target_class_mapping {
key: "truck"
value: "truck"
}
target_class_mapping {
key: "motorcycle"
value: "motorcycle"
}
target_class_mapping {
key: "bicycle"
value: "bicycle"
}
}
validation_fold: 0
}
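To see what learning rate the soft_start_annealing_schedule in the spec above would produce over a run, here is a small sketch. It assumes the schedule ramps exponentially from min_learning_rate to max_learning_rate during the first `soft_start` fraction of training, holds the maximum, and then decays exponentially back to the minimum after the `annealing` fraction. That shape is an assumption about the toolkit's behaviour, and `soft_start_annealing_lr` is an illustrative name, not a TLT function.

```python
import math


def soft_start_annealing_lr(progress, min_lr=5e-5, max_lr=2e-2,
                            soft_start=0.15, annealing=0.8):
    """Learning rate at a given training progress in [0, 1].

    Assumed shape: exponential ramp-up, plateau, exponential decay.
    Defaults are the values from the training spec in this thread.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be between 0 and 1")
    if progress < soft_start:
        t = progress / soft_start                    # ramp-up phase
    elif progress < annealing:
        t = 1.0                                      # plateau at max_lr
    else:
        t = (1.0 - progress) / (1.0 - annealing)     # decay phase
    # Interpolate between min_lr and max_lr on a log scale.
    return math.exp(math.log(min_lr) + t * (math.log(max_lr) - math.log(min_lr)))


# 208 steps per epoch (from the training log) times 10 epochs.
total_steps = 208 * 10
for step in (0, 156, 312, 1040, 1664, 1872, 2080):
    lr = soft_start_annealing_lr(step / total_steps)
    print(f"step {step:>4}: lr = {lr:.6f}")
```

With these settings the rate would sit at the 2e-2 maximum for most of the run, which may be worth lowering as one experiment if the loss falls but the predictions stay empty.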


TF records results:
2020-09-27 22:53:39.245952: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
Using TensorFlow backend.
2020-09-27 22:53:41,037 - iva.detectnet_v2.dataio.build_converter - INFO - Instantiating a kitti converter
2020-09-27 22:53:41,037 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Creating output directory /workspace/tf_records
2020-09-27 22:53:41,040 - iva.detectnet_v2.dataio.kitti_converter_lib - INFO - Num images in
Train: 1039 Val: 259
2020-09-27 22:53:41,040 - iva.detectnet_v2.dataio.kitti_converter_lib - INFO - Validation data in partition 0. Hence, while choosing the validationset during training choose validation_fold 0.
2020-09-27 22:53:41,041 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 0
WARNING:tensorflow:From /home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/dataio/dataset_converter_lib.py:142: The name tf.python_io.TFRecordWriter is deprecated. Please use tf.io.TFRecordWriter instead.

2020-09-27 22:53:41,041 - tensorflow - WARNING - From /home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/dataio/dataset_converter_lib.py:142: The name tf.python_io.TFRecordWriter is deprecated. Please use tf.io.TFRecordWriter instead.

/usr/local/lib/python3.6/dist-packages/iva/detectnet_v2/dataio/kitti_converter_lib.py:273: VisibleDeprecationWarning: Reading unicode strings without specifying the encoding argument is deprecated. Set the encoding, use None for the system default.
2020-09-27 22:53:41,063 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 1
2020-09-27 22:53:41,082 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 2
2020-09-27 22:53:41,101 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 3
2020-09-27 22:53:41,119 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 4
2020-09-27 22:53:41,137 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 5
2020-09-27 22:53:41,155 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 6
2020-09-27 22:53:41,172 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 7
2020-09-27 22:53:41,191 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 8
2020-09-27 22:53:41,209 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 9
2020-09-27 22:53:41,234 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO -
Wrote the following numbers of objects:
b'car': 266
b'person': 250
b'bus': 53
b'motorcycle': 3
b'truck': 14
b'bicycle': 7

2020-09-27 22:53:41,234 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 0
2020-09-27 22:53:41,306 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 1
2020-09-27 22:53:41,379 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 2
2020-09-27 22:53:41,451 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 3
2020-09-27 22:53:41,525 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 4
2020-09-27 22:53:41,599 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 5
2020-09-27 22:53:41,673 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 6
2020-09-27 22:53:41,752 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 7
2020-09-27 22:53:41,838 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 8
2020-09-27 22:53:41,922 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 9
2020-09-27 22:53:42,014 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO -
Wrote the following numbers of objects:
b'person': 1037
b'car': 935
b'truck': 46
b'bus': 206
b'motorcycle': 12
b'bicycle': 13

2020-09-27 22:53:42,014 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Cumulative object statistics
2020-09-27 22:53:42,014 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO -
Wrote the following numbers of objects:
b'car': 1201
b'person': 1287
b'bus': 259
b'motorcycle': 15
b'truck': 60
b'bicycle': 20

2020-09-27 22:53:42,014 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Class map.
Label in GT: Label in tfrecords file
b'car': b'car'
b'person': b'person'
b'bus': b'bus'
b'motorcycle': b'motorcycle'
b'truck': b'truck'
b'bicycle': b'bicycle'
For the dataset_config in the experiment_spec, please use labels in the tfrecords file, while writing the classmap.

2020-09-27 22:53:42,014 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Tfrecords generation complete.
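The counts in the converter log can be cross-checked with a few lines of arithmetic. The sketch below copies the per-partition object counts from the log above (partition 0 is the validation fold) and prints each class's share of all objects, plus the effective validation fraction. Nothing here calls TLT; it only re-adds the logged numbers.

```python
# Object counts copied from the log: partition 0 (validation) and partition 1 (training).
val = {"car": 266, "person": 250, "bus": 53, "motorcycle": 3, "truck": 14, "bicycle": 7}
train = {"person": 1037, "car": 935, "truck": 46, "bus": 206, "motorcycle": 12, "bicycle": 13}

# Per-class totals should match the "Cumulative object statistics" section.
total = {name: val[name] + train[name] for name in val}
n_objects = sum(total.values())

for name, count in sorted(total.items(), key=lambda item: -item[1]):
    print(f"{name:<12}{count:>6}{100 * count / n_objects:>7.1f}%")

# Image split reported by the converter: Train 1039, Val 259.
val_fraction = 259 / (1039 + 259)
print(f"validation fraction: {val_fraction:.3f}")
```

The totals agree with the cumulative statistics in the log, and they also show how few motorcycle, truck, and bicycle objects there are compared with car and person.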


Train result


Epoch 8/10
208/208 [==============================] - 46s 220ms/step - loss: 10.3667

Epoch 00008: saving model to /workspace/trained_model_yolo/weights/yolo_resnet18_epoch_008.tlt
Number of images in the evaluation dataset: 259

Producing predictions: 100%|████████████████████| 52/52 [00:06<00:00, 7.55it/s]
Start multi-thread per-image matching
Start to calculate AP for each class


bicycle AP 0.0
bus AP 0.0
car AP 0.0
motorcycle AP 0.0
person AP 0.0
truck AP 0.0
mAP 0.0


Epoch 9/10
208/208 [==============================] - 45s 215ms/step - loss: 9.3539

Epoch 00009: saving model to /workspace/trained_model_yolo/weights/yolo_resnet18_epoch_009.tlt
Number of images in the evaluation dataset: 259

Producing predictions: 100%|████████████████████| 52/52 [00:06<00:00, 7.63it/s]
Start multi-thread per-image matching
Start to calculate AP for each class


bicycle AP 0.0
bus AP 0.0
car AP 0.0
motorcycle AP 0.0
person AP 0.0
truck AP 0.0
mAP 0.0


Epoch 10/10
208/208 [==============================] - 46s 220ms/step - loss: 8.7018

Epoch 00010: saving model to /workspace/trained_model_yolo/weights/yolo_resnet18_epoch_010.tlt
Number of images in the evaluation dataset: 259

Producing predictions: 100%|████████████████████| 52/52 [00:06<00:00, 7.52it/s]
Start multi-thread per-image matching
Start to calculate AP for each class


bicycle AP 0.0
bus AP 0.0
car AP 0.0
motorcycle AP 0.0
person AP 0.0
truck AP 0.0
mAP 0.0


I tried lowering the IoU threshold to 0.5, 0.3, and 0.2, and also used kmeans.py to obtain the anchor shapes, but still got 0 mAP.
Training gives the same result (0 mAP) after 10, 50, and 120 epochs.
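For readers unfamiliar with how anchor shapes are usually derived, here is a minimal sketch of k-means clustering over box widths and heights with 1 - IoU as the distance. This is not the kmeans.py script mentioned above, only an illustration of the general technique. The function names and all sample boxes except the first two (which come from the label lines quoted earlier) are hypothetical.

```python
import random


def wh_iou(box, anchor):
    """IoU of two (width, height) boxes aligned at a shared corner."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iterations=100, seed=42):
    """Cluster (width, height) pairs into k anchors, sorted by area."""
    rng = random.Random(seed)
    anchors = rng.sample(boxes, k)  # start from k boxes picked at random
    for _ in range(iterations):
        # Assign every box to the anchor it overlaps most (smallest 1 - IoU).
        clusters = [[] for _ in range(k)]
        for box in boxes:
            best = max(range(k), key=lambda i: wh_iou(box, anchors[i]))
            clusters[best].append(box)
        # Move each anchor to the mean width and height of its members.
        new_anchors = []
        for i, members in enumerate(clusters):
            if not members:
                new_anchors.append(anchors[i])  # keep an anchor that attracted no boxes
                continue
            mean_w = sum(w for w, _ in members) / len(members)
            mean_h = sum(h for _, h in members) / len(members)
            new_anchors.append((mean_w, mean_h))
        if new_anchors == anchors:
            break  # converged
        anchors = new_anchors
    return sorted(anchors, key=lambda a: a[0] * a[1])


# First two sizes come from the sample labels; the rest are made-up examples.
example_boxes = [(81.0, 48.2), (10.6, 21.7), (12.0, 25.0), (75.0, 40.0),
                 (130.0, 100.0), (120.0, 95.0), (40.0, 30.0), (45.0, 35.0)]
for w, h in kmeans_anchors(example_boxes, k=3):
    print(f"({w:.2f}, {h:.2f})")
```

A real run would use all box sizes from the training labels and k=9, then split the sorted result into the small, mid, and big anchor groups.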

Is there anything wrong with my config file?

cheers!

I tried adding the pretrained ResNet-18 model, and it gives this error:


File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1770, in __init__
control_input_ops)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1610, in _create_c_op
raise ValueError(str(e))
ValueError: Dimension 0 in both shapes must be equal, but are 1024 and 512. Shapes are [1024,176] and [512,176]. for 'Assign_557' (op: 'Assign') with input shapes: [1024,176], [512,176].


However, my dataset images are 480*288, and both dimensions are multiples of 32.
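The divisibility claim is easy to verify. The sketch below checks 480*288 against the usual YOLOv3 output strides of 32, 16, and 8 (an assumption about this particular backbone) and prints the resulting grid sizes. Note that the ValueError above compares shapes [1024,176] and [512,176] in an Assign op, which look like weight shapes rather than image dimensions, so passing this check does not by itself explain that error.

```python
width, height = 480, 288

# Assumed output strides for the three YOLOv3 detection scales.
strides = (32, 16, 8)

grids = {}
for stride in strides:
    # Both dimensions must divide evenly for the feature map to tile the image.
    if width % stride or height % stride:
        raise ValueError(f"{width}x{height} is not divisible by {stride}")
    grids[stride] = (width // stride, height // stride)
    print(f"stride {stride:>2}: grid {grids[stride][0]} x {grids[stride][1]}")
```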

This is the result of training SSD on a similar dataset. It does produce a non-zero mAP.

Epoch 00005: saving model to /workspace/trained_model_ssd/weights/ssd_resnet18_epoch_005.tlt
Number of images in the evaluation dataset: 259

Producing predictions: 100%|████████████████████| 33/33 [00:05<00:00, 5.94it/s]
Start multi-thread per-image matching
Start to calculate AP for each class


bicycle AP 0.0
bus AP 0.683
car AP 0.796
motorcycle AP 0.019
person AP 0.657
truck AP 0.07
mAP 0.371


For your yolo_v3 training, can you check whether the loss is decreasing?

Yes. It decreases after each epoch.

Could you please add the lines below to your spec and retry?

crop_right: 480
crop_bottom: 288

I added them to the config file. The loss decreased, but training still produced 0 mAP.

Could you run some experiments to narrow down the issue?

  1. Train with only one class, for example, person.
  2. Try a different batch size.
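One way to set up the single-class experiment is to write a filtered copy of the KITTI label files that keeps only the wanted class. The sketch below is one possible approach, not a TLT utility. The function names are illustrative, and the destination directory in the commented usage line is a hypothetical path.

```python
from pathlib import Path


def keep_classes(label_text, wanted):
    """Return label text containing only the lines whose class name is in `wanted`."""
    kept = []
    for line in label_text.splitlines():
        fields = line.split()
        if fields and fields[0] in wanted:
            kept.append(line.strip())
    return "\n".join(kept) + ("\n" if kept else "")


def filter_label_dir(src_dir, dst_dir, wanted):
    """Write a filtered copy of every .txt label file from src_dir into dst_dir."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for label_file in sorted(src.glob("*.txt")):
        filtered = keep_classes(label_file.read_text(), wanted)
        (dst / label_file.name).write_text(filtered)


# Example call (destination path is hypothetical):
# filter_label_dir("/workspace/dataset/training/label_2",
#                  "/workspace/dataset/training_person/label_2",
#                  {"person"})
```

After filtering, the tfrecords would presumably need to be regenerated and the target_class_mapping entries reduced to person only, so that the spec matches the new labels.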

I tried both.
Training still produced 0 mAP.

There has been no update from you for a while, so we assume this is no longer an issue.
We are therefore closing this topic. If you need further support, please open a new one.
Thanks

Could you share your latest training spec and training log?
If possible, could you also share a small part of the training data?