Unable to retrain pre-trained SSD Mobilenet v1

I am trying to retrain the tlt_pretrained_object_detection:mobilenet_v1 model with my own KITTI-formatted dataset (1 class, “person”), as per the instructions in the Getting Started Guide and this blog post. I am using the tlt-streamanalytics:v2.0_dp_py2 Docker image for this.

First, I convert the KITTI dataset into TFRecords with the following command:
tlt-dataset-convert -d convert.spec -o ./tfrecords/converted.tfrecord
and this convert.spec file:

kitti_config {
  root_directory_path: "[REPLACE_WITH_DATASET_DIR]"
  image_dir_name: "images"
  label_dir_name: "labels"
  image_extension: ".png"
  partition_mode: "random"
  num_partitions: 2
  val_split: 20
  num_shards: 10
}
image_directory_path: "[REPLACE_WITH_DATASET_DIR]"

In my actual spec file, [REPLACE_WITH_DATASET_DIR] is replaced with the real path to my dataset directory.
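For context, my dataset is laid out the way the converter expects, roughly like this:

[REPLACE_WITH_DATASET_DIR]/
  images/
    000001.png
    ...
  labels/
    000001.txt
    ...

and each label file contains standard 15-field KITTI lines, for example (the values here are purely illustrative, not taken from my data):

person 0.00 0 0.00 412.00 157.00 465.00 295.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00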
The output of that command is:

Using TensorFlow backend.
2020-08-05 20:19:08,221 - iva.detectnet_v2.dataio.build_converter - INFO - Instantiating a kitti converter
2020-08-05 20:19:08,243 - iva.detectnet_v2.dataio.kitti_converter_lib - INFO - Num images in
Train: 7955     Val: 1988
2020-08-05 20:19:08,243 - iva.detectnet_v2.dataio.kitti_converter_lib - INFO - Validation data in partition 0. Hence, while choosing the validationset during training choose validation_fold 0.
2020-08-05 20:19:08,245 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 0
/usr/local/lib/python2.7/dist-packages/iva/detectnet_v2/dataio/kitti_converter_lib.py:266: VisibleDeprecationWarning: Reading unicode strings without specifying the encoding argument is deprecated. Set the encoding, use None for the system default.
2020-08-05 20:19:08,402 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 1
2020-08-05 20:19:08,552 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 2
2020-08-05 20:19:08,702 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 3
2020-08-05 20:19:08,853 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 4
2020-08-05 20:19:09,003 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 5
2020-08-05 20:19:09,153 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 6
2020-08-05 20:19:09,303 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 7
2020-08-05 20:19:09,454 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 8
2020-08-05 20:19:09,604 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 9
2020-08-05 20:19:09,760 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO -
Wrote the following numbers of objects:
person: 1988

2020-08-05 20:19:09,760 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 0
2020-08-05 20:19:10,363 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 1
2020-08-05 20:19:10,965 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 2
2020-08-05 20:19:11,567 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 3
2020-08-05 20:19:12,170 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 4
2020-08-05 20:19:12,772 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 5
2020-08-05 20:19:13,375 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 6
2020-08-05 20:19:13,978 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 7
2020-08-05 20:19:14,581 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 8
2020-08-05 20:19:15,184 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 9
2020-08-05 20:19:15,791 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO -
Wrote the following numbers of objects:
person: 7955

2020-08-05 20:19:15,792 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Cumulative object statistics
2020-08-05 20:19:15,792 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO -
Wrote the following numbers of objects:
person: 9943

2020-08-05 20:19:15,792 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Class map.
Label in GT: Label in tfrecords file
person: person
For the dataset_config in the experiment_spec, please use labels in the tfrecords file, while writing the classmap.

2020-08-05 20:19:15,792 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Tfrecords generation complete.

After that, I download the model with this command:
ngc registry model download-version nvidia/tlt_pretrained_object_detection:mobilenet_v1 -d ./pretrained_model
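This downloads the model into a versioned subdirectory, so the layout afterwards is roughly:

./pretrained_model/
  tlt_pretrained_object_detection_vmobilenet_v1/
    (pretrained weights file)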

Then, I try to train with this command:
tlt-train ssd -e train.spec -r ./pretrained_model --gpus 1 -k $NGC_API_KEY
and this train.spec file:

training_config {
  batch_size_per_gpu: 32
  num_epochs: 120
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 5e-6
      max_learning_rate: 5e-4
      soft_start: 0.1
      annealing: 0.7
    }
  }
  regularizer {
    type: L1
    weight: 3e-9
  }
}
eval_config {
  validation_period_during_training: 10
  average_precision_mode: SAMPLE
  matching_iou_threshold: 0.5
}
nms_config {
  confidence_threshold: 0.01
  top_k: 200
}
augmentation_config {
  preprocessing {
    output_image_width: 224
    output_image_height: 224
    output_image_channel: 3
    min_bbox_width: 1.0
    min_bbox_height: 1.0
  }
  spatial_augmentation {
    hflip_probability: 0.5
    vflip_probability: 0.0
    zoom_min: 1.0
    zoom_max: 1.0
    translate_max_x: 8.0
    translate_max_y: 8.0
  }
  color_augmentation {
    color_shift_stddev: 0.0
    hue_rotation_max: 25.0
    saturation_shift_max: 0.2
    contrast_scale_max: 0.1
    contrast_center: 0.5
  }
}
dataset_config {
  data_sources: {
    tfrecords_path: "[REPLACE_WITH_DATASET_DIR]/tfrecords/*"
    image_directory_path: "[REPLACE_WITH_DATASET_DIR]/images"
  }
  image_extension: "png"
  target_class_mapping {
      key: "person"
      value: "person"
  }
  validation_fold: 0
}
ssd_config {
  aspect_ratios_global: "[1.0, 2.0, 0.5, 3.0, 0.33]"
  aspect_ratios: "[[1.0,2.0,0.5], [1.0,2.0,0.5], [1.0,2.0,0.5], [1.0,2.0,0.5], [1.0,2.0,0.5], [1.0, 2.0, 0.5, 3.0, 0.33]]"
  two_boxes_for_ar1: true
  clip_boxes: false
  scales: "[0.05, 0.1, 0.25, 0.4, 0.55, 0.7, 0.85]"
  loss_loc_weight: 1.0
  focal_loss_alpha: 0.25
  focal_loss_gamma: 2.0
  variances: "[0.1, 0.1, 0.2, 0.2]"
  arch: "mobilenet_v1"
  freeze_bn: false
}

As before, [REPLACE_WITH_DATASET_DIR] is replaced with the real path to my dataset directory in my actual spec file.
The output of this command is:

Using TensorFlow backend.
--------------------------------------------------------------------------
[[14866,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: 2b629d1e986b

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
2020-08-05 20:27:49,098 [INFO] /usr/local/lib/python2.7/dist-packages/iva/ssd/utils/spec_loader.pyc: Merging specification from train.spec
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
(the above warning is repeated 32 times)
Traceback (most recent call last):
  File "/usr/local/bin/tlt-train-g1", line 8, in <module>
    sys.exit(main())
  File "./common/magnet_train.py", line 37, in main
  File "./ssd/scripts/train.py", line 245, in main
  File "./ssd/scripts/train.py", line 96, in run_experiment
  File "./ssd/builders/inputs_builder.py", line 51, in __init__
  File "./detectnet_v2/dataloader/default_dataloader.py", line 206, in get_dataset_tensors
  File "./detectnet_v2/dataloader/default_dataloader.py", line 232, in _generate_images_and_ground_truth_labels
  File "./modulus/processors/processors.py", line 227, in __call__
  File "./detectnet_v2/dataloader/utilities.py", line 60, in call
  File "./modulus/processors/tfrecords_iterator.py", line 143, in process_records
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/array_ops.py", line 1508, in split
    axis=axis, num_split=num_or_size_splits, value=value, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 8883, in split
    "Split", split_dim=axis, value=value, num_split=num_split, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 709, in _apply_op_helper
    (key, op_type_name, attr_value.i, attr_def.minimum))
ValueError: Attr 'num_split' of 'Split' Op passed 0 less than minimum 1.

I can’t tell what’s going on behind the scenes in the TLT code, so I don’t know what this error means or how to debug it. How can I fix this so that I can retrain the model on my own dataset?

Thank you in advance for your help.

Your command seems to be missing the pretrained model. Could you double-check?

tlt-train ssd -e train.spec -r ./pretrained_model --gpus 1 -k $NGC_API_KEY

Also, please try adding the line below in eval_config:

batch_size: 16
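so that the eval_config block in train.spec becomes, for example:

eval_config {
  validation_period_during_training: 10
  average_precision_mode: SAMPLE
  batch_size: 16
  matching_iou_threshold: 0.5
}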


Perfect, that fixed it! I added batch_size: 16 to eval_config, and that error disappeared.

I also did not realise I had to pass -r ./pretrained_model/tlt_pretrained_object_detection_vmobilenet_v1 instead of -r ./pretrained_model.

With both of those changes, training now runs successfully. Thank you very much for your help!
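For reference, in case anyone else runs into this: with batch_size: 16 added to eval_config, the command that ended up working for me was essentially

tlt-train ssd -e train.spec -r ./pretrained_model/tlt_pretrained_object_detection_vmobilenet_v1 --gpus 1 -k $NGC_API_KEY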